Qumin: Quantitative modelling of inflection¶
Qumin (QUantitative Modelling of INflection) is a collection of scripts for the computational modelling of the inflectional morphology of languages. It was developed by me (Sacha Beniamine) for my PhD, which was supervised by Olivier Bonami.
The documentation has moved to ReadTheDocs at: https://qumin.readthedocs.io/
For more detail, you can refer to my dissertation (in French).
Quick Start¶
Install¶
First, open the terminal and navigate to the folder where you want the Qumin code. Clone the repository from github:
git clone https://github.com/XachaB/Qumin.git
Make sure all the Python dependencies are installed. The dependencies are listed in environment.yml. A simple solution is to use conda and create a new environment from the environment.yml file:
conda env create -f environment.yml
There is now a new conda environment named Qumin. It needs to be activated before using any Qumin script:
conda activate Qumin
Data¶
The scripts expect full paradigm data in phonemic transcription, as well as a feature key for the transcription.
To provide a data sample in the correct format, Qumin includes a subset of the French Flexique lexicon, distributed under a Creative Commons Attribution-NonCommercial-ShareAlike license.
For Russian nouns, see the Inflected lexicon of Russian Nouns in IPA notation.
Scripts¶
Patterns¶
Alternation patterns serve as a basis for all the other scripts. The algorithm to find the patterns was presented in: Sacha Beniamine. "Un algorithme universel pour l'abstraction automatique d'alternances morphophonologiques." 24e Conférence sur le Traitement Automatique des Langues Naturelles (TALN), Orléans, France, June 2017.
Computing automatically aligned patterns for paradigm entropy or macroclass:
bin/$ python3 find_patterns.py <paradigm.csv> <segments.csv>
Computing automatically aligned patterns for lattices:
bin/$ python3 find_patterns.py -d -o <paradigm.csv> <segments.csv>
Microclasses¶
To visualize the microclasses and their similarities, you can use the new script microclass_heatmap.py:
Computing a microclass heatmap:
bin/$ python3 microclass_heatmap.py <paradigm.csv> <output_path>
Computing a microclass heatmap, comparing with class labels:
bin/$ python3 microclass_heatmap.py -l <labels.csv> -- <paradigm.csv> <output_path>
The labels file is a csv file. The first column gives lexeme names, the second column provides inflection class labels. This allows you to compare a manual classification visually with pattern-based similarity. This script relies heavily on seaborn's clustermap function.
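A minimal sketch generating such a labels file with pandas (the lexeme names and class labels below are hypothetical, and whether a header row is expected should be checked against the script's --help):

import pandas as pd

# First column: lexeme names; second column: inflection class labels.
labels = pd.DataFrame({"lexeme": ["peler", "souscrire"],
                       "label": ["group1", "group3"]})
labels.to_csv("labels.csv", index=False)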
Paradigm entropy¶
This script was used in:
- Bonami, Olivier, and S. Beniamine. “Joint predictiveness in inflectional paradigms.” Word Structure 9, no. 2 (2016): 156-182. Some improvements have been implemented since then.
Computing entropies from one cell
bin/$ python3 calc_paradigm_entropy.py -n 1 -- <patterns.csv> <paradigm.csv> <segments.csv>
Computing entropies from two cells (you can specify any number of predictors, e.g. -n 1 2 3 works too)
bin/$ python3 calc_paradigm_entropy.py -n 2 -- <patterns.csv> <paradigm.csv> <segments.csv>
Add a file with features to help prediction (for example gender – features will be added to the known information when predicting)
bin/$ python3 calc_paradigm_entropy.py -n 2 --features <features.csv> -- <patterns.csv> <paradigm.csv> <segments.csv>
Macroclass inference¶
Our work on the automatic inference of macroclasses was published in: Beniamine, Sacha, Olivier Bonami, and Benoît Sagot. "Inferring Inflection Classes with Description Length." Journal of Language Modelling (2018).
Inferring macroclasses
bin/$ python3 find_macroclasses.py <patterns.csv> <segments.csv>
Lattices¶
This script was used in:
- Beniamine, Sacha (in press). "One lexeme, many classes: inflection class systems as lattices". In: One-to-Many Relations in Morphology, Syntax and Semantics, ed. by Berthold Crysmann and Manfred Sailer. Berlin: Language Science Press.
Inferring a lattice of inflection classes, with html output
bin/$ python3 make_lattice.py --html <patterns.csv> <segments.csv>
Documentation index¶
The morphological paradigms file¶
This file relates phonological forms to their lexemes and paradigm cells. As an example of valid data, Qumin is shipped with a paradigm table from the French inflectional lexicon Flexique. Here is a sample of the first 10 columns for 10 randomly picked verbs from Flexique:
lexeme | variants | prs.1sg | prs.2sg | prs.3sg | prs.1pl | prs.2pl | prs.3pl | ipfv.1sg | ipfv.2sg | ipfv.3sg
---|---|---|---|---|---|---|---|---|---|---
peler | peler | pɛl | pɛl | pɛl | pəlɔ̃ | pəle | pɛl | pəlE | pəlE | pəlE
soudoyer | soudoyer | sudwa | sudwa | sudwa | sudwajɔ̃ | sudwaje | sudwa | sudwajE | sudwajE | sudwajE
inféoder | inféoder | ɛ̃fEɔd | ɛ̃fEɔd | ɛ̃fEɔd | ɛ̃fEOdɔ̃ | ɛ̃fEOde | ɛ̃fEɔd | ɛ̃fEOdE | ɛ̃fEOdE | ɛ̃fEOdE
débiller | débiller | dEbij | dEbij | dEbij | dEbijɔ̃ | dEbije | dEbij | dEbijE | dEbijE | dEbijE
désigner | désigner | dEziɲ | dEziɲ | dEziɲ | dEziɲɔ̃ | dEziɲe | dEziɲ | dEziɲE | dEziɲE | dEziɲE
crachoter | crachoter | kʁaʃɔt | kʁaʃɔt | kʁaʃɔt | kʁaʃOtɔ̃ | kʁaʃOte | kʁaʃɔt | kʁaʃOtE | kʁaʃOtE | kʁaʃOtE
saouler | saouler:soûler | sul | sul | sul | sulɔ̃ | sule | sul | sulE | sulE | sulE
caserner | caserner | kazɛʁn | kazɛʁn | kazɛʁn | kazɛʁnɔ̃ | kazɛʁne | kazɛʁn | kazɛʁnE | kazɛʁnE | kazɛʁnE
parrainer | parrainer | paʁɛn | paʁɛn | paʁɛn | paʁEnɔ̃ | paʁEne | paʁɛn | paʁEnE | paʁEnE | paʁEnE
souscrire | souscrire | suskʁi | suskʁi | suskʁi | suskʁivɔ̃ | suskʁive | suskʁiv | suskʁivE | suskʁivE | suskʁivE
Paradigm files are written in wide format:
- each row represents a lexeme, and each column represents a cell.
- The first column indicates a unique identifier for each lexeme. It is usually convenient to use orthographic citation forms for this purpose (e.g. infinitive for verbs).
- In Vlexique, there is a second column with orthographic variants for lexeme names, which is called "variants". You do not need to add a "variants" column, and if it is present, it will be ignored.
- The very first row gives the names of the cells as column headers. Column headers shouldn't contain the character "#".
While Qumin assumes that inflected forms are written in some phonemic notation (we suggest staying as close to the IPA as possible), you do not need to explicitly segment them into phonemes in the paradigms file.
The file itself is a csv, meaning that the values are written as plain text, in utf-8 encoding, separated by commas. This format can be read by spreadsheet programs as well as programmatically:
%%sh
head -n 3 "../Data/Vlexique/vlexique-20171031.csv"
lexeme,variants,prs.1sg,prs.2sg,prs.3sg,prs.1pl,prs.2pl,prs.3pl,ipfv.1sg,ipfv.2sg,ipfv.3sg,ipfv.1pl,ipfv.2pl,ipfv.3pl,fut.1sg,fut.2sg,fut.3sg,fut.1pl,fut.2pl,fut.3pl,cond.1sg,cond.2sg,cond.3sg,cond.1pl,cond.2pl,cond.3pl,sbjv.1sg,sbjv.2sg,sbjv.3sg,sbjv.1pl,sbjv.2pl,sbjv.3pl,pst.1sg,pst.2sg,pst.3sg,pst.1pl,pst.2pl,pst.3pl,pst.sbjv.1sg,pst.sbjv.2sg,pst.sbjv.3sg,pst.sbjv.1pl,pst.sbjv.2pl,pst.sbjv.3pl,imp.2sg,imp.1pl,imp.2pl,inf,prs.ptcp,pst.ptcp.m.sg,pst.ptcp.m.pl,pst.ptcp.f.sg,pst.ptcp.f.pl
accroire,accroire,#DEF#,#DEF#,#DEF#,#DEF#,#DEF#,#DEF#,#DEF#,#DEF#,#DEF#,#DEF#,#DEF#,#DEF#,#DEF#,#DEF#,#DEF#,#DEF#,#DEF#,#DEF#,#DEF#,#DEF#,#DEF#,#DEF#,#DEF#,#DEF#,#DEF#,#DEF#,#DEF#,#DEF#,#DEF#,#DEF#,#DEF#,#DEF#,#DEF#,#DEF#,#DEF#,#DEF#,#DEF#,#DEF#,#DEF#,#DEF#,#DEF#,#DEF#,#DEF#,#DEF#,#DEF#,akʁwaʁ,#DEF#,#DEF#,#DEF#,#DEF#,#DEF#
advenir,advenir,#DEF#,#DEF#,advjɛ̃,#DEF#,#DEF#,advjɛn,#DEF#,#DEF#,advənE,#DEF#,#DEF#,advənE,#DEF#,#DEF#,advjɛ̃dʁa,#DEF#,#DEF#,advjɛ̃dʁɔ̃,#DEF#,#DEF#,advjɛ̃dʁE,#DEF#,#DEF#,advjɛ̃dʁE,#DEF#,#DEF#,advjɛn,#DEF#,#DEF#,advjɛn,#DEF#,#DEF#,advɛ̃,#DEF#,#DEF#,advɛ̃ʁ,#DEF#,#DEF#,advɛ̃,#DEF#,#DEF#,advɛ̃s,#DEF#,#DEF#,#DEF#,advəniʁ,advənɑ̃,advəny,advəny,advəny,advəny
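For a quick programmatic look at such a file, it can be loaded with pandas (a minimal sketch, not part of Qumin itself; the file name is the one shown above):

import pandas as pd

# Read every cell as a string, and treat "#DEF#" (defective cells, see below)
# as a missing value.
paradigms = pd.read_csv("vlexique-20171031.csv", dtype=str, na_values=["#DEF#"])

print(paradigms.shape)             # (number of lexemes, number of columns)
print(paradigms["prs.3sg"].head()) # forms for one paradigm cell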
Overabundance¶
Inflectional paradigms sometimes contain overabundant forms, where the same lexeme and paradigm cell can be realized in several ways, as in "dreamed" vs. "dreamt" for the English past of "to dream". Concurrent forms can be written in the same cell, separated by ";". Only some scripts can make use of this information; the other scripts will use the first value only. Here is an example for English verbs:
lexeme | ppart | pres3s | prespart | inf | pres1s | presothers | past13 | pastnot13
---|---|---|---|---|---|---|---|---
bind | baˑɪnd;baˑʊnd | baˑɪndz | baˑɪndɪŋ | baˑɪnd | baˑɪnd | baˑɪnd | baˑɪnd;baˑʊnd | baˑɪnd;baˑʊnd
wind(air) | waˑʊnd;waˑɪndɪd | waˑɪndz | waˑɪndɪŋ | waˑɪnd | waˑɪnd | waˑɪnd | waˑʊnd;waˑɪndɪd | waˑʊnd;waˑɪndɪd
weave | wəˑʊvn̩;wiːvd | wiːvz | wiːvɪŋ | wiːv | wiːv | wiːv | wəˑʊv;wiːvd | wəˑʊv;wiːvd
slink | slʌŋk;slæŋk;slɪŋkt | slɪŋks | slɪŋkɪŋ | slɪŋk | slɪŋk | slɪŋk | slʌŋk;slæŋk;slɪŋkt | slʌŋk;slæŋk;slɪŋkt
dream | driːmd | driːmz | driːmɪŋ | driːm | driːm | driːm | driːmd;drɛmt | driːmd;drɛmt
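Programmatically, an overabundant cell is simply a string to split on ";" (an illustrative sketch, not Qumin's internal representation):

# An overabundant cell from the table above:
cell = "driːmd;drɛmt"

variants = cell.split(";")   # all concurrent forms
print(variants)              # ['driːmd', 'drɛmt']

first_only = variants[0]     # what scripts ignoring overabundance would keep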
Defectivity¶
Conversely, some lexemes may be defective for some cells, having no value whatsoever for these cells. The most explicit way to indicate these missing values is to write "#DEF#" in the cell. The cell can also be left empty. Note that some scripts ignore all lines with defective values.
Here are some examples from French verbs:
lexeme | prs.1sg | prs.2sg | prs.3sg | prs.1pl | prs.2pl | prs.3pl | ipfv.1sg | ipfv.2sg
---|---|---|---|---|---|---|---|---
accroire | #DEF# | #DEF# | #DEF# | #DEF# | #DEF# | #DEF# | #DEF# | #DEF#
advenir | #DEF# | #DEF# | advjɛ̃ | #DEF# | #DEF# | advjɛn | #DEF# | #DEF#
ardre | #DEF# | #DEF# | #DEF# | #DEF# | #DEF# | #DEF# | aʁdE | aʁdE
braire | #DEF# | #DEF# | bʁE | #DEF# | #DEF# | bʁE | #DEF# | #DEF#
chaloir | #DEF# | #DEF# | ʃo | #DEF# | #DEF# | #DEF# | #DEF# | #DEF#
comparoir | #DEF# | #DEF# | #DEF# | #DEF# | #DEF# | #DEF# | #DEF# | #DEF#
discontinuer | #DEF# | #DEF# | #DEF# | #DEF# | #DEF# | #DEF# | #DEF# | #DEF#
douer | #DEF# | #DEF# | #DEF# | #DEF# | #DEF# | #DEF# | #DEF# | #DEF#
échoir | #DEF# | #DEF# | eʃwa | #DEF# | #DEF# | #DEF# | #DEF# | #DEF#
endêver | #DEF# | #DEF# | #DEF# | #DEF# | #DEF# | #DEF# | #DEF# | #DEF#
The phonological segments file¶
Qumin works from the assumption that your paradigms are written in phonemic notation. The phonological segments file provides a list of phonemes and their decomposition into distinctive features. This file is first used to segment the paradigms into sequences of phonemes (rather than sequences of characters). Then, the distinctive features are used to recognize phonological similarity and natural classes when creating and handling alternation patterns.
To create a new segments file, the best approach is usually to refer to an authoritative description and adapt it to the needs of the specific dataset. In the absence of such a description, I suggest using Bruce Hayes' spreadsheet as a starting point (he writes +, - and 0 for our 1, 0 and -1).
Format¶
The segments file is also written in wide format, with each row describing a phoneme. The first column gives phonemes as they are written in the paradigms file. Each column represents a distinctive feature. Here is an example with just 10 rows of the segments table for French verbs:
Seg. | sonant | syllabique | consonantique | continu | nasal | haut | bas | arrière | arrondi | antérieur | CORONAL | voisé | rel.ret.
---|---|---|---|---|---|---|---|---|---|---|---|---|---
p | 0 | 0 | 1 | 0 | 0 | 0 | | 0 | | 1 | | 0 | 0
b | 0 | 0 | 1 | 0 | 0 | 0 | | 0 | | 1 | | 1 | 0
t | 0 | 0 | 1 | 0 | 0 | 0 | | 0 | | 1 | 1 | 0 | 0
s | 0 | 0 | 1 | 1 | 0 | 0 | | 0 | | 1 | 1 | 0 | 1
i | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | | | 1 | 1
y | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | | | 1 | 1
u | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | | | 1 | 1
o | 1 | 1 | 0 | 1 | 0 | 0 | | 1 | 1 | | | 1 | 1
a | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | | | 1 | 1
ɑ̃ | 1 | 1 | 0 | 1 | 1 | 0 | 1 | 1 | 0 | | | 1 | 1
Some conventions:
- The first column must be called "Seg.".
- The phonological symbols in the "Seg." column cannot be one of the reserved characters: . ^ $ * + ? { } [ ] / | ( ) < > _ ⇌ , ;
- If the file contains a "value" column, it will be ignored. This is used to provide a human-readable description of segments, which can be useful when preparing the data.
- In order to provide short names for the features, as in [+nas] rather than [+nasal], you can add a second level of header, also beginning with "Seg.", which gives the abbreviated names:
Seg. | sonant | syllabique | consonantique | continu | nasal | haut | bas | arrière | arrondi | antérieur | CORONAL | voisé | rel.ret.
---|---|---|---|---|---|---|---|---|---|---|---|---|---
Seg. | son | syl | cons | cont | nas | haut | bas | arr | rond | ant | COR | vois | rel.ret.
p | 0 | 0 | 1 | 0 | 0 | 0 | | 0 | | 1 | | 0 | 0
b | 0 | 0 | 1 | 0 | 0 | 0 | | 0 | | 1 | | 1 | 0
The file is encoded in utf-8 and can be either a csv table (preferred) or a tab-separated table (tsv).
%%sh
head -n 6 "../Data/Vlexique/frenchipa.csv"
Seg.,sonant,syllabique,consonantique,continu,nasal,haut,bas,arrière,arrondi,antérieur,CORONAL,voisé,rel.ret.
Seg.,son,syl,cons,cont,nas,haut,bas,arr,rond,ant,COR,vois,rel.ret.
p,0,0,1,0,0,0,,0,,1,,0,0
b,0,0,1,0,0,0,,0,,1,,1,0
t,0,0,1,0,0,0,,0,,1,1,0,0
d,0,0,1,0,0,0,,0,,1,1,1,0
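Such a double-header file can be inspected with pandas by reading both header rows as a MultiIndex (a sketch for data preparation, not part of Qumin):

import pandas as pd

# header=[0, 1] reads the full and abbreviated feature names together;
# the "Seg." column becomes the index.
segments = pd.read_csv("frenchipa.csv", header=[0, 1], index_col=0)

print(segments.columns[0])   # ('sonant', 'son')
print(segments.loc["p"])     # the feature values for /p/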
Segmentation and aliases¶
Since the forms in the paradigms are not segmented into phonemes, the phonological segments file is used to segment them.
It is possible to specify phonemes which are more than one character long, for example when using combining characters, or for diphthongs and affricates. Be careful to use the same notation as in your paradigms: for example, you cannot use "a" + combining tilde in one file and the precomposed "ã" in the other, as the program would not recognize them as the same thing. You should also make certain that there is no segmentation ambiguity. If you have sequences such as "ABC" which should be segmented "AB.C" in some contexts and "A.BC" in others, you need to change the notation in the paradigms file so that it is not ambiguous, for example by writing "A͡BC" in the first case and "AB͡C" in the second. You would then have separate rows for "A", "A͡B", "C" and "B͡C" in the segments file.
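To see why unambiguous notation matters, here is a sketch of a greedy longest-match segmentation (purely illustrative: Qumin's actual implementation works through alias substitution, described below):

def segment(form, inventory):
    """Split a form into phonemes, longest match first."""
    phonemes = sorted(inventory, key=len, reverse=True)
    result, i = [], 0
    while i < len(form):
        for p in phonemes:
            if form.startswith(p, i):
                result.append(p)
                i += len(p)
                break
        else:
            raise ValueError("Cannot segment %r at position %d" % (form, i))
    return result

print(segment("A͡BC", ["A", "B", "C", "A͡B", "B͡C"]))  # ['A͡B', 'C']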
Internally, the program uses arbitrary aliases which are one character long to replace longer phonemes; this substitution is reversed in the output. While this usually works without your intervention, you can provide your own aliases if you want to preserve some readability in debug logs. This is done by adding a column "ALIAS" right after the first column, which holds one-char aliases. This example shows a few rows from the segments file for Navajo:
Seg. | ALIAS | syllabic | htone | long | consonantal | sonorant | continuant | delayed release | …
---|---|---|---|---|---|---|---|---|---
ɣ | | 0 | 0 | | 1 | 0 | 1 | 1 | …
k | | 0 | 0 | | 1 | 0 | 0 | 0 | …
k’ | ḱ | 0 | 0 | | 1 | 0 | 0 | 0 | …
k͡x | K | 0 | 0 | | 1 | 0 | 0 | 1 | …
t | | 0 | 0 | | 1 | 0 | 0 | 0 | …
ť | | 0 | 0 | | 1 | 0 | 0 | 0 | …
t͡ɬ | L | 0 | 0 | | 1 | 0 | 0 | 1 | …
t͡ɬ’ | Ľ | 0 | 0 | | 1 | 0 | 0 | 1 | …
t͡ɬʰ | Ḷ | 0 | 0 | | 1 | 0 | 0 | 1 | …
ʦ | | 0 | 0 | | 1 | 0 | 0 | 1 | …
ʦ’ | Ś | 0 | 0 | | 1 | 0 | 0 | 1 | …
ʦʰ | Ṣ | 0 | 0 | | 1 | 0 | 0 | 1 | …
ʧ | H | 0 | 0 | | 1 | 0 | 0 | 1 | …
ʧ’ | Ḣ | 0 | 0 | | 1 | 0 | 0 | 1 | …
ʧʰ | Ḥ | 0 | 0 | | 1 | 0 | 0 | 1 | …
t͡x | T | 0 | 0 | | 1 | 0 | 0 | 1 | …
… | … | … | … | … | … | … | … | … | …
If you have many multi-character phonemes, you may get the following error:
ValueError: ('I can not guess a good one-char alias for ã, please use an ALIAS column to provide one.',
'occurred at index 41')
The solution is to add an alias for this character, and perhaps a few others. To find aliases which vaguely resemble the proper symbols, a table of unicode characters organized by letter is often useful.
Shorthands¶
When writing phonological rules, linguists often use shorthands like "V" for the natural class of all vowels and "C" for the natural class of all consonants. If you want, you can provide some extra rows in the table to define shorthand names for some natural classes. These names have to start and end with "#". Here is an example for the French segments file, giving shorthands for C (consonants), V (vowels) and G (glides):
Seg. | sonant | syllabique | consonantique | continu | nasal | haut | bas | arrière | arrondi | antérieur | CORONAL | voisé | rel.ret.
---|---|---|---|---|---|---|---|---|---|---|---|---|---
Seg. | son | syl | cons | cont | nas | haut | bas | arr | rond | ant | COR | vois | rel.ret.
#C# | | 0 | 1 | | | | | | | | | |
#V# | 1 | 1 | 0 | 1 | | | | | | | | 1 | 1
#G# | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | | | | 1 | 1
Values of distinctive features¶
Distinctive features are usually considered to be bivalent: they can be either positive ([+nasal]) or negative ([-nasal]). In the segments file, positive values are written with the number 1, and negative values with the number 0. Some features do not apply at all to some phonemes; for example, consonants are neither [+round] nor [-round]. This can be written either as -1, or by leaving the cell empty. While the first is more explicit, leaving the cell empty makes the tables more readable at a glance. The same strategy is used for privative features, such as [CORONAL]: since there is no class of segments which are [-coronal], we write either 1 or -1 in the corresponding column, never 0.
While 1, 0 and -1 (or nothing) are the values that make the most sense, any numeric value is technically allowed: for example, [-back], [+back] and [++back] could be expressed by writing 0, 1, and 2 in the "back" column. I do not recommend doing this.
When writing a segments file, it is important to pay attention to the naturalness of the resulting classes, as Qumin will take them at face value. For example, using the same [±high] feature for both vowels and consonants will result in a natural class of all the [+high] segments, and one of all the [-high] segments. Sometimes, it is better to duplicate some columns to avoid generating unfounded classes.
Monovalent or bivalent features¶
Frisch (1996) argues that monovalent features (using only -1 and 1) are to be preferred to bivalent features, as the latter implicitly generate natural classes for the complement features ([-coronal]), which is not always desirable. In Qumin, both monovalent and bivalent features are accepted. Internally, the program will expand all 1 and 0 values into + and - specifications. As an example, take this table, which classifies the three vowels /a/, /i/ and /u/:
Seg. | high | low | front | back | round | Non-round
---|---|---|---|---|---|---
Seg. | high | low | front | back | round | Non-round
a | | 1 | | 1 | | 1
i | 1 | | 1 | | | 1
u | 1 | | | 1 | 1 |
Internally, Qumin will construct the following table, which looks almost identical because we used monovalent features:

Seg. | +high | +low | +front | +back | +round | +Non-round
---|---|---|---|---|---|---
a | | x | | x | | x
i | x | | x | | | x
u | x | | | x | x |
This will then result in the following natural class hierarchy:

To visualize natural class hierarchies declared by segment files, you can use FeatureViz.
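The expansion from numeric values to signed feature specifications can be sketched in a few lines (an illustration of the convention described above, not Qumin's actual code):

import pandas as pd

# The monovalent table for /a/, /i/ and /u/; None marks empty cells.
table = pd.DataFrame(
    {"high": [None, 1, 1], "low": [1, None, None],
     "front": [None, 1, None], "back": [1, None, 1],
     "round": [None, None, 1], "Non-round": [1, 1, None]},
    index=["a", "i", "u"])

def expand(row):
    """1 becomes +feature, 0 becomes -feature, empty cells drop out."""
    feats = set()
    for feature, value in row.items():
        if value == 1:
            feats.add("+" + feature)
        elif value == 0:
            feats.add("-" + feature)
    return feats

specs = {seg: expand(row) for seg, row in table.iterrows()}
# Any shared specification defines a natural class:
print(specs["a"] & specs["u"])   # {'+back'}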
The same thing can be achieved with fewer columns using binary features:
Seg. | high | front | round |
Seg. | high | front | round |
a | 0 | 0 | 0 |
i | 1 | 1 | 0 |
u | 1 | 0 | 1 |
Internally, these will be expanded to:
Seg. | +high | -high | +front | -front | +round | -round
---|---|---|---|---|---|---
a | | x | | x | | x
i | x | | x | | | x
u | x | | | x | x |
This is the same thing as previously, with different names. The class hierarchy is also very similar:

Warning, some of the segments aren’t actual leaves¶
The following error occurs when the table is well formed, but specifies a natural class hierarchy which is not usable by Qumin:
Exception: Warning, some of the segments aren't actual leaves :
p is the same node as [p-kʷ]
[p-kʷ] ([pĸ]) = [+cons -son -syll +lab -round -voice -cg -cont -strid -lat -del.rel -nas -long]
kʷ (ĸ) = [+cons -son -syll +lab -round +dor +highC -lowC +back -tense -voice -cg -cont -strid -lat -del.rel -nas -long]
k is the same node as [k-kʷ]
[k-kʷ] ([kĸ]) = [+cons -son -syll +dor +highC -lowC +back -tense -voice -cg -cont -strid -lat -del.rel -nas -long]
kʷ (ĸ) = [+cons -son -syll +lab -round +dor +highC -lowC +back -tense -voice -cg -cont -strid -lat -del.rel -nas -long]
What happened here is that the natural class [p-kʷ] has the exact same definition as just /p/. Similarly, the natural class [k-kʷ] has the same definition as /k/. The result is the following structure, in which /p/ and /k/ are superclasses of /kʷ/:

In this structure, it is impossible to distinguish the natural classes [p-kʷ] and [k-kʷ] from the respective phonemes /p/ and /k/. Instead, we want them to be one level lower. If we ignore the bottom node, this means that they should be leaves of the hierarchy.
The solution is to ensure that both /p/ and /k/ have at least one feature diverging from /kʷ/. Usually, kʷ is marked as [+round], but in the above it is mistakenly written [-round]. Correcting this definition yields the following structure, and solves the error:

Neutralizations¶
While having a segment be higher than another in the hierarchy is forbidden, it is possible to declare two segments with the exact same features. This is useful if you want to neutralize some oppositions, and ignore some details in the data.
For example, this set of French vowels displays height oppositions using the [±low] feature:
Seg. | sonant | syllabique | consonantique | continu | nasal | haut | bas | arrière | arrondi | antérieur | coronal | voisé | rel.ret. |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Seg. | son | syl | cons | cont | nas | haut | bas | arr | rond | ant | cor | vois | rel.ret. |
e | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | -1 | -1 | 1 | 1 |
ɛ | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | -1 | -1 | 1 | 1 |
ø | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | -1 | -1 | 1 | 1 |
œ | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | -1 | -1 | 1 | 1 |
o | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | -1 | -1 | 1 | 1 |
ɔ | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | -1 | -1 | 1 | 1 |
Leading to this complex hierarchy:

Due to regional variation, the French Vlexique sometimes neutralizes these oppositions, writing E, Ø and O to underspecify the height of the vowels. The solution is to neutralize the [±low] distinction entirely for these vowels, writing repeated rows for E, e, ɛ, etc.:
Seg. | sonant | syllabique | consonantique | continu | nasal | haut | bas | arrière | arrondi | antérieur | coronal | voisé | rel.ret. |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Seg. | son | syl | cons | cont | nas | haut | bas | arr | rond | ant | cor | vois | rel.ret. |
E | 1 | 1 | 0 | 1 | 0 | 0 | -1 | 0 | 0 | -1 | -1 | 1 | 1 |
e | 1 | 1 | 0 | 1 | 0 | 0 | -1 | 0 | 0 | -1 | -1 | 1 | 1 |
ɛ | 1 | 1 | 0 | 1 | 0 | 0 | -1 | 0 | 0 | -1 | -1 | 1 | 1 |
Ø | 1 | 1 | 0 | 1 | 0 | 0 | -1 | 0 | 1 | -1 | -1 | 1 | 1 |
ø | 1 | 1 | 0 | 1 | 0 | 0 | -1 | 0 | 1 | -1 | -1 | 1 | 1 |
œ | 1 | 1 | 0 | 1 | 0 | 0 | -1 | 0 | 1 | -1 | -1 | 1 | 1 |
O | 1 | 1 | 0 | 1 | 0 | 0 | -1 | 1 | 1 | -1 | -1 | 1 | 1 |
o | 1 | 1 | 0 | 1 | 0 | 0 | -1 | 1 | 1 | -1 | -1 | 1 | 1 |
ɔ | 1 | 1 | 0 | 1 | 0 | 0 | -1 | 1 | 1 | -1 | -1 | 1 | 1 |
Internally, Qumin will replace all of these identical segments by a single unified one (the first in the file). The simplified structure becomes:

Creating scales¶
Rather than using many-valued features, it is often preferable to use a few monovalent or bivalent features to create a scale. As an example, here is a possible (bad) implementation of tones, which uses a single feature "Tone".
Seg. | Tone |
---|---|
Seg. | Tone |
˥ | 3 |
˦ | 2 |
˧ | 1 |
˨ | 0 |
It results in this natural class hierarchy:

While such a file is allowed, it results in the tones having nothing in common. If some morpho-phonological alternation selects both high and mid tones, we will miss that generalization.
To express a scale, a simple solution is to create one feature less than there are segments (here four tones lead to three scale features), then fill the upper diagonal with 1 and the lower diagonal with 0 (or the opposite). For example:
Seg. | scale1 | scale2 | scale3 |
---|---|---|---|
Seg. | scale1 | scale2 | scale3 |
˥ | 1 | 1 | 1 |
˦ | 0 | 1 | 1 |
˧ | 0 | 0 | 1 |
˨ | 0 | 0 | 0 |
It will result in the natural classes below:

Since this is not very readable, the same thing can be re-written more legibly using a combination of binary and monovalent features:
Seg. | Top | High | Low | Bottom
---|---|---|---|---
Seg. | Top | High | Low | Bottom
˥ | 1 | 1 | 0 |
˦ | 0 | 1 | 0 |
˧ | | 0 | 1 | 0
˨ | | 0 | 1 | 1
Which leads to the same structure:

When implementing tones, I recommend marking them all as [-segmental] to ensure that they share a common class, and marking all other segments as [+segmental].
Diphthongs¶
Diphthongs are not usually decomposed using distinctive features, as they are complex sequences (see this question on the Linguist List). However, if diphthongs alternate with simple vowels in your data, adding diphthongs in the list of phonological segments can allow Qumin to capture better generalizations. The strategy I have employed so far is the following:
- Write diphthongs in a non-ambiguous way in the data (either ‘aj’ or ‘aˑi’, but not ‘ai’ when the same sequence can sometimes be two vowels)
- Copy the features from the initial vowel
- Add a monovalent feature [DIPHTHONG]
- Add monovalent features [DIPHTHONG_J], [DIPHTHONG_W], etc, as needed.
This is a small example for a few English diphthongs:
Seg. | high | low | back | LABIAL | tense | diphtong j | diphtong ə | diphtong W | diphtong
---|---|---|---|---|---|---|---|---|---
Seg. | high | low | back | LAB | tens | diph.j | diph.ə | diph.w | diph
a | 0 | 1 | 0 | | 1 | | | | 0
aˑʊ | 0 | 1 | 1 | | 1 | | | 1 | 1
aˑɪ | 0 | 1 | 1 | | 1 | 1 | | | 1
ɪ | 1 | 0 | 0 | | 0 | | | | 0
ɪˑə | 1 | 0 | 0 | | 0 | | 1 | | 1
Which leads to the following classes:

Others¶
- Stress: I recommend marking it directly on vowels, duplicating the vowel inventory to have both stressed and unstressed counterparts. A simple binary [±stress] feature is enough to distinguish them.
- Length: Similarly, I recommend marking length, when possible, as a feature on vowels, rather than writing the vowel twice.
Usages¶
Usage of bin/find_patterns.py¶
Find pairwise alternation patterns from paradigms. This is a preliminary step necessary to obtain patterns used as input in the three scripts below.
Computing automatically aligned patterns for paradigm entropy or macroclass:
bin/$ python3 find_patterns.py <paradigm.csv> <segments.csv>
Computing automatically aligned patterns for lattices:
bin/$ python3 find_patterns.py -d -o -c <paradigm.csv> <segments.csv>
The option -k allows one to choose the algorithm for inferring alternation patterns.
Option | Description | Strategy
---|---|---
endings | Affixes | Removes the longest common initial string for each row.
endingsPairs | Pairs of affixes | Endings, tabulated as pairs for all combinations of columns.
endingsDisc | Discontinuous endings | Removes the longest common substring, left aligned.
…Alt | Alternations | Alternations have no contexts. These were used for comparing macroclass strategies on French and European Portuguese.
globalAlt | Alternations | As endingsDisc, tabulated as pairs for all combinations of columns.
localAlt | Alternations | Inferred from local pairs of cells, left aligned.
patterns… | Binary patterns | All patterns have alternations and generalized contexts. Various alignment strategies are offered for comparison. Arbitrary numbers of changes are supported.
patternsLevenshtein | Patterns | Aligned with simple edit distance.
patternsPhonsim | Patterns | Aligned with edit distances based on phonological similarity.
patternsSuffix | Patterns | Fixed left alignment, only interesting for suffixal languages.
patternsPrefix | Patterns | Fixed right alignment, only interesting for prefixal languages.
patternsBaseline | Patterns | Baseline alignment, following Albright & Hayes 2002: a single change, with a priority order suffixation > prefixation > stem-internal alternation (ablaut/infixation).
Most of these were implemented for comparison purposes. I recommend using the default patternsPhonsim in most cases. To avoid relying on your phonological features file for alignment scores, use patternsLevenshtein. Only these two are full patterns, with generalization in both the context and the alternation.
For lattices, we keep defective and overabundant entries; we do not usually keep them for other applications. The latest code for entropy can handle defective entries. The file to use as input for the scripts below has a name ending in "_patterns". The "_human_readable_patterns" file is nicer to review, but is only meant for human consumption.
Usage of bin/calc_paradigm_entropy.py¶
Compute entropies of inflectional paradigms' distributions.
Computing entropies from one cell
bin/$ python3 calc_paradigm_entropy.py -o <patterns.csv> <paradigm.csv> <segments.csv>
Computing entropies from one cell, with a split dataset
bin/$ python3 calc_paradigm_entropy.py -names <data1 name> <data2 name> -b <patterns1.csv> <paradigm1.csv> -o <patterns2.csv> <paradigm2.csv> <segments.csv>
Computing entropies from two cells
bin/$ python3 calc_paradigm_entropy.py -n 2 <patterns.csv> <paradigm.csv> <segments.csv>
More complete usage can be obtained by typing
bin/$ python3 calc_paradigm_entropy.py --help
With --nPreds and N > 2, the computation can get quite long on large datasets.
Usage of bin/find_macroclasses.py¶
Cluster lexemes in macroclasses according to alternation patterns.
Inferring macroclasses
bin/$ python3 find_macroclasses.py <patterns.csv> <segments.csv>
More complete usage can be obtained by typing
bin/$ python3 find_macroclasses.py --help
The options "-m UPGMA", "-m CD" and "-m TD" are experimental and will not undergo further development; use them at your own risk. The default is to use Description Length (DL) and a bottom-up algorithm (BU).
Usage of bin/make_lattice.py¶
Infer inflection classes as a lattice from alternation patterns. This will produce a context and an interactive html file.
Inferring a lattice of inflection classes, with html output
bin/$ python3 make_lattice.py --html <patterns.csv> <segments.csv>
More complete usage can be obtained by typing
bin/$ python3 make_lattice.py --help
bin¶
clustering package¶
Submodules¶
clustering.algorithms module¶
Algorithms for inflection classes clustering.
Author: Sacha Beniamine
clustering.algorithms.bottom_up_clustering(patterns, microclasses, Clusters, **kwargs)[source]¶
Cluster microclasses in a bottom-up agglomerative fashion.

The algorithm is the following:

Begin with one cluster per microclass.
While there is more than one cluster:
    Find the best possible merge of two clusters, among all possible pairs.
    Perform this merge.

Scoring, finding the best merges, and merging nodes depend on the Clusters class.

Parameters:
- patterns (pandas.DataFrame) – a dataframe of patterns.
- microclasses (dict of str: list) – mapping of microclass exemplars to microclass inventories.
- Clusters – a cluster class to use in clustering.
- kwargs – any keyword arguments to pass to Clusters.
clustering.algorithms.hierarchical_clustering(patterns, Clusters, clustering_algorithm=<function bottom_up_clustering>, **kwargs)[source]¶
Perform hierarchical clustering on patterns according to a clustering algorithm and a measure.

This function finds microclasses, performs the clustering, finds the macroclasses (and exports them), and returns the inflection class tree.

Scoring, finding the best merges, and merging nodes depend on the Clusters class.

Parameters:
- patterns (pandas.DataFrame) – a dataframe of strings representing alternation patterns.
- Clusters – a cluster class to use in clustering.
- clustering_algorithm (func) – a clustering algorithm.
- kwargs – any keyword arguments to pass to Clusters. Some keywords are mandatory: "prefix" should be the log file prefix, and "patterns" should be a function for pattern finding.
clustering.algorithms.top_down_clustering(patterns, microclasses, Clusters, **kwargs)[source]¶
Cluster microclasses in a top-down recursive fashion.

The algorithm is the following:

Begin with one unique cluster containing all microclasses, and one empty cluster.
While we are seeing an improvement:
    Find the best possible shift of a microclass from one cluster to another.
    Perform this shift.
Build a binary node with the two clusters.
Recursively apply the same algorithm to each.

The algorithm stops when it reaches leaves, or when no shift improves the score.

Scoring, finding the best shifts, and updating the nodes depend on the Clusters class.

Parameters:
- patterns (pandas.DataFrame) – a dataframe of patterns.
- microclasses (dict of str: list) – mapping of microclass exemplars to microclass inventories.
- Clusters – a cluster class to use in clustering.
- kwargs – any keyword arguments to pass to Clusters.
clustering.clusters module¶
Base classes to make clustering decisions and build inflection class trees.
Author: Sacha Beniamine
class clustering.clusters.BUComparisonClustersBuilder(*args, DecisionMaker=None, Annotator=None, **kwargs)[source]¶
Bases: clustering.clusters._BUClustersBuilder

Comparison between measures for bottom-up hierarchical clustering of inflection classes.

This class takes two _BUClustersBuilder classes, a DecisionMaker and an Annotator. The DecisionMaker is used to find the ordered merges. When merging, the merge is performed on both classes, and the Annotator's values (description length or distances) are used to annotate the trees of the DecisionMaker.

Variables:
- microclasses (dict of str: list) – Inherited. Mapping of microclass exemplars to microclass inventories.
- nodes (dict of frozenset: Node) – Inherited. Maps frozensets of microclass exemplars to Nodes representing clusters.
- preferences (dict) – Inherited. Configuration parameters.
- DecisionMaker (clustering.clusters._BUClustersBuilder) – A class to use for finding the ordered merges.
- Annotator (clustering.clusters._BUClustersBuilder) – A class to use for annotating the DecisionMaker's trees.
clustering.descriptionlength module¶
Classes to make clustering decisions and build inflection class trees according to description length.
Author: Sacha Beniamine
class clustering.descriptionlength.BUDLClustersBuilder(microclasses, paradigms, **kwargs)[source]¶
Bases: clustering.descriptionlength._DLClustersBuilder, clustering.clusters._BUClustersBuilder

Bottom-up builder for hierarchical clusters of inflection classes, with description length based decisions.

This class holds two representations of the clusters it builds. On the one hand, the class Cluster represents the information needed to compute the description length of a cluster. On the other hand, the class Node represents the inflection classes being built. A Node can have children and a parent; a Cluster can be split or merged.

This class inherits attributes.

Variables:
- microclasses (dict of str: list) – Inherited. Mapping of microclass exemplars to microclass inventories.
- nodes (dict of frozenset: Node) – Inherited. Maps frozensets of microclass exemplars to Nodes representing clusters.
- preferences (dict) – Inherited. Configuration parameters.
- attr (str) – Inherited. (class attribute) Always has the value "DL", as the nodes of the inflection class tree have a "DL" attribute.
- DL (float) – Inherited. A description length DL, with DL(system) = DL(M) + DL(C) + DL(P) + DL(R).
- M (float) – Inherited. DL(M), the cost in bits to express the mapping between lexemes and microclasses.
- C (float) – Inherited. DL(C), the cost in bits to express the mapping between microclasses and clusters.
- P (float) – Inherited. DL(P), the cost in bits to express the relation between clusters and patterns.
- R (float) – Inherited. DL(R), the cost in bits to disambiguate which pattern to use in each cluster for each microclass.
- clusters (dict of frozenset: Cluster) – Inherited. Clusters, indexed by a frozenset of microclass exemplars.
- patterns (dict of str: Counter) – Inherited. A dict mapping pairs of cells to a count of patterns, giving the number of clusters presenting each pattern for each pair of cells:

{ str: Counter({Pattern: int }) }
pairs of cells -> pattern -> number of clusters with this pattern for this cell

Note that the Counter's length is written on a .length attribute, to avoid calling len() repeatedly. Remark that the count is not the same as in the class Cluster.
- size (int) – Inherited. The size of the whole system in microclasses.
class clustering.descriptionlength.Cluster(*args)[source]¶
Bases: object

A single cluster in MDL clustering.

A Cluster is iterable; iterating on a cluster is iterating on its patterns. Clusters can be merged or separated by adding or subtracting them.

Variables:
- patterns (defaultdict) – For each pair of cells in the paradigms under consideration, holds a counter of the number of microclasses using each pattern in this cluster and pair of cells:

{ str: Counter({Pattern: int }) }
pairs of cells -> pattern -> number of microclasses using this pattern for this cell

Note that the Counter's length is written on a .length attribute, to avoid calling len() repeatedly.
- labels (set) – the set of all exemplars representing the microclasses in this cluster.
- size (int) – The size of this cluster. Depending on external parameters, this can be the number of microclasses or the number of lexemes belonging to the cluster.
- totalsize (int) – The size of the whole system of clusters, either the number of microclasses in the system, or the number of lexemes in the system.
- R – The cost in bits to disambiguate, for each pair of cells, which pattern is to be used with which microclass.
- C – The contribution of this cluster to the cost of mapping from microclasses to clusters.

__init__(*args)[source]¶
Initialize a single cluster.

Parameters: args (str) – Names (exemplars) of each microclass belonging to the cluster.

init_from_paradigm(class_size, paradigms, size)[source]¶
Populate fields according to a paradigm column.

This assumes an initialization with only one microclass.

Parameters:
- class_size (int) – the size of the microclass.
- paradigms (pandas.DataFrame) – a dataframe of patterns.
- size (int) – total size.
class clustering.descriptionlength.TDDLClustersBuilder(microclasses, paradigms, **kwargs)[source]¶
Bases: clustering.descriptionlength._DLClustersBuilder

Top-down builder for hierarchical clusters of inflection classes, with description length based decisions.

This class holds two representations of the clusters it builds. On the one hand, the class Cluster represents the information needed to compute the description length of a cluster. On the other hand, the class Node represents the inflection classes being built. A Node can have children and a parent; a Cluster can be split or merged.

This class inherits attributes.

Variables:
- microclasses (dict of str: list) – Inherited. Mapping of microclass exemplars to microclass inventories.
- nodes (dict of frozenset: Node) – Inherited. Maps frozensets of microclass exemplars to Nodes representing clusters.
- preferences (dict) – Inherited. Configuration parameters.
- attr (str) – Inherited. (class attribute) Always has the value "DL", as the nodes of the inflection class tree have a "DL" attribute.
- DL (float) – Inherited. A description length DL, with DL(system) = DL(M) + DL(C) + DL(P) + DL(R).
- M (float) – Inherited. DL(M), the cost in bits to express the mapping between lexemes and microclasses.
- C (float) – Inherited. DL(C), the cost in bits to express the mapping between microclasses and clusters.
- P (float) – Inherited. DL(P), the cost in bits to express the relation between clusters and patterns.
- R (float) – Inherited. DL(R), the cost in bits to disambiguate which pattern to use in each cluster for each microclass.
- clusters (dict of frozenset: Cluster) – Inherited. Clusters, indexed by a frozenset of microclass exemplars.
- patterns (dict of str: Counter) – Inherited. A dict mapping pairs of cells to a count of patterns, giving the number of clusters presenting each pattern for each pair of cells:

{ str: Counter({Pattern: int }) }
pairs of cells -> pattern -> number of clusters with this pattern for this cell

Note that the Counter's length is written on a .length attribute, to avoid calling len() repeatedly. Remark that the count is not the same as in the class Cluster.
- size (int) – Inherited. The size of the whole system in microclasses.
- minDL (float) – The minimum description length yet encountered.
- history (dict of frozenset: tuple) – dict associating partitions with (M, C, P, R, DL) tuples.
- left (Cluster) – left and right are temporary clusters used to divide a current cluster in two.
- right (Cluster) – see left.
- to_split (Node) – the node that we are currently trying to split.

__init__(microclasses, paradigms, **kwargs)[source]¶
Constructor.

Parameters:
- microclasses (dict of str: list) – mapping of microclass exemplars to microclass inventories.
- paradigms (pandas.DataFrame) – a dataframe of patterns.
- kwargs – keyword arguments to be used as configuration.

find_ordered_shifts()[source]¶
Find the list of all best shifts of a microclass between right and left.

The list is a list of tuples of length 2, containing the label of a node to shift and the description length of the node being split if we perform the shift.

initialize_clusters(paradigms)[source]¶
Initialize clusters with one cluster per microclass, plus one for the whole system.

Parameters: paradigms (pandas.DataFrame) – a dataframe of patterns.

initialize_nodes()[source]¶
Initialize nodes with only one root node, whose children are all the microclasses.

initialize_subpartition(node)[source]¶
Initialize left and right as a subpartition of a node we want to split.

Parameters: node (Node) – The node to be split.
clustering.descriptionlength.weighted_log(symbol_count, message_length)[source]¶
Compute \(-\log_{2}(\mathrm{symbol\_count}/\mathrm{message\_length}) \times \mathrm{message\_length}\).

This corresponds to the product inside the sum of the description length formula when probabilities are estimated on frequencies.

Parameters: symbol_count (int) – count of a symbol; message_length (int) – total length of the message.

Returns: the weighted log.
Return type: (float)
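As a sanity check, the documented formula can be transcribed directly (a minimal sketch mirroring the formula above, not the module's own code):

import math

def weighted_log(symbol_count, message_length):
    """-log2(symbol_count / message_length) * message_length."""
    return -math.log2(symbol_count / message_length) * message_length

# A symbol occurring 2 times in a message of length 8 has probability 1/4,
# i.e. 2 bits, weighted here by the message length:
print(weighted_log(2, 8))   # 16.0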
clustering.distances module¶
Classes and functions to make clustering decisions and build inflection class trees according to distances.
Still experimental, and unfinished.
Author: Sacha Beniamine
class clustering.distances.CompressionDistClustersBuilder(*args, **kwargs)[source]¶
Bases: clustering.distances._DistanceClustersBuilder

Builder for bottom-up hierarchical clusters of inflection classes with compression distance.

This class inherits attributes.

Variables:
- microclasses (dict of str: list) – Inherited. Mapping of microclass exemplars to microclass inventories.
- nodes (dict of frozenset: Node) – Inherited. Maps frozensets of microclass exemplars to Nodes representing clusters.
- preferences (dict) – Inherited. Configuration parameters.
- attr (str) – Inherited. Always has the value "DL", as the nodes of the inflection class tree have a "DL" attribute.
- paradigms (pandas.DataFrame) – Inherited. A dataframe of patterns.
- distances (dict) – Inherited. The distance score_matrix between clusters.
- DL_dict (dict of frozenset: float) – Maps each cluster to its description length.
- min_DL (float) – the lowest description length for the whole system yet encountered.
clustering.distances.DL(messages)[source]¶
Compute the description length of a list of messages encoded separately.

Parameters: messages (list) – List of lists of symbols. Symbols are str. They are treated as atomic.
class clustering.distances.UPGMAClustersBuilder(*args, **kwargs)[source]¶
Bases: clustering.distances._DistanceClustersBuilder

Builder for UPGMA hierarchical clusters of inflection classes with hamming distance.

Variables:
- microclasses (dict of str: list) – Inherited. Mapping of microclass exemplars to microclass inventories.
- nodes (dict of frozenset: Node) – Inherited. Maps frozensets of microclass exemplars to Nodes representing clusters.
- preferences (dict) – Inherited. Configuration parameters.
- attr (str) – Inherited. Always has the value "DL", as the nodes of the inflection class tree have a "DL" attribute.
- paradigms (pandas.DataFrame) – Inherited. A dataframe of patterns.
- distances (dict) – Inherited. The distance score_matrix between clusters.
clustering.distances.compression_distance(a, b, merged)[source]¶
Compute the compression distance between description lengths.
clustering.distances.compression_distance_atomic(x, y, table, microclasses, *args, **kwargs)[source]¶
Compute the compression distance between microclasses x and y from their exemplars.

Parameters:
- x (str) – A microclass exemplar.
- y (str) – A microclass exemplar.
- table (pandas.DataFrame) – a dataframe of patterns.
- microclasses (dict of str: list) – mapping of microclass exemplars to microclass inventories.
clustering.distances.dist_matrix(table, *args, labels=None, distfun=<function hamming>, half=False, default=inf, **kwargs)[source]¶
Output a distance score_matrix between clusters.

Parameters:
- table (pandas.DataFrame) – a dataframe of patterns.
- distfun (fun) – distance function.
- labels (iterable) – the labels between which to compute distances. Defaults to the table's index.
- half (bool) – Whether to fill only half of the score_matrix.
- default (float) – Default distance.

Returns: the similarity score_matrix.
Return type: distances (dict)
clustering.distances.hamming(x, y, table, *args, **kwargs)[source]¶
Compute the hamming distance between x and y in table.

Parameters:
- x (any iterable) – vector.
- y (any iterable) – vector.
- table (pandas.DataFrame) – a dataframe of patterns.

Returns: the hamming distance between x and y.
Return type: (int)
clustering.distances.split_description(descriptions)[source]¶
Split each description of a list on spaces to obtain symbols.
clustering.distances.table_to_descr(table, exemplars, microclasses)[source]¶
Create a list of descriptions from a paradigmatic table.

Parameters:
- table (pandas.DataFrame) – a dataframe of patterns.
- exemplars (iterable of str) – The microclasses to include in the description.
- microclasses (dict of str: list) – mapping of microclass exemplars to microclass inventories.
clustering.utils module¶
Utilities used in clustering.
Author: Sacha Beniamine.
class clustering.utils.Node(labels, children=None, **kwargs)[source]¶
Bases: object

Represents an inflection class tree.

Variables:
- labels (list) – labels of all the leaves under this node.
- children (list) – direct children of this node.
- attributes (dict) – attributes for this node. Currently, the following attributes are expected: size (int): size of the group represented by this node; DL (float): description length for this node; color (str): color of the splines from this node to its children, in a format usable by pyplot (currently, red ("r") is used when the node didn't decrease description length, blue ("b") otherwise); macroclass (bool): is the node in a macroclass?; macroclass_root (bool): is the node the root of a macroclass?

The attributes "_x" and "_rank" are reserved, and will be overwritten by the draw function.

__init__(labels, children=None, **kwargs)[source]¶
Node constructor.

Parameters:
- labels (iterable) – labels of all the leaves under this node.
- children (list) – direct children of this node.
- kwargs – any other keyword argument will be added as a node attribute. Note that certain algorithms expect the Node to have (int) "size", (str) "color", (bool) "macroclass", or (float) "DL" attributes. The attributes "_x" and "_rank" are reserved, and will be overwritten by the draw function.
draw(horizontal=False, square=False, leavesfunc=<function Node.<lambda>>, nodefunc=None, label_rotation=None, annotateOnlyMacroclasses=False, point=None, edge_attributes=None, interactive=False, lattice=False, pos=None, **kwargs)[source]¶
Draw the tree as a dendrogram-style pyplot graph.

The square and horizontal options combine freely: square draws right-angled splines rather than diagonals, and horizontal puts the leaves on the y axis rather than the x axis.

Parameters:
- horizontal (bool) – Should the tree be drawn with leaves on the y axis? (Defaults to False: leaves on the x axis.)
- square (bool) – Should the tree splines be squared with 90° angles? (Defaults to False.)
- leavesfunc (fun) – A function applied to leaves before writing them down. Takes a Node, returns a str.
- nodefunc (fun) – A function applied to nodes to annotate them. Takes a Node, returns a str.
- keep_above_macroclass (bool) – For macroclass history trees: should the edges above macroclasses be drawn? (Defaults to True.)
- annotateOnlyMacroclasses – For macroclass history trees: if True and nodelabel isn't None, only the macroclass nodes are annotated.
- point (fun) – A function that maps a node to point attributes.
- edge_attributes (fun) – A function that maps a pair of nodes to edge attributes. By default, uses the parent's color, with a "-" linestyle for nodes and "--" for leaves.
- interactive (bool) – Whether this is destined to create an interactive plot.
- lattice (bool) – Whether this node is a lattice rather than a tree.
- pos (dict) – A dictionary from node labels to (x, y) positions, compatible with networkx layout functions. If absent, networkx's graphviz layout is used.
to_latex(nodelabel=None, vertical=True, level_dist=50, square=True, leavesfunc=<function Node.<lambda>>, scale=1)[source]¶
Return a latex string, compatible with tikz-qtree.

Parameters:
- nodelabel – The name of the attribute to write on the nodes.
- vertical – Should the tree be drawn vertically?
- level_dist – Distance between levels.
- square – Should the arcs have a squared shape?
- leavesfunc (fun) – A function applied to leaves before writing them down. Takes a Node, returns a str.
- scale (int) – defaults to 1; the tikzpicture scale argument.
clustering.utils.find_microclasses(paradigms)[source]¶
Find microclasses in a paradigm (lexemes with identical rows).

This is useful to identify an exemplar of each inflectional microclass, and to limit further computation to the collection of these exemplars.

Parameters: paradigms (pandas.DataFrame) – a dataframe containing inflectional paradigms. Columns are cells, and rows are lemmas.

Returns: microclasses (dict). Its keys are exemplars, its values are lists of the names of rows identical to the exemplar. Each exemplar represents a microclass.

>>> classes
{"a": ["a", "A", "aa"], "b": ["b", "B", "BBB"]}
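The idea can be sketched with pandas: rows of the table which are entirely identical belong to the same microclass (an illustration of the principle, not the actual implementation; the cell pairs and pattern labels are invented):

import pandas as pd

patterns = pd.DataFrame(
    {"cellA~cellB": ["p1", "p1", "p2"],
     "cellB~cellC": ["q1", "q1", "q2"]},
    index=["a", "A", "b"])

microclasses = {}
for _, group in patterns.groupby(list(patterns.columns)):
    exemplar = group.index[0]          # first lexeme stands as exemplar
    microclasses[exemplar] = list(group.index)

print(microclasses)   # {'a': ['a', 'A'], 'b': ['b']}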
Module contents¶
entropy package¶
Submodules¶
entropy.distribution module¶
author: Sacha Beniamine.
Encloses distribution of patterns on paradigms.
class entropy.distribution.PatternDistribution(paradigms, patterns, pat_dic, features=None)[source]¶
Bases: object

Statistical distribution of patterns.

Variables:
- paradigms (pandas.DataFrame) – containing forms.
- patterns (pandas.DataFrame) – containing pairwise patterns of alternation.
- classes (pandas.DataFrame) – containing a representation of applicable patterns from one cell to another. Indexes are lemmas.
- entropies (dict of int: pandas.DataFrame) – dict mapping n to a dataframe containing the entropies for the distribution \(P(c_{1}, ..., c_{n} \to c_{n+1})\).
__init__(paradigms, patterns, pat_dic, features=None)[source]¶
Constructor for PatternDistribution.

Parameters:
- patterns (pandas.DataFrame) – patterns (columns are pairs of cells, index are lemmas).
- logfile (TextIOWrapper) – Flow on which to write a log.
entropy_matrix(silent=False)[source]¶
Return a pandas.DataFrame with unary entropies, and one with counts of lexemes.

The result contains the entropy \(H(c_{1} \to c_{2})\).

Values are computed for all unordered combinations of \((c_{1}, c_{2})\) in the PatternDistribution.paradigms columns. Indexes are the predictor cells \(c_{1}\) and columns are the predicted cells \(c_{2}\).

Example: for two cells c1, c2, the entropy of c1 → c2, noted \(H(c_{1} \to c_{2})\), is:

\[H(patterns_{c1, c2} \mid classes_{c1, c2})\]
n_preds_distrib_log(logfile, n, sanity_check=False)[source]¶
Print a log of the probability distribution for two predictors.

Writes down the distributions:

\[P(patterns_{c1, c3}, patterns_{c2, c3} \mid classes_{c1, c3}, classes_{c2, c3}, patterns_{c1, c2})\]

for all unordered combinations of column names in PatternDistribution.paradigms.

Parameters:
- logfile (io.TextIOWrapper) – Output flow on which to write.
- n (int) – number of predictors.
- sanity_check (bool) – Use a slower calculation to check that the results are exact.
n_preds_entropy_matrix(n)[source]¶
Return a pandas.DataFrame with n-ary entropies, and one with counts of lexemes.

The result contains the entropy \(H(c_{1}, ..., c_{n} \to c_{n+1})\).

Values are computed for all unordered combinations of \((c_{1}, ..., c_{n+1})\) in the PatternDistribution.paradigms columns. Indexes are tuples \((c_{1}, ..., c_{n})\) and columns are the predicted cells \(c_{n+1}\).

Example: for three cells c1, c2, c3 (n=2), the entropy of c1, c2 → c3, noted \(H(c_{1}, c_{2} \to c_{3})\), is:

\[H(patterns_{c1, c3}, patterns_{c2, c3} \mid classes_{c1, c3}, classes_{c2, c3}, patterns_{c1, c2})\]

Parameters: n (int) – number of predictors.
one_pred_distrib_log(logfile, sanity_check=False)[source]¶
Print a log of the probability distribution for one predictor.

Writes down the distributions \(P(patterns_{c1, c2} \mid classes_{c1, c2})\) for all unordered combinations of two column names in PatternDistribution.paradigms. Also writes the entropy of the distributions.

Parameters:
- logfile (io.TextIO) – Output flow on which to write.
- sanity_check (bool) – Use a slower calculation to check that the results are exact.
read_entropy_from_file(filename)[source]¶
Read already computed entropies from a file.

Parameters: filename – the file's path.
value_check(n, logfile=None)[source]¶
Check that predicting from n predictors isn't harder than with fewer.

Check that the entropy from n predictors c1, ..., cn is lower than the entropy from n-1 predictors c1, ..., cn-1 (for all computed n-predictor entropies).

Parameters:
- n – number of predictors.
- logfile (io.TextIOWrapper) – Output flow on which to write the details of the result (optional).
class entropy.distribution.SplitPatternDistribution(paradigms_list, patterns_list, pat_dic_list, names, logfile=None, features=None)[source]¶
Bases: entropy.distribution.PatternDistribution

Implicative entropy distribution for split systems.

Split system entropy is the joint entropy on both systems.
entropy.utils module¶
entropy.utils.P(x, subset=None)[source]¶
Return the probability distribution of elements in a pandas.core.series.Series.

Parameters:
- x (pandas.core.series.Series) – A series of data.
- subset (iterable) – Only give the distribution for a subset of values.

Returns: A pandas.core.series.Series whose index are x's elements and whose values are their probability in x.
entropy.utils.cond_P(A, B, subset=None)[source]¶
Return the conditional probability distribution P(A|B) for elements in two pandas.core.series.Series.

Parameters:
- A (pandas.core.series.Series) – A series of data.
- B (pandas.core.series.Series) – A series of data.
- subset (iterable) – Only give the distribution for a subset of values.

Returns: A pandas.core.series.Series with two indexes. The first index is from the elements of B, the second from the elements of A. The values are P(A|B).
entropy.utils.cond_entropy(A, B, **kwargs)[source]¶
Calculate the conditional entropy between two series of data points. Presupposes that values in the series are of the same type, typically tuples.

Parameters:
- A (pandas.core.series.Series) – A series of data.
- B (pandas.core.series.Series) – A series of data.

Returns: H(A|B)
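The definition of H(A|B) can be checked against a direct computation over two aligned series (a sketch following the definitions above, not Qumin's implementation; the data is invented):

import math
import pandas as pd

A = pd.Series(["x", "x", "y", "y"])
B = pd.Series(["b1", "b1", "b1", "b2"])

def cond_entropy(A, B):
    joint = pd.crosstab(B, A, normalize=True)   # joint probabilities P(A, B)
    p_b = joint.sum(axis=1)                     # marginal P(B)
    h = 0.0
    for b in joint.index:
        for a in joint.columns:
            p_ab = joint.loc[b, a]
            if p_ab > 0:
                h -= p_ab * math.log2(p_ab / p_b[b])
    return h

print(round(cond_entropy(A, B), 3))   # 0.689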
Module contents¶
lattice package¶
Submodules¶
lattice.lattice module¶
class lattice.lattice.ICLattice(dataframe, leaves, annotate=None, dummy_formatter=None, keep_names=True, comp_prefix=None, col_formatter=None, na_value=None, AOC=False, collections=False, verbose=True)[source]¶
Bases: object

Inflection Class Lattice.

This is a wrapper around concepts.Context.

__init__(dataframe, leaves, annotate=None, dummy_formatter=None, keep_names=True, comp_prefix=None, col_formatter=None, na_value=None, AOC=False, collections=False, verbose=True)[source]¶

Parameters:
- dataframe (pandas.DataFrame) – A dataframe.
- leaves (dict) – Dictionary of microclasses.
- annotate (dict) – Extra annotations to add on the lattice, of the form {<object label>: <annotation>}.
- dummy_formatter (func) – Function to make dummies from the table (defaults to pandas').
- keep_names (bool) – whether to keep original column names when dropping duplicate dummy columns.
- comp_prefix (str) – If there are two sets of properties, the prefix used to distinguish column names.
- AOC (bool) – Whether to limit ourselves to Attribute or Object Concepts.
- col_formatter (func) – Function to format columns in the context table.
- na_value – A value to use as "NA". Defaults to None.
- collections (bool) – Whether the table contains representations.patterns.PatternCollection objects.
-
draw
(filename, title='Lattice', **kwargs)[source]¶ Draw the lattice using
clustering.Node
’s drawing function.
-
parents
(identifier)[source]¶ Return all direct parents of a node which corresponds to the identifier.
-
lattice.lattice.to_dummies(table, **kwargs)[source]¶
Make a context table from a dataframe.
Parameters: table (pandas.DataFrame) – A dataframe of patterns or strings.
Returns: A context table.
Return type: dummies (pandas.DataFrame)
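Since the ICLattice docstring above names pandas' dummy encoding as the default formatter, the core of a context table can be sketched with pandas.get_dummies; this is only an approximation of Qumin's to_dummies, which handles more options.
import pandas as pd

def to_dummies_sketch(table):
    # One boolean column per (cell, pattern) value: rows are objects and
    # columns are properties, as in a formal-concept-analysis context table.
    return pd.get_dummies(table).astype(bool)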
Module contents¶
representations package¶
Submodules¶
representations.alignment module¶
author: Sacha Beniamine.
This module is used to align sequences.
-
representations.alignment.align_auto(s1, s2, insert_cost, sub_cost, distance_only=False, fillvalue='', **kwargs)[source]¶
Return all the best alignments of two words according to some edit distance matrix.
Parameters:
- s1 (str) – first word to align.
- s2 (str) – second word to align.
- insert_cost (func) – A function which takes one value and returns an insertion cost.
- sub_cost (func) – A function which takes two values and returns a substitution cost.
- distance_only (bool) – defaults to False. If True, return only the best distance; if False, return alignments.
- fillvalue – (optional) the value with which to pad when iterables have varying lengths. Defaults to "".
Returns: Either alignments (a list of lists of zipped tuples), or a distance (if distance_only is True).
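A minimal dynamic-programming sketch of the distance_only=True behaviour, assuming insert_cost and sub_cost as described above; recovering all best alignments additionally requires backtracking over the table.
def best_distance_sketch(s1, s2, insert_cost, sub_cost):
    n, m = len(s1), len(s2)
    dist = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):  # delete the whole prefix of s1
        dist[i][0] = dist[i - 1][0] + insert_cost(s1[i - 1])
    for j in range(1, m + 1):  # insert the whole prefix of s2
        dist[0][j] = dist[0][j - 1] + insert_cost(s2[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dist[i][j] = min(
                dist[i - 1][j] + insert_cost(s1[i - 1]),              # deletion
                dist[i][j - 1] + insert_cost(s2[j - 1]),              # insertion
                dist[i - 1][j - 1] + sub_cost(s1[i - 1], s2[j - 1]),  # substitution
            )
    return dist[n][m]
For instance, best_distance_sketch("mɪs", "mas", lambda c: 1, lambda a, b: 0 if a == b else 1) yields the plain Levenshtein distance, 1.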
-
representations.alignment.align_baseline(*args, **kwargs)[source]¶
Simple alignment intended as an inflectional baseline (Albright & Hayes 2002): a single change, either prefixal, suffixal, or infixal. This doesn't work well when there is both a prefix and a suffix. Used as a baseline for the evaluation of the auto-aligned patterns.
See "Modeling English Past Tense Intuitions with Minimal Generalization", Albright, A. & Hayes, B., Proceedings of the ACL-02 Workshop on Morphological and Phonological Learning - Volume 6, Association for Computational Linguistics, 2002, 58-69, page 2:
"The exact procedure for finding a word-specific rule is as follows: given an input pair (X, Y), the model first finds the maximal left-side substring shared by the two forms (e.g., #mɪs), to create the C term (left side context). The model then examines the remaining material and finds the maximal substring shared on the right side, to create the D term (right side context). The remaining material is the change; the non-shared string from the first form is the A term, and from the second form is the B term."
Examples
>>> align_baseline("mɪs","mas")
[('m', 'm'), ('ɪ', 'a'), ('s', 's')]
>>> align_baseline("mɪs","mɪst")
[('m', 'm'), ('ɪ', 'ɪ'), ('s', 's'), ('', 't')]
>>> align_baseline("mɪs","amɪs")
[('', 'a'), ('m', 'm'), ('ɪ', 'ɪ'), ('s', 's')]
>>> align_baseline("mɪst","amɪs")
[('m', 'a'), ('ɪ', 'm'), ('s', 'ɪ'), ('t', 's')]
Parameters:
- *args – any number (>= 2) of iterables.
- fillvalue – the value with which to pad when iterables have varying lengths. Defaults to "".
Returns: a list of zipped tuples.
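The quoted procedure translates directly into code. A sketch for the two-string case (the real function accepts any number of iterables):
from itertools import zip_longest

def align_baseline_sketch(s1, s2, fillvalue=""):
    # C term: maximal substring shared on the left
    i = 0
    while i < min(len(s1), len(s2)) and s1[i] == s2[i]:
        i += 1
    # D term: maximal substring shared on the right of the remaining material
    j = 0
    while j < min(len(s1), len(s2)) - i and s1[len(s1) - 1 - j] == s2[len(s2) - 1 - j]:
        j += 1
    prefix = list(zip(s1[:i], s2[:i]))
    # A/B terms: the non-shared change, padded to equal length
    change = list(zip_longest(s1[i:len(s1) - j], s2[i:len(s2) - j], fillvalue=fillvalue))
    suffix = list(zip(s1[len(s1) - j:], s2[len(s2) - j:]))
    return prefix + change + suffix
This reproduces the four doctests above, including the last one, where the presence of both a prefix and a suffix defeats the single-change assumption.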
-
representations.alignment.align_left(*args, **kwargs)[source]¶
Align all arguments to the left (wrapper around zip_longest).
Examples
>>> align_left("mɪs","mas")
[('m', 'm'), ('ɪ', 'a'), ('s', 's')]
>>> align_left("mɪs","mɪst")
[('m', 'm'), ('ɪ', 'ɪ'), ('s', 's'), ('', 't')]
>>> align_left("mɪs","amɪs")
[('m', 'a'), ('ɪ', 'm'), ('s', 'ɪ'), ('', 's')]
>>> align_left("mɪst","amɪs")
[('m', 'a'), ('ɪ', 'm'), ('s', 'ɪ'), ('t', 's')]
Parameters:
- *args – any number (>= 2) of iterables.
- fillvalue – the value with which to pad when iterables have varying lengths. Defaults to "".
Returns: a list of zipped tuples, left-aligned.
-
representations.alignment.align_multi(*strings, **kwargs)[source]¶
Levenshtein-style alignment over the arguments, taken two by two.
-
representations.alignment.align_right(*iterables, **kwargs)[source]¶
Align all arguments to the right (zip_longest with right alignment).
Examples
>>> align_right("mɪs","mas")
[('m', 'm'), ('ɪ', 'a'), ('s', 's')]
>>> align_right("mɪs","mɪst")
[('', 'm'), ('m', 'ɪ'), ('ɪ', 's'), ('s', 't')]
>>> align_right("mɪs","amɪs")
[('', 'a'), ('m', 'm'), ('ɪ', 'ɪ'), ('s', 's')]
>>> align_right("mɪst","amɪs")
[('m', 'a'), ('ɪ', 'm'), ('s', 'ɪ'), ('t', 's')]
Parameters:
- *iterables – any number (>= 2) of iterables.
- fillvalue – the value with which to pad when iterables have varying lengths. Defaults to "".
Returns: a list of zipped tuples, right-aligned.
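Both wrappers reduce to itertools.zip_longest, as the docstrings state. A minimal sketch consistent with the doctests above:
from itertools import zip_longest

def align_left_sketch(*args, fillvalue=""):
    # Pad shorter words on the right: plain zip_longest.
    return list(zip_longest(*args, fillvalue=fillvalue))

def align_right_sketch(*args, fillvalue=""):
    # Pad shorter words on the left: reverse each word, zip, reverse the result.
    pairs = zip_longest(*(seq[::-1] for seq in args), fillvalue=fillvalue)
    return list(pairs)[::-1]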
-
representations.alignment.commonprefix(*args)[source]¶
Given a list of strings, return the longest common prefix.
representations.confusables module¶
author: Sacha Beniamine.
This module is used to get characters similar to other UTF-8 characters.
representations.contexts module¶
author: Sacha Beniamine.
This module implements patterns’ contexts, which are series of phonological restrictions.
-
class representations.contexts.Context(segments)[source]¶
Bases: object
Context for an alternation pattern.
representations.generalize module¶
author: Sacha Beniamine.
This module is used to generalise patterns' contexts.
-
representations.generalize.generalize_patterns(pats, debug=False)[source]¶
Generalize these patterns' contexts.
Parameters:
- pats – an iterable of Patterns.Pattern.
- debug – whether to print debug strings.
Returns: a new Patterns.Pattern.
-
representations.generalize.incremental_generalize_patterns(*args)[source]¶
Merge patterns incrementally, as long as the merged pattern keeps the same coverage.
Attempt to merge the patterns two by two, refraining from doing so if the merged pattern doesn't match all the lexemes that led to its inference. Also attempt to merge together patterns that have not been merged with others.
Parameters: *args – the patterns.
Returns: a list of patterns, at best of length 1, at worst of the same length as the input.
representations.patterns module¶
author: Sacha Beniamine.
This module addresses the modeling of inflectional alternation patterns.
-
class representations.patterns.BinaryPattern(*args, **kwargs)[source]¶
Bases: representations.patterns.Pattern
Represents the alternation pattern between two forms.
A BinaryPattern is a Patterns.Pattern over just two forms. Applying the pattern to one of the original forms yields the other one.
As an example, we will use the following alternation from a French verb in the present tense:

cells: prs.1.sg ⇌ prs.2.pl
forms: j'amène ⇌ vous amenez
transcription: amEn ⇌ amənE

Example
>>> cells = ("prs.1.sg", "prs.2.pl")
>>> forms = ("amEn", "amənE")
>>> p = Pattern(cells, forms, aligned=False)
>>> type(p)
representations.patterns.BinaryPattern
>>> p
E_ ⇌ ə_E / am_n_ <0>
>>> p.apply("amEn", cells)
'amənE'
-
applicable(form, cell)[source]¶
Test whether this pattern matches a form, i.e. whether the pattern is applicable to the form.
Parameters:
- form (str) – a form.
- cell (str) – the cell the form belongs to.
Returns: whether the pattern is applicable to the form from that cell.
Return type: bool
-
apply(form, names, raiseOnFail=True)[source]¶
Apply the pattern to a form.
Parameters:
- form – a form, assumed to belong to the cell names[0].
- names – apply to a form of cell names[0] to produce a form of cell names[1] (defaults to self.cells). Patterns being non-oriented, it is better to pass the names argument explicitly.
- raiseOnFail (bool) – defaults to True. If True, raise an error when the pattern is not applicable to the form; if False, return None instead.
Returns: the form belonging to the opposite cell.
-
-
exception representations.patterns.NotApplicable[source]¶
Bases: Exception
Raised when a patterns.Pattern can't be applied to a form.
-
class representations.patterns.Pattern(cells, forms, aligned=False, **kwargs)[source]¶
Bases: object
Represents an alternation pattern and its context.
The pattern can be defined over an arbitrary number of forms. If there are only two forms, a patterns.BinaryPattern will be created.
Variables:
- cells (tuple) – Cell labels.
- alternation (dict of str: list of tuple) – Maps each cell's name to a list of tuples of alternating material.
- context (tuple) – Sequence of (str, Quantifier) pairs, or "{}" (which stands for the alternating material).
- score (float) – A score used to choose among patterns.
Example
>>> cells = ("prs.1.sg", "prs.1.pl", "prs.2.pl")
>>> forms = ("amEn", "amənõ", "amənE")
>>> p = patterns.Pattern(cells, forms, aligned=False)
>>> p
E_ ⇌ ə_ɔ̃ ⇌ ə_E / am_n_ <0>
-
__init__(cells, forms, aligned=False, **kwargs)[source]¶
Constructor for Pattern.
Parameters:
- cells (iterable) – Cell labels (str), in the same order as the forms.
- forms (iterable) – Forms (str) to be segmented.
- aligned (bool) – whether the forms are already aligned. Otherwise, left alignment will be performed.
-
class representations.patterns.PatternCollection(collection)[source]¶
Bases: object
Represents a set of patterns.
-
representations.patterns.are_all_identical(iterable)[source]¶
Test whether all elements in the iterable are identical.
-
representations.patterns.find_alternations(paradigms, method, **kwargs)[source]¶
Find local alternations in a DataFrame of paradigms.
For each pair of forms in the paradigm, keep only the alternating material (words are left-aligned). Return the resulting DataFrame.
Parameters:
- paradigms (pandas.DataFrame) – a dataframe containing inflectional paradigms. Columns are cells, and rows are lemmas.
- method (str) – "local" uses pairs of forms, "global" uses entire paradigms.
Returns: a dataframe with the same index as paradigms and as many columns as there are combinations of columns in paradigms, filled with segmented patterns.
Return type: pandas.DataFrame
-
representations.patterns.find_applicable(paradigms, pat_dict, disable_tqdm=False, **kwargs)[source]¶
Find all applicable rules for each form.
We call sets of applicable rules "classes". Classes are oriented: we produce two separate columns, (a, b) and (b, a), for each pair of columns (a, b) in the paradigm.
Parameters:
- paradigms (pandas.DataFrame) – paradigms (columns are cells, index are lemmas).
- pat_dict (dict) – a dict mapping a column name to a list of patterns.
- disable_tqdm (bool) – if True, do not show the progress bar.
Returns: a table associating a lemma (index) and an ordered pair of paradigm cells (columns) with a tuple representing the class of applicable patterns.
Return type: pandas.DataFrame
-
representations.patterns.find_endings(paradigms, *args, disable_tqdm=False, **kwargs)[source]¶
Find suffixes in a paradigm.
Return a DataFrame of endings, where each row has the prefix common to all of the row's cells removed.
Parameters:
- paradigms (pandas.DataFrame) – a dataframe containing inflectional paradigms. Columns are cells, and rows are lemmas.
- disable_tqdm (bool) – if True, do not show the progress bar.
Returns: a dataframe of the same shape filled with segmented endings.
Return type: pandas.DataFrame
Example
>>> df = pd.DataFrame([["amEn", "amEn", "amEn", "amənõ", "amənE", "amEn"]],
...                   columns=["prs.1.sg", "prs.2.sg", "prs.3.sg", "prs.1.pl", "prs.2.pl", "prs.3.pl"],
...                   index=["amener"])
>>> df
       prs.1.sg prs.2.sg prs.3.sg prs.1.pl prs.2.pl prs.3.pl
amener     amEn     amEn     amEn    amənõ    amənE     amEn
>>> find_endings(df)
       prs.1.sg prs.2.sg prs.3.sg prs.1.pl prs.2.pl prs.3.pl
amener       En       En       En      ənõ      ənE       En
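A minimal sketch of the row-wise operation, assuming the shared prefix is computed as a longest common prefix (cf. commonprefix in representations.alignment above):
import os
import pandas as pd

def find_endings_sketch(paradigms):
    def row_endings(row):
        forms = [f for f in row if isinstance(f, str)]
        prefix = os.path.commonprefix(forms)  # shared stem of the row
        return row.str.slice(len(prefix))     # keep only the endings
    return paradigms.apply(row_endings, axis=1)
On the example above, the prefix shared by the six forms of amener is am, leaving the endings En, En, En, ənõ, ənE, En.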
-
representations.patterns.find_patterns(paradigms, method, **kwargs)[source]¶
Find patterns in a DataFrame according to one of several general methods:
- suffix (align left),
- prefix (align right),
- baseline (see Albright & Hayes 2002),
- levenshtein (dynamic alignment using Levenshtein scores),
- similarity (dynamic alignment using segment similarity scores).
Parameters:
- paradigms (pandas.DataFrame) – paradigms (columns are cells, index are lemmas).
- method (str) – "suffix", "prefix", "baseline", "levenshtein" or "similarity".
Returns: (patterns, pat_dict), where patterns is the created pandas.DataFrame and pat_dict is a dict mapping each column name to a list of patterns.
Return type: tuple
-
representations.patterns.from_csv(filename, defective=True, overabundant=True)[source]¶
Read a patterns DataFrame from a csv file.
representations.quantity module¶
author: Sacha Beniamine.
This module provides Quantity objects to represent quantifiers.
-
class representations.quantity.Quantity(mini, maxi)[source]¶
Bases: object
Represents a quantifier as an interval.
This is a flyweight class, with the following presets:

description  mini  maxi  regex symbol  variable name
Match one    1     1                   quantity.one
Optional     0     1     ?             quantity.optional
Some         1     inf   +             quantity.some
Any          0     inf   *             quantity.kleenestar
None         0     0
-
representations.quantity.quantity_largest(args)[source]¶
Reduce on the "&" operator of quantities.
Return a quantity with the minimum left value and the maximum right value.
Example
>>> quantity_largest([Quantity(0,1), Quantity(1,1), Quantity(1,np.inf)])
Quantity(0,np.inf)
Parameters: args – an iterable of quantities.
-
representations.quantity.quantity_sum(args)[source]¶
Reduce on the "+" operator of quantities.
Return a quantity with the minimum left value and the sum of the right values.
Example
>>> quantity_sum([Quantity(0,1), Quantity(1,1), Quantity(0,0)])
Quantity(0,2)
Parameters: args – an iterable of quantities.
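A minimal sketch of how such interval quantifiers and the two reductions could look; the real Quantity is a flyweight with regex machinery, so this is illustrative only.
from functools import reduce

class QuantitySketch:
    def __init__(self, mini, maxi):
        self.mini, self.maxi = mini, maxi
    def __and__(self, other):  # "&": minimum left value, maximum right value
        return QuantitySketch(min(self.mini, other.mini), max(self.maxi, other.maxi))
    def __add__(self, other):  # "+": minimum left value, sum of right values
        return QuantitySketch(min(self.mini, other.mini), self.maxi + other.maxi)
    def __repr__(self):
        return "Quantity({},{})".format(self.mini, self.maxi)

def quantity_largest_sketch(args):
    return reduce(lambda a, b: a & b, args)

def quantity_sum_sketch(args):
    return reduce(lambda a, b: a + b, args)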
representations.segments module¶
author: Sacha Beniamine.
This module addresses the modelling of phonological segments.
-
class representations.segments.Segment(classes, features, alias, chars, shorthand=None)[source]¶
Bases: object
The Segments.Segment class holds the definition of a single segment.
This is a lightweight class.
Variables:
- name (str or _CharClass) – Name of the segment.
- features (frozenset of tuples) – The tuples are of the form (attribute, value) with a positive value, used for set operations.
-
classmethod get_from_transform(a, transform)[source]¶
Get a segment from another according to a transformation tuple.
In the following example, the segments have been initialized with French segment definitions.
Example
>>> Segment.get_from_transform("d", ("bdpt", "fsvz"))
'z'
-
classmethod get_transform_features(left, right)[source]¶
Get the features corresponding to a transformation.
Example
>>> Segment.get_transform_features("bd", "pt")
{'+vois'}, {'-vois'}
-
classmethod init_dissimilarity_matrix(gap_prop=0.24, **kwargs)[source]¶
Compute the score matrix with dissimilarity scores.
-
classmethod intersect(*args)[source]¶
Intersect some segments given their names/aliases. This is the "meet" operation on the lattice nodes; it returns the lowest common ancestor.
Returns: a str or _CharClass representing the segment whose classes are the intersection of the inputs.
-
classmethod show_pool(only_single=False)[source]¶
Return a string description of the whole segment pool.
-
similarity[source]¶
Compute phonological similarity (Frisch, 2004).
The function is memoized. The measure is from "Similarity avoidance and the OCP", Frisch, S. A.; Pierrehumbert, J. B. & Broe, M. B., Natural Language & Linguistic Theory, Springer, 2004, 22, 179-228, p. 198:
"We compute similarity by comparing the number of shared and unshared natural classes of two consonants, using the equation in (7). This equation is a direct extension of the Pierrehumbert (1993) feature similarity metric to the case of natural classes."

\[\text{Similarity} = \frac{\text{shared natural classes}}{\text{shared natural classes} + \text{non-shared natural classes}}\]
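A minimal sketch of the equation, assuming each segment is represented by the set of natural classes it belongs to; this hypothetical helper is not Qumin's memoized method.
def frisch_similarity_sketch(classes_a, classes_b):
    # classes_a, classes_b: sets of natural classes (e.g. frozensets of features)
    shared = len(classes_a & classes_b)
    non_shared = len(classes_a ^ classes_b)  # symmetric difference
    return shared / (shared + non_shared)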
-
classmethod transformation(a, b)[source]¶
Find a transformation between aliases a and b.
The transformation is a pair of two maximal sets of segments related by a bijective phonological function.
This function takes a pair of strings representing segments. It calculates the function which relates these two segments, then finds the two maximal sets of segments related by this function.
Example
In French, t -> s can be expressed by a phonological function which changes [-cont] and [-rel.ret] to [+cont] and [+rel.ret]. These other segments are related by the same change: d -> z, b -> v, p -> f.
>>> a, b = Segment.transformation("t","s")
>>> print(a, b)
[bdpt] [fsvz]
Parameters: a, b (str) – Segment aliases.
Returns: two charclasses.
-
representations.segments.make_aliases(ipa)[source]¶
Associate a single symbol with each segment written with two characters. Return a restoration map.
This function takes a segments table and changes the entries of the "Segs." column, using a unique character for each multi-character cell. A dict is returned that allows the original segment names to be restored.

Input Segs.   Output Segs.
ɑ̃             â
a             a

The table can have an optional UNICODE column; it will be dropped at the end of the process.
Parameters: ipa (pandas.DataFrame) – Dataframe of segments. Columns are features and indexes are segments. A UNICODE column can specify alternative characters.
Returns: a map from the simplified names to the original segment names.
Return type: alias_map (dict)
-
representations.segments.normalize(ipa, features)[source]¶
Assign a normalized segment to each group of segments with identical rows.
This function takes a segments table and adds a "Normalized" column in place. This column contains a common value for all segments with identical boolean feature values. The function also returns a translation table mapping indexes to normalized segments.
Note: the indexes are expected to be one character long.

Index   ..features..   Normalized
ɛ       […]            E
e       […]            E

Parameters:
- ipa (pandas.DataFrame) – Dataframe of segments. Columns are features, the UNICODE code point representation, and segment names; indexes are segments.
- features (list) – names of the feature columns.
Returns: a translation table from each segment's name to its normalized name.
Return type: norm_map (dict)
representations.utils module¶
author: Sacha Beniamine.
Utility functions for representations.
-
representations.utils.create_features(data_file_name)[source]¶
Read a features file and preprocess it to be coindexed with the paradigms.
-
representations.utils.create_paradigms(data_file_name, cols=None, verbose=False, fillna=True, segcheck=False, merge_duplicates=False, defective=False, overabundant=False, merge_cols=False)[source]¶
Read paradigm data and prepare it according to a Segment class pool. All characters occurring in the paradigms (except in the first column) should be inventoried in this class pool.
Parameters:
- data_file_name (str) – path to the paradigm csv file.
- cols (list of str) – a subset of columns to use from the paradigm file.
- verbose (bool) – verbosity switch.
- merge_duplicates (bool) – should identical columns be merged?
- fillna (bool) – defaults to True. Should #DEF# be replaced by np.NaN? Otherwise it is replaced with empty strings ("").
- segcheck (bool) – defaults to False. Should we check that all the phonological segments in the table are defined in the segments table?
- defective (bool) – defaults to False. Should rows with defective forms be kept?
- overabundant (bool) – defaults to False. Should rows with overabundant forms be kept?
- merge_cols (bool) – defaults to False. Should identical (fully syncretic) columns be merged?
Returns: the paradigms table (columns are cells, index are lemmas).
Return type: paradigms (pandas.DataFrame)
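For illustration, a hedged usage sketch of the documented options (the file name is a placeholder):
from representations.utils import create_paradigms

paradigms = create_paradigms(
    "paradigms.csv",      # hypothetical path to a paradigm csv file
    segcheck=True,        # check segments against the segments table
    merge_cols=True,      # merge fully syncretic columns
    defective=False,      # drop rows with defective forms
    overabundant=False,   # drop rows with overabundant forms
)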
-
representations.utils.normalize_dataframe(paradigms, aliases, normalization, verbose=False)[source]¶
Normalize and simplify a dataframe.
- Aliases: each sequence of n characters representing a segment is replaced with a unique character representing that segment.
- Normalization: all groups of characters representing the same feature set are translated to one unique character.
Note: a .translate strategy works for normalization but not for aliases, since it only maps single characters to single characters. The order of operations matters, since .translate assumes a 1:1 mapping of characters.
Parameters:
- paradigms (pandas.DataFrame) – paradigms table (columns are cells, index are lemmas).
- aliases (dict) – dictionary from segments (as found in the paradigms) to their aliased versions (one character long).
- normalization (dict) – dictionary from one aliased character to another, used to replace segments which have the same feature set.
- verbose (bool) – verbosity switch.
Returns: the same dataframe, normalized and simplified.
Return type: new_df (pandas.DataFrame)
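A minimal sketch of the two steps on a single column, with hypothetical alias and normalization maps; note how multi-character aliases need str.replace, while the 1:1 normalization can use str.translate, in that order:
import pandas as pd

aliases = {"ɑ̃": "â"}                      # hypothetical: multi-char segment -> one char
normalization = str.maketrans("ɛ", "E")   # hypothetical: 1:1 character mapping

def normalize_column_sketch(col):
    for segment, alias in aliases.items():
        col = col.str.replace(segment, alias, regex=False)  # aliases first
    return col.str.translate(normalization)                 # then normalize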
Module contents¶
utils package¶
Module contents¶
-
utils.get_repository_version()[source]¶
Return an ID for the current git or svn revision.
If the directory isn't under git or svn, the function returns an empty string.
Returns: (str) svn/git version or ''.
-
utils.merge_duplicate_columns(df, sep=';', keep_names=True)[source]¶
Merge duplicate columns and return a new DataFrame.
Parameters:
- df (pandas.DataFrame) – A dataframe.
- sep (str) – separator to use when joining column names.
- keep_names (bool) – whether to keep the names of the original duplicated columns by merging them onto the columns we keep.
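A minimal sketch of the merging logic, assuming "duplicate" means identical column values; Qumin's version may differ in detail.
import pandas as pd

def merge_duplicate_columns_sketch(df, sep=";", keep_names=True):
    groups = {}  # column values -> names of identical columns
    for col in df.columns:
        groups.setdefault(tuple(df[col]), []).append(col)
    # Keep one column per group; optionally join the merged names with sep.
    return pd.DataFrame({
        (sep.join(names) if keep_names else names[0]): df[names[0]]
        for names in groups.values()
    })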