Qumin: Quantitative modelling of inflection

Qumin (QUantitative Modelling of INflection) is a collection of scripts for the computational modelling of the inflectional morphology of languages. It was developed by me (Sacha Beniamine) for my PhD, which was supervised by Olivier Bonami.

The documentation has moved to ReadTheDocs at: https://qumin.readthedocs.io/

For more detail, you can refer to my dissertation (in French):

Sacha Beniamine. Classifications flexionnelles. Étude quantitative des structures de paradigmes. PhD thesis in Linguistics, Université Sorbonne Paris Cité - Université Paris Diderot (Paris 7), 2018. In French.

Quick Start

Install

First, open the terminal and navigate to the folder where you want the Qumin code. Clone the repository from github:

git clone https://github.com/XachaB/Qumin.git

Make sure to have all the python dependencies installed. The dependencies are listed in environment.yml. A simple solution is to use conda and create a new environment from the environment.yml file:

conda env create -f environment.yml

There is now a new conda environment named Qumin. It needs to be activated before using any Qumin script:

conda activate Qumin

Data

The scripts expect full paradigm data in phonemic transcription, as well as a feature key for the transcription.

To provide a data sample in the correct format, Qumin includes a subset of the French Flexique lexicon, distributed under a Creative Commons Attribution-NonCommercial-ShareAlike license.

For Russian nouns, see the Inflected lexicon of Russian Nouns in IPA notation.

Scripts

Patterns

Alternation patterns serve as a basis for all the other scripts. The algorithm to find the patterns was presented in: Sacha Beniamine. Un algorithme universel pour l’abstraction automatique d’alternances morphophonologiques. 24e Conférence sur le Traitement Automatique des Langues Naturelles (TALN), Jun 2017, Orléans, France.

Computing automatically aligned patterns for paradigm entropy or macroclass:

bin/$ python3 find_patterns.py <paradigm.csv> <segments.csv>

Computing automatically aligned patterns for lattices:

bin/$ python3 find_patterns.py -d -o <paradigm.csv> <segments.csv>

Microclasses

To visualize the microclasses and their similarities, you can use the new script microclass_heatmap.py:

Computing a microclass heatmap:

bin/$ python3 microclass_heatmap.py <paradigm.csv> <output_path>

Computing a microclass heatmap, comparing with class labels:

bin/$ python3 microclass_heatmap.py -l <labels.csv> -- <paradigm.csv> <output_path>

The labels file is a csv file. The first column gives lexeme names, and the second column provides inflection class labels. This makes it possible to visually compare a manual classification with pattern-based similarity. This script relies heavily on seaborn’s clustermap function.
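For instance, a minimal labels file could look like this (the lexemes come from the Flexique sample above; the class labels are invented for illustration):

lexeme,class
peler,first_group
soudoyer,first_group
souscrire,third_group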

Paradigm entropy

This script was used in:

Computing entropies from one cell

bin/$ python3 calc_paradigm_entropy.py -n 1 -- <patterns.csv> <paradigm.csv> <segments.csv>

Computing entropies from two cells (you can specify any number of predictors, e.g. -n 1 2 3 works too)

bin/$ python3 calc_paradigm_entropy.py -n 2 -- <patterns.csv> <paradigm.csv> <segments.csv>

Add a file with features to help prediction (for example gender – features will be added to the known information when predicting)

bin/$ python3 calc_paradigm_entropy.py -n 2 --features <features.csv> -- <patterns.csv> <paradigm.csv> <segments.csv>

Macroclass inference

Our work on automatic inference of macroclasses was published in Beniamine, Sacha, Olivier Bonami, and Benoît Sagot. “Inferring Inflection Classes with Description Length.” Journal of Language Modelling (2018).

Inferring macroclasses

bin/$ python3 find_macroclasses.py  <patterns.csv> <segments.csv>

Lattices

This script was used in:

Inferring a lattice of inflection classes, with html output

bin/$ python3 make_lattice.py --html <patterns.csv> <segments.csv>

Documentation index

The morphological paradigms file

This file relates phonological forms to their lexemes and paradigm cells. As an example of valid data, Qumin is shipped with a paradigm table from the French inflectional lexicon Flexique. Here is a sample of the first few columns for 10 randomly picked verbs from Flexique:

lexeme variants prs.1sg prs.2sg prs.3sg prs.1pl prs.2pl prs.3pl ipfv.1sg ipfv.2sg ipfv.3sg
peler peler pɛl pɛl pɛl pəlɔ̃ pəle pɛl pəlE pəlE pəlE
soudoyer soudoyer sudwa sudwa sudwa sudwajɔ̃ sudwaje sudwa sudwajE sudwajE sudwajE
inféoder inféoder ɛ̃fEɔd ɛ̃fEɔd ɛ̃fEɔd ɛ̃fEOdɔ̃ ɛ̃fEOde ɛ̃fEɔd ɛ̃fEOdE ɛ̃fEOdE ɛ̃fEOdE
débiller débiller dEbij dEbij dEbij dEbijɔ̃ dEbije dEbij dEbijE dEbijE dEbijE
désigner désigner dEziɲ dEziɲ dEziɲ dEziɲɔ̃ dEziɲe dEziɲ dEziɲE dEziɲE dEziɲE
crachoter crachoter kʁaʃɔt kʁaʃɔt kʁaʃɔt kʁaʃOtɔ̃ kʁaʃOte kʁaʃɔt kʁaʃOtE kʁaʃOtE kʁaʃOtE
saouler saouler:soûler sul sul sul sulɔ̃ sule sul sulE sulE sulE
caserner caserner kazɛʁn kazɛʁn kazɛʁn kazɛʁnɔ̃ kazɛʁne kazɛʁn kazɛʁnE kazɛʁnE kazɛʁnE
parrainer parrainer paʁɛn paʁɛn paʁɛn paʁEnɔ̃ paʁEne paʁɛn paʁEnE paʁEnE paʁEnE
souscrire souscrire suskʁi suskʁi suskʁi suskʁivɔ̃ suskʁive suskʁiv suskʁivE suskʁivE suskʁivE

Paradigm files are written in wide format:

  • Each row represents a lexeme, and each column represents a cell.
  • The first column indicates a unique identifier for each lexeme. It is usually convenient to use orthographic citation forms for this purpose (e.g. the infinitive for verbs).
  • In Vlexique, there is a second column with orthographic variants for lexeme names, which is called “variants”. You do not need to add a “variants” column, and if it is there, it will be ignored.
  • The very first row indicates the names of the cells as column headers. Column headers shouldn’t contain the character “#”.

While Qumin assumes that inflected forms are written in some phonemic notation (we suggest staying as close to the IPA as possible), you do not need to explicitly segment them into phonemes in the paradigms file.

The file itself is a csv, meaning that the values are written as plain text, encoded in utf-8, and separated by commas. This format can be read by spreadsheet programs as well as programmatically:

%%sh
head -n 3 "../Data/Vlexique/vlexique-20171031.csv"
lexeme,variants,prs.1sg,prs.2sg,prs.3sg,prs.1pl,prs.2pl,prs.3pl,ipfv.1sg,ipfv.2sg,ipfv.3sg,ipfv.1pl,ipfv.2pl,ipfv.3pl,fut.1sg,fut.2sg,fut.3sg,fut.1pl,fut.2pl,fut.3pl,cond.1sg,cond.2sg,cond.3sg,cond.1pl,cond.2pl,cond.3pl,sbjv.1sg,sbjv.2sg,sbjv.3sg,sbjv.1pl,sbjv.2pl,sbjv.3pl,pst.1sg,pst.2sg,pst.3sg,pst.1pl,pst.2pl,pst.3pl,pst.sbjv.1sg,pst.sbjv.2sg,pst.sbjv.3sg,pst.sbjv.1pl,pst.sbjv.2pl,pst.sbjv.3pl,imp.2sg,imp.1pl,imp.2pl,inf,prs.ptcp,pst.ptcp.m.sg,pst.ptcp.m.pl,pst.ptcp.f.sg,pst.ptcp.f.pl
accroire,accroire,#DEF#,#DEF#,#DEF#,#DEF#,#DEF#,#DEF#,#DEF#,#DEF#,#DEF#,#DEF#,#DEF#,#DEF#,#DEF#,#DEF#,#DEF#,#DEF#,#DEF#,#DEF#,#DEF#,#DEF#,#DEF#,#DEF#,#DEF#,#DEF#,#DEF#,#DEF#,#DEF#,#DEF#,#DEF#,#DEF#,#DEF#,#DEF#,#DEF#,#DEF#,#DEF#,#DEF#,#DEF#,#DEF#,#DEF#,#DEF#,#DEF#,#DEF#,#DEF#,#DEF#,#DEF#,akʁwaʁ,#DEF#,#DEF#,#DEF#,#DEF#,#DEF#
advenir,advenir,#DEF#,#DEF#,advjɛ̃,#DEF#,#DEF#,advjɛn,#DEF#,#DEF#,advənE,#DEF#,#DEF#,advənE,#DEF#,#DEF#,advjɛ̃dʁa,#DEF#,#DEF#,advjɛ̃dʁɔ̃,#DEF#,#DEF#,advjɛ̃dʁE,#DEF#,#DEF#,advjɛ̃dʁE,#DEF#,#DEF#,advjɛn,#DEF#,#DEF#,advjɛn,#DEF#,#DEF#,advɛ̃,#DEF#,#DEF#,advɛ̃ʁ,#DEF#,#DEF#,advɛ̃,#DEF#,#DEF#,advɛ̃s,#DEF#,#DEF#,#DEF#,advəniʁ,advənɑ̃,advəny,advəny,advəny,advəny
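
The same file can also be loaded programmatically, for example with pandas (a sketch; the path is the sample file shipped with Qumin, and any CSV reader would do):

import pandas as pd

# Read the paradigms file, using the lexeme column as index.
paradigms = pd.read_csv("../Data/Vlexique/vlexique-20171031.csv", index_col=0)
print(paradigms.loc["advenir", "prs.3sg"])  # advjɛ̃
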
Overabundance

Inflectional paradigms sometimes contain overabundant forms, where the same lexeme and paradigm cell can be realized in various ways, as in “dreamed” vs “dreamt” for the English past of “to dream”. Concurrent forms can be written in the same cell, separated by “;”. Only some scripts can make use of this information; the other scripts will use the first value only. Here is an example for English verbs:

lexeme ppart pres3s prespart inf pres1s presothers past13 pastnot13
bind baˑɪnd;baˑʊnd baˑɪndz baˑɪndɪŋ baˑɪnd baˑɪnd baˑɪnd baˑɪnd;baˑʊnd baˑɪnd;baˑʊnd
wind(air) waˑʊnd;waˑɪndɪd waˑɪndz waˑɪndɪŋ waˑɪnd waˑɪnd waˑɪnd waˑʊnd;waˑɪndɪd waˑʊnd;waˑɪndɪd
weave wəˑʊvn̩;wiːvd wiːvz wiːvɪŋ wiːv wiːv wiːv wəˑʊv;wiːvd wəˑʊv;wiːvd
slink slʌŋk;slæŋk;slɪŋkt slɪŋks slɪŋkɪŋ slɪŋk slɪŋk slɪŋk slʌŋk;slæŋk;slɪŋkt slʌŋk;slæŋk;slɪŋkt
dream driːmd driːmz driːmɪŋ driːm driːm driːm driːmd;drɛmt driːmd;drɛmt
Defectivity

Conversely, some lexemes might be defective for some cells, and have no values whatsoever for these cells. The most explicit way to indicate these missing values is to write “#DEF#” in the cell. The cell can also be left empty. Note that some scripts ignore all lines with defective values.

Here are some examples from French verbs:

lexeme prs.1sg prs.2sg prs.3sg prs.1pl prs.2pl prs.3pl ipfv.1sg ipfv.2sg
accroire #DEF# #DEF# #DEF# #DEF# #DEF# #DEF# #DEF# #DEF#
advenir #DEF# #DEF# advjɛ̃ #DEF# #DEF# advjɛn #DEF# #DEF#
ardre #DEF# #DEF# #DEF# #DEF# #DEF# #DEF# aʁdE aʁdE
braire #DEF# #DEF# bʁE #DEF# #DEF# bʁE #DEF# #DEF#
chaloir #DEF# #DEF# ʃo #DEF# #DEF# #DEF# #DEF# #DEF#
comparoir #DEF# #DEF# #DEF# #DEF# #DEF# #DEF# #DEF# #DEF#
discontinuer #DEF# #DEF# #DEF# #DEF# #DEF# #DEF# #DEF# #DEF#
douer #DEF# #DEF# #DEF# #DEF# #DEF# #DEF# #DEF# #DEF#
échoir #DEF# #DEF# eʃwa #DEF# #DEF# #DEF# #DEF# #DEF#
endêver #DEF# #DEF# #DEF# #DEF# #DEF# #DEF# #DEF# #DEF#

The phonological segments file

Qumin works from the assumption that your paradigms are written in phonemic notation. The phonological segments file provides a list of phonemes and their decomposition into distinctive features. This file is first used to segment the paradigms into sequences of phonemes (rather than sequences of characters). Then, the distinctive features are used to recognize phonological similarity and natural classes when creating and handling alternation patterns.

To create a new segments file, it is usually best to refer to an authoritative description and adapt it to the needs of the specific dataset. In the absence of such a description, I suggest using Bruce Hayes’ spreadsheet as a starting point (he writes +, - and 0 where Qumin uses 1, 0 and -1).

Format

The segments file is also written in wide format, with each row describing a phoneme. The first column gives phonemes as they are written in the paradigms file. Each column represents a distinctive feature. Here is an example with just 10 rows of the segments table for French verbs:

Seg. sonant syllabique consonantique continu nasal haut bas arrière arrondi antérieur CORONAL voisé rel.ret.
p 0 0 1 0 0 0   0   1   0 0
b 0 0 1 0 0 0   0   1   1 0
t 0 0 1 0 0 0   0   1 1 0 0
s 0 0 1 1 0 0   0   1 1 0 1
i 1 1 0 1 0 1 0 0 0     1 1
y 1 1 0 1 0 1 0 0 1     1 1
u 1 1 0 1 0 1 0 1 1     1 1
o 1 1 0 1 0 0   1 1     1 1
a 1 1 0 1 0 0 1 1 0     1 1
ɑ̃ 1 1 0 1 1 0 1 1 0     1 1

Some conventions:

  • The first column must be called Seg..
  • The phonological symbols in the Seg. column cannot be one of the reserved characters: . ^ $ * + ? { } [ ] / | ( ) < > _ , ;.
  • If the file contains a “value” column, it will be ignored. This is used to provide a human-readable description of segments, which can be useful when preparing the data.
  • In order to provide short names for the features, as in [+nas] rather than [+nasal], you can add a second level of header, also beginning with Seg., which gives abbreviated names:
Seg. sonant syllabique consonantique continu nasal haut bas arrière arrondi antérieur CORONAL voisé rel.ret.
Seg. son syl cons cont nas haut bas arr rond ant COR vois rel.ret.
p 0 0 1 0 0 0   0   1   0 0
b 0 0 1 0 0 0   0   1   1 0

The file is encoded in utf-8 and can be either a csv table (preferred) or a tab-separated table (tsv).

%%sh
head -n 6 "../Data/Vlexique/frenchipa.csv"
Seg.,sonant,syllabique,consonantique,continu,nasal,haut,bas,arrière,arrondi,antérieur,CORONAL,voisé,rel.ret.
Seg.,son,syl,cons,cont,nas,haut,bas,arr,rond,ant,COR,vois,rel.ret.
p,0,0,1,0,0,0,,0,,1,,0,0
b,0,0,1,0,0,0,,0,,1,,1,0
t,0,0,1,0,0,0,,0,,1,1,0,0
d,0,0,1,0,0,0,,0,,1,1,1,0
Segmentation and aliases

Since the forms in the paradigms are not segmented into phonemes, the phonological segments file is used to segment them.

It is possible to specify phonemes which are more than one character long, for example when using combining characters, or for diphthongs and affricates. Be careful to use the same notation as in your paradigms. For example, you cannot use “a” + combining tilde in one file and the precomposed “ã” in the other, as the program would not recognize them as the same thing. You should also be certain that there is no segmentation ambiguity. If you have sequences such as “ABC” which should be segmented “AB.C” in some contexts and “A.BC” in some other contexts, you need to change the notation in the paradigms file so that it is not ambiguous, for example by writing “A͡BC” in the first case and “AB͡C” in the second case. You would then have separate rows for “A”, “A͡B”, “C” and “B͡C” in the segments file.
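
To illustrate the general idea (a hypothetical sketch, not Qumin’s actual implementation, which relies on the one-character aliases described below), multi-character phonemes can be matched greedily, longest first:

def segment(form, phonemes):
    # Greedy longest-match segmentation: try longer phonemes first.
    by_length = sorted(phonemes, key=len, reverse=True)
    result = []
    while form:
        match = next((p for p in by_length if form.startswith(p)), None)
        if match is None:
            raise ValueError("Cannot segment %r" % form)
        result.append(match)
        form = form[len(match):]
    return result

print(segment("at͡ɬa", ["a", "t", "t͡ɬ"]))  # ['a', 't͡ɬ', 'a']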

Internally, the program will use arbitrary aliases which are 1 character long to replace longer phonemes – this substitution will be reversed in the output. While this usually works without your intervention, you can provide your own aliases if you want to preserve some readability in debug logs. This is done by adding a column “ALIAS” right after the first column, which holds 1-char aliases. This example shows a few rows from the segments file for Navajo:

Seg. ALIAS syllabic htone long consonantal sonorant continuant delayed release
ɣ   0   0 1 0 1 1
k   0   0 1 0 0 0
k’ 0   0 1 0 0 0
k͡x K 0   0 1 0 0 1
t   0   0 1 0 0 0
ť   0   0 1 0 0 0
t͡ɬ L 0   0 1 0 0 1
t͡ɬ’ Ľ 0   0 1 0 0 1
t͡ɬʰ 0   0 1 0 0 1
ʦ   0   0 1 0 0 1
ʦ’ Ś 0   0 1 0 0 1
ʦʰ 0   0 1 0 0 1
ʧ H 0   0 1 0 0 1
ʧ’ 0   0 1 0 0 1
ʧʰ 0   0 1 0 0 1
t͡x T 0   0 1 0 0 1

If you have many multi-character phonemes, you may get the following error:

ValueError: ('I can not guess a good one-char alias for ã, please use an ALIAS column to provide one.',
            'occurred at index 41')

The solution is to add an alias for this character, and maybe a few others. To find aliases which vaguely resemble the proper symbols, this table of unicode characters organized by letter is often useful.

Shorthands

When writing phonological rules, linguists often use shorthands like “V” for the natural class of all vowels, and “C” for the natural class of all consonants. If you want, you can provide some extra rows in the table to define shorthand names for some natural classes. These names have to start and end with “#”. Here is an example for the French segments file, giving shorthands for C (consonants), V (vowels) and G (glides):

Seg. sonant syllabique consonantique continu nasal haut bas arrière arrondi antérieur CORONAL voisé rel.ret.
Seg. son syl cons cont nas haut bas arr rond ant COR vois rel.ret.
#C#   0 1                    
#V# 1 1 0 1               1 1
#G# 1 0 0 1 0 1 0     0   1 1
Values of distinctive features

Distinctive features are usually considered to be bivalent: they can be either positive ([+nasal]) or negative ([-nasal]). In the segments file, positive values are written as the number 1, and negative values as the number 0. Some features do not apply at all to some phonemes; for example, consonants are neither [+round] nor [-round]. This can be written either as -1, or by leaving the cell empty. While the former is more explicit, leaving the cell empty makes the tables more readable at a glance. The same strategy is used for features which are privative, as for example [CORONAL]: there is no class of segments which are [-coronal], so we write either 1 or -1 in the corresponding column, never 0.

While 1, 0 and -1 (or nothing) are the values that make the most sense, any numeric values are technically allowed, for example [-back], [+back] and [++back] could be expressed by writing 0, 1, and 2 in the “back” column. I do not recommend doing this.

When writing a segments file, it is important to be careful about the naturalness of the natural classes, as Qumin will take them at face value. For example, using the same [±high] feature for both vowels and consonants will result in a natural class of all the [+high] segments, and one for all the [-high] segments. Sometimes, it is better to duplicate some columns to avoid generating unfounded classes.

Monovalent or bivalent features

Frisch (1996) argues that monovalent features (using only -1 and 1) are to be preferred to bivalent features, as the latter implicitly generate natural classes for the complement features ([-coronal]), which is not always desirable. In Qumin, both monovalent and bivalent features are accepted. Internally, the program will expand all 1 and 0 into + and - values. As an example, take this table which classifies the three vowels /a/, /i/ and /u/:

Seg. high low front back round Non-round
Seg. high low front back round Non-round
a   1   1   1
i 1   1     1
u 1     1 1  

Internally, Qumin will construct the following table, which looks almost identical because we used monovalent features:

Seg. +high +low +front +back +round +Non-round
a   x   x   x
i x   x     x
u x     x x  

This will then result in the following natural class hierarchy:

Natural classes for three vowels

To visualize natural class hierarchies declared by segment files, you can use FeatureViz.

The same thing can be achieved with fewer columns using binary features:

Seg. high front round
Seg. high front round
a 0 0 0
i 1 1 0
u 1 0 1

Internally, these will be expanded to:

Seg. +high -high +front -front +round -round
a   x   x   x
i x   x     x
u x     x x  

This is the same as before, with different names. The class hierarchy is also very similar:

Natural classes for three vowels
Warning, some of the segments aren’t actual leaves

The following error occurs when the table is well formed, but specifies a natural class hierarchy which is not usable by Qumin:

Exception: Warning, some of the segments aren't actual leaves :
   p is the same node as [p-kʷ]
       [p-kʷ] = [+cons -son -syll +lab -round -voice -cg -cont -strid -lat -del.rel -nas -long]
        (kʷ) = [+cons -son -syll +lab -round +dor +highC -lowC +back -tense -voice -cg -cont -strid -lat -del.rel -nas -long]
   k is the same node as [k-kʷ]
       [k-kʷ] = [+cons -son -syll +dor +highC -lowC +back -tense -voice -cg -cont -strid -lat -del.rel -nas -long]
        (kʷ) = [+cons -son -syll +lab -round +dor +highC -lowC +back -tense -voice -cg -cont -strid -lat -del.rel -nas -long]

What happened here is that the natural class [p-kʷ] has the exact same definition as just /p/. Similarly, the natural class [k-kʷ] has the same definition as /k/. The result is the following structure, in which /p/ and /k/ are superclasses of /kʷ/:

erroneous structure

In this structure, it is impossible to distinguish the natural classes [p-kʷ] and [k-kʷ] from the respective phonemes /p/ and /k/. Instead, we want them to be one level lower. If we ignore the bottom node, this means that they should be leaves of the hierarchy.

The solution is to ensure that both /p/ and /k/ have at least one feature divergent from [kʷ]. Usually, kʷ is marked as [+round], but in the above it is mistakenly written [-round]. Correcting this definition yields the following structure, and solves the error:

corrected structure
Neutralizations

While having a segment be higher than another in the hierarchy is forbidden, it is possible to declare two segments with the exact same features. This is useful if you want to neutralize some oppositions, and ignore some details in the data.

For example, this set of French vowels displays height oppositions using the [±low] feature:

Seg. sonant syllabique consonantique continu nasal haut bas arrière arrondi antérieur coronal voisé rel.ret.
Seg. son syl cons cont nas haut bas arr rond ant cor vois rel.ret.
e 1 1 0 1 0 0 0 0 0 -1 -1 1 1
ɛ 1 1 0 1 0 0 1 0 0 -1 -1 1 1
ø 1 1 0 1 0 0 0 0 1 -1 -1 1 1
œ 1 1 0 1 0 0 1 0 1 -1 -1 1 1
o 1 1 0 1 0 0 0 1 1 -1 -1 1 1
ɔ 1 1 0 1 0 0 1 1 1 -1 -1 1 1

Leading to this complex hierarchy:

French vowel classes, without neutralizations

Due to regional variations, the French Vlexique sometimes neutralizes these oppositions, and writes E, Ø and O to underspecify the value of the vowels. The solution is to neutralize entirely the [±low] distinction for these vowels, writing identical rows for E, e, ɛ, etc.:

Seg. sonant syllabique consonantique continu nasal haut bas arrière arrondi antérieur coronal voisé rel.ret.
Seg. son syl cons cont nas haut bas arr rond ant cor vois rel.ret.
E 1 1 0 1 0 0 -1 0 0 -1 -1 1 1
e 1 1 0 1 0 0 -1 0 0 -1 -1 1 1
ɛ 1 1 0 1 0 0 -1 0 0 -1 -1 1 1
Ø 1 1 0 1 0 0 -1 0 1 -1 -1 1 1
ø 1 1 0 1 0 0 -1 0 1 -1 -1 1 1
œ 1 1 0 1 0 0 -1 0 1 -1 -1 1 1
O 1 1 0 1 0 0 -1 1 1 -1 -1 1 1
o 1 1 0 1 0 0 -1 1 1 -1 -1 1 1
ɔ 1 1 0 1 0 0 -1 1 1 -1 -1 1 1

Internally, Qumin will replace all of these identical characters by a single unified one (the first in the file). The simplified structure becomes:

French vowel classes, with neutralizations
Creating scales

Rather than using many-valued features, it is often preferable to use a few monovalent or bivalent features to create a scale. As an example, here is a possible (bad) implementation for tones, which uses a single feature “Tone”.

Seg. Tone
Seg. Tone
˥ 3
˦ 2
˧ 1
˨ 0

It results in this natural class hierarchy:

four tone coded on a single feature

While such a file is allowed, it results in the tones having nothing in common. If some morpho-phonological alternation selects both high and mid tones, we will miss that generalization.

To express a scale, a simple solution is to create one fewer feature than there are segments (here, four tones lead to three scale features), then fill the upper triangle with 1 and the lower triangle with 0 (or the opposite). For example:

Seg. scale1 scale2 scale3
Seg. scale1 scale2 scale3
˥ 1 1 1
˦ 0 1 1
˧ 0 0 1
˨ 0 0 0

It will result in the natural classes below:

tone scale
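
A small helper can generate such scale columns mechanically (a hypothetical sketch, not part of Qumin):

def scale_features(segments):
    # One fewer feature than segments: upper triangle gets 1, lower gets 0.
    n = len(segments)
    return {seg: [int(i <= j) for j in range(n - 1)]
            for i, seg in enumerate(segments)}

print(scale_features(["˥", "˦", "˧", "˨"]))
# {'˥': [1, 1, 1], '˦': [0, 1, 1], '˧': [0, 0, 1], '˨': [0, 0, 0]}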

Since the scale1/scale2/scale3 naming is not very readable, we can express the same scale more transparently using a combination of binary and monovalent features:

Seg. Top High Low Bottom
Seg. Top High Low Bottom
˥ 1 1   0
˦ 0 1   0
˧ 0   1 0
˨ 0   1 1

Which leads to the same structure:

tone scale (more readable)

When implementing tones, I recommend marking them all as [-segmental] to ensure that they share a common class, and marking all other segments as [+segmental].

Diphthongs

Diphthongs are not usually decomposed using distinctive features, as they are complex sequences (see this question on the Linguist List). However, if diphthongs alternate with simple vowels in your data, adding diphthongs in the list of phonological segments can allow Qumin to capture better generalizations. The strategy I have employed so far is the following:

  • Write diphthongs in a non-ambiguous way in the data (either ‘aj’ or ‘aˑi’, but not ‘ai’ when the same sequence can sometimes be two vowels)
  • Copy the features from the initial vowel
  • Add a monovalent feature [DIPHTHONG]
  • Add monovalent features [DIPHTHONG_J], [DIPHTHONG_W], etc, as needed.

This is a small example for a few English diphthongs:

Seg. high low back LABIAL tense diphthong j diphthong ə diphthong w diphthong
Seg. high low back LAB tens diph.j diph.ə diph.w diph
a 0 1 0   1       0
aˑʊ 0 1 1   1     1 1
aˑɪ 0 1 1   1 1     1
ɪ 1 0 0   0       0
ɪˑə 1 0 0   0   1   1

Which leads to the following classes:

Small sample from English diphthongs
Others
  • Stress: I recommend marking it directly on vowels, duplicating the vowel inventory to have both stressed and unstressed counterparts. A simple binary [±stress] feature is enough to distinguish them.
  • Length: Similarly, I recommend marking length, when possible, as a feature on vowels, rather than duplicating them.

Usages

Usage of bin/find_patterns.py

Find pairwise alternation patterns from paradigms. This is a preliminary step necessary to obtain patterns used as input in the three scripts below.

Computing automatically aligned patterns for paradigm entropy or macroclass:

bin/$ python3 find_patterns.py <paradigm.csv> <segments.csv>

Computing automatically aligned patterns for lattices:

bin/$ python3 find_patterns.py -d -o -c <paradigm.csv> <segments.csv>

The option -k allows one to choose the algorithm for inferring alternation patterns.

Option Description Strategy
endings Affixes Removes the longest common initial string for each row.
endingsPairs Pairs of affixes Endings, tabulated as pairs for all combinations of columns.
endingsDisc Discontinuous endings Removes the longest common substring, left aligned
….Alt Alternations Alternations have no contexts. These were used for comparing macroclass strategies on French and European Portuguese.
globalAlt Alternations As endingsDisc, tabulated as pairs for all combinations of columns.
localAlt Alternations Inferred from local pairs of cells, left aligned.
patterns… Binary Patterns All patterns have alternations and generalized contexts. Various alignment strategies are offered for comparison. Arbitrary number of changes supported.
patternsLevenshtein Patterns Aligned with simple edit distance.
patternsPhonsim Patterns Aligned with edit distances based on phonological similarity.
patternsSuffix Patterns Fixed left alignment, only interesting for suffixal languages.
patternsPrefix Patterns Fixed right alignment, only interesting for prefixal languages.
patternsBaseline Patterns Baseline alignment, follows Albright & Hayes 2002. A single change, with a priority order: Suffixation > Prefixation > Stem-internal alternation (ablaut/infixation)

Most of these were implemented for comparison purposes. I recommend using the default patternsPhonsim in most cases. To avoid relying on your phonological feature file for alignment scores, use patternsLevenshtein. Only these two produce full patterns, with generalization in both the context and the alternation.

For lattices, we keep defective and overabundant entries. We do not usually keep them for other applications. The latest code for entropy can handle defective entries. The file you should use as input for the below scripts has a name that ends in “_patterns”. The “_human_readable_patterns” file is nicer to review but is only meant for human usage.

Usage of bin/calc_paradigm_entropy.py

Compute entropies of inflectional paradigms’ distributions.

Computing entropies from one cell

bin/$ python3 calc_paradigm_entropy.py -o <patterns.csv> <paradigm.csv> <segments.csv>

Computing entropies from one cell, with a split dataset

bin/$ python3 calc_paradigm_entropy.py -names <data1 name> <data2 name> -b <patterns1.csv> <paradigm1.csv> -o <patterns2.csv> <paradigm2.csv> <segments.csv>

Computing entropies from two cells

bin/$ python3 calc_paradigm_entropy.py -n 2 <patterns.csv> <paradigm.csv> <segments.csv>

More complete usage can be obtained by typing

bin/$ python3 calc_paradigm_entropy.py --help

With --nPreds and N>2, the computation can get quite long on large datasets.

Usage of bin/find_macroclasses.py

Cluster lexemes in macroclasses according to alternation patterns.

Inferring macroclasses

bin/$ python3 find_macroclasses.py  <patterns.csv> <segments.csv>

More complete usage can be obtained by typing

bin/$ python3 find_macroclasses.py --help

The options “-m UPGMA”, “-m CD” and “-m TD” are experimental and will not undergo further development; use at your own risk. The default is to use Description Length (DL) and a bottom-up algorithm (BU).

Usage of bin/make_lattice.py

Infer Inflection classes as a lattice from alternation patterns. This will produce a context and an interactive html file.

Inferring a lattice of inflection classes, with html output

bin/$ python3 make_lattice.py --html <patterns.csv> <segments.csv>

More complete usage can be obtained by typing

bin/$ python3 make_lattice.py --help

bin

clustering package
Submodules
clustering.algorithms module

Algorithms for inflection classes clustering.

Author: Sacha Beniamine

clustering.algorithms.bottom_up_clustering(patterns, microclasses, Clusters, **kwargs)[source]

Cluster microclasses in a bottom-up fashion.

The algorithm is the following:

Begin with one cluster per microclass.
While there is more than one cluster:
    Find the best possible merge of two clusters, among all possible pairs.
    Perform this merge

Scoring, finding the best merges, and merging nodes depend on the Clusters class.

Parameters:
  • patterns (pandas.DataFrame) – a dataframe of patterns.
  • microclasses (dict of str: list) – mapping of microclass exemplars to microclass inventories.
  • Clusters – a cluster class to use in clustering.
  • kwargs – any keyword arguments to pass to Clusters.
clustering.algorithms.choose(iterable)[source]

Choose a random element in an iterable of iterables.

clustering.algorithms.hierarchical_clustering(patterns, Clusters, clustering_algorithm=<function bottom_up_clustering>, **kwargs)[source]

Perform hierarchical clustering on patterns according to a clustering algorithm and a measure.

This function finds microclasses, performs the clustering, finds the macroclasses (and exports them), and returns the inflection class tree.

Scoring, finding the best merges, and merging nodes depend on the Clusters class.

Parameters:
  • patterns (pandas.DataFrame) – a dataframe of strings representing alternation patterns.
  • Clusters – a cluster class to use in clustering.
  • clustering_algorithm (func) – a clustering algorithm.
  • kwargs – any keyword arguments to pass to Clusters. Some keywords are mandatory: “prefix” should be the log file prefix, “patterns” should be a function for pattern finding.
clustering.algorithms.log_classes(classes, prefix, suffix)[source]
clustering.algorithms.top_down_clustering(patterns, microclasses, Clusters, **kwargs)[source]

Cluster microclasses in a top-down recursive fashion.

The algorithm is the following:

Begin with one unique cluster containing all microclasses, and one empty cluster.
While we are seeing an improvement:
    Find the best possible shift of a microclass from one cluster to another.
    Perform this shift.
Build a binary node with the two clusters.
Recursively apply the same algorithm to each.

The algorithm stops when it reaches leaves, or when no shift improves the score.

Scoring, finding the best shifts, and updating the nodes depend on the Clusters class.

Parameters:
  • patterns (pandas.DataFrame) – a dataframe of patterns.
  • microclasses (dict of str: list) – mapping of microclass exemplars to microclass inventories.
  • Clusters – a cluster class to use in clustering.
  • kwargs – any keyword arguments to pass to Clusters.
clustering.clusters module

Base classes to make clustering decisions and build inflection class trees.

Author: Sacha Beniamine

class clustering.clusters.BUComparisonClustersBuilder(*args, DecisionMaker=None, Annotator=None, **kwargs)[source]

Bases: clustering.clusters._BUClustersBuilder

Comparison between measures for bottom-up hierarchical clustering of inflection classes.

This class takes two _BUClustersBuilder classes, a DecisionMaker and an Annotator. The DecisionMaker is used to find the ordered merges. When merging, the merge is performed on both classes, and the Annotator’s values (description length or distances) are used to annotate the trees of the DecisionMaker.

Variables:
  • microclasses (dict of str: list) – Inherited. Mapping of microclass exemplars to microclass inventories.
  • nodes (dict of frozenset: Node) – Inherited. Maps frozensets of microclass exemplars to Nodes representing clusters.
  • preferences (dict) – Inherited. Configuration parameters.
  • DecisionMaker (clustering.clusters._BUClustersBuilder) – A class to use for finding ordered merges.
  • Annotator (clustering.clusters._BUClustersBuilder) – A class to use for annotating the DecisionMaker.
find_ordered_merges()[source]

Find the list of all best possible merges.

merge(a, b)[source]

Merge two clusters into one.

rootnode()[source]

Return the root of the Inflection Class tree, if it exists.

clustering.descriptionlength module

Classes to make clustering decisions and build inflection class trees according to description length.

Author: Sacha Beniamine

class clustering.descriptionlength.BUDLClustersBuilder(microclasses, paradigms, **kwargs)[source]

Bases: clustering.descriptionlength._DLClustersBuilder, clustering.clusters._BUClustersBuilder

Bottom-up builder for hierarchical clusters of inflection classes with description length based decisions.

This class holds two representations of the clusters it builds. On the one hand, the class Cluster represents the information needed to compute the description length of a cluster. On the other hand, the class Node represents the inflection classes being built. A Node can have children and a parent; a Cluster can be split or merged.

This class inherits attributes.

Variables:
  • microclasses (dict of str: list) – Inherited. Mapping of microclass exemplars to microclass inventories.
  • nodes (dict of frozenset: Node) – Inherited. Maps frozensets of microclass exemplars to Nodes representing clusters.
  • preferences (dict) – Inherited. Configuration parameters.
  • attr (str) – Inherited. (class attribute) Always has the value “DL”, as the nodes of the inflection class tree have a “DL” attribute.
  • DL (float) – Inherited. A description length DL, with DL(system) = DL(M) + DL(C) + DL(P) + DL(R).
  • M (float) – Inherited. DL(M), the cost in bits to express the mapping between lexemes and microclasses.
  • C (float) – Inherited. DL(C), the cost in bits to express the mapping between microclasses and clusters.
  • P (float) – Inherited. DL(P), the cost in bits to express the relation between clusters and patterns.
  • R (float) – Inherited. DL(R), the cost in bits to disambiguate which pattern to use in each cluster for each microclass.
  • clusters (dict of frozenset: Cluster) – Inherited. Clusters, indexed by a frozenset of microclass exemplars.
  • patterns (dict of str: Counter) – Inherited. Maps pairs of cells to a count of the number of clusters presenting each pattern for this pair of cells:

    { str: Counter({Pattern: int }) }
    pairs of cells -> pattern -> number of clusters with this pattern for this cell

    Note that the Counter’s length is written on a .length attribute, to avoid calling len() repeatedly. Remark that the count is not the same as in the class Cluster.

  • size (int) – Inherited. The size of the whole system in microclasses.
find_ordered_merges()[source]

Find the list of all best merges of two clusters.

The list is a list of tuples of length 3 containing two frozensets representing the labels of the clusters to merge and the description length of the resulting system.

merge(a, b)[source]

Merge two Clusters, build a Node to represent the result, update the DL.

Parameters:
  • a (str) – the label of a cluster to merge.
  • b (str) – the label of a cluster to merge.
class clustering.descriptionlength.Cluster(*args)[source]

Bases: object

A single cluster in MDL clustering.

A Cluster is iterable: iterating over a cluster is iterating over its patterns. Clusters can be merged or separated by adding or subtracting them.

Variables:
  • patterns (defaultdict) –

For each pair of cells in the paradigms under consideration, it holds a counter of the number of microclasses using each pattern in this cluster for this pair of cells:

    { str: Counter({Pattern: int }) }
    pairs of cells -> pattern -> number of microclasses using this pattern for this cell
    

    Note that the Counter’s length is written on a .length attribute, to avoid calling len() repeatedly.

  • labels (set) – the set of all exemplars representing the microclasses in this cluster.
  • size (int) – The size of this cluster. Depending on external parameters, this can be the number of microclasses or the number of lexemes belonging to the cluster.
  • totalsize (int) – The size of the whole system of clusters, either number of microclasses in the system, or number of lexemes in the system.
  • R – The cost in bits to disambiguate for each pair of cells which pattern is to be used with which microclass.
  • C – The contribution of this cluster to the cost of mapping from microclasses to clusters.
__init__(*args)[source]

Initialize single cluster.

Parameters:args (str) – Names (exemplar) of each microclass belonging to the cluster.
init_from_paradigm(class_size, paradigms, size)[source]

Populate fields according to a paradigm column.

This assumes an initialization with only one microclass.

Parameters:
  • class_size (int) – the size of the microclass
  • paradigms (pandas.DataFrame) – a dataframe of patterns.
  • size (int) – total size
class clustering.descriptionlength.TDDLClustersBuilder(microclasses, paradigms, **kwargs)[source]

Bases: clustering.descriptionlength._DLClustersBuilder

Top-down builder for hierarchical clusters of inflection classes with description length based decisions.

This class holds two representations of the clusters it builds. On the one hand, the class Cluster represents the information needed to compute the description length of a cluster. On the other hand, the class Node represents the inflection classes being built. A Node can have children and a parent; a Cluster can be split or merged.

This class inherits attributes.

Variables:
  • microclasses (dict of str: list) – Inherited. Mapping of microclass exemplars to microclass inventories.
  • nodes (dict of frozenset: Node) – Inherited. Maps frozensets of microclass exemplars to Nodes representing clusters.
  • preferences (dict) – Inherited. Configuration parameters.
  • attr (str) – Inherited. (class attribute) Always has the value “DL”, as the nodes of the inflection class tree have a “DL” attribute.
  • DL (float) – Inherited. A description length DL, with DL(system) = DL(M) + DL(C) + DL(P) + DL(R).
  • M (float) – Inherited. DL(M), the cost in bits to express the mapping between lexemes and microclasses.
  • C (float) – Inherited. DL(C), the cost in bits to express the mapping between microclasses and clusters.
  • P (float) – Inherited. DL(P), the cost in bits to express the relation between clusters and patterns.
  • R (float) – Inherited. DL(R), the cost in bits to disambiguate which pattern to use in each cluster for each microclass.
  • clusters (dict of frozenset: Cluster) – Inherited. Clusters, indexed by a frozenset of microclass exemplars.
  • patterns (dict of str: Counter) – Inherited. Maps pairs of cells to a count of the number of clusters presenting each pattern for this pair of cells:

    { str: Counter({Pattern: int }) }
    pairs of cells -> pattern -> number of clusters with this pattern for this cell

    Note that the Counter’s length is written on a .length attribute, to avoid calling len() repeatedly. Remark that the count is not the same as in the class Cluster.

  • size (int) – Inherited. The size of the whole system in microclasses.
  • minDL (float) – The minimum description length yet encountered.
  • history (dict of frozenset: tuple) – dict associating partitions with (M, C, P, R, DL) tuples.
  • left (Cluster) – left and right are temporary clusters used to divide a current cluster in two.
  • right (Cluster) – see left.
  • to_split (Node) – the node that we are currently trying to split.
__init__(microclasses, paradigms, **kwargs)[source]

Constructor.

Parameters:
  • microclasses (dict of str: list) – mapping of microclass exemplars to microclass inventories.
  • paradigms (pandas.DataFrame) – a dataframe of patterns.
  • kwargs – keyword arguments to be used as configuration.
find_ordered_shifts()[source]

Find the list of all best shifts of a microclass between right and left.

The list is a list of tuples of length 2 containing the label of a node to shift and the description length of the node to be split if we perform the shift.

initialize_clusters(paradigms)[source]

Initialize clusters with one cluster per microclass plus one for the whole.

Parameters:paradigms (pandas.DataFrame) – a dataframe of patterns.
initialize_nodes()[source]

Initialize nodes with only one root node whose children are all microclasses.

initialize_subpartition(node)[source]

Initialize left and right as a subpartition of a node we want to split.

Parameters:node (Node) – The node to be split.
shift(label)[source]

Shift one microclass from left to right or vice versa.

Parameters:label (str) – the label of the microclass to shift.
split_leaves()[source]

Split a cluster by replacing it with the two clusters left and right.

Recompute the description length when left and right are separated. Build two nodes corresponding to left and right, children of to_split.

clustering.descriptionlength.weighted_log(symbol_count, message_length)[source]

Compute \(-\log_{2}(\mathrm{symbol\_count}/\mathrm{message\_length}) \times \mathrm{message\_length}\).

This corresponds to the product inside the sum of the description length formula when probabilities are estimated on frequencies.

Parameters:
  • symbol_count (int) – a count of symbols.
  • message_length (int) – the size of the message.
Returns:

the weighted log

Return type:

(float)
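
As a sketch, the documented formula transcribes directly to Python (this illustrates the formula, not the library code itself):

from math import log2

def weighted_log(symbol_count, message_length):
    # -log2(symbol_count / message_length) * message_length:
    # the code length of one symbol, weighted by the message length.
    return -log2(symbol_count / message_length) * message_length

print(weighted_log(1, 2))  # 2.0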

clustering.distances module

Classes and functions to make clustering decisions and build inflection class trees according to distances.

Still experimental, and unfinished.

Author: Sacha Beniamine

class clustering.distances.CompressionDistClustersBuilder(*args, **kwargs)[source]

Bases: clustering.distances._DistanceClustersBuilder

Builder for bottom up hierarchical clusters of inflection classes with compression distance.

This class inherits attributes.

Variables:
  • microclasses (dict of str: list) – Inherited. Mapping of microclass exemplars to microclass inventories.
  • nodes (dict of frozenset: Node) – Inherited. Maps frozensets of microclass exemplars to Nodes representing clusters.
  • preferences (dict) – Inherited. Configuration parameters.
  • attr (str) – Inherited. Always has the value “DL”, as the nodes of the inflection class tree have a “DL” attribute.
  • paradigms (pandas.DataFrame) – Inherited. A dataframe of patterns.
  • distances (dict) – Inherited. The distance matrix between clusters.
  • DL_dict (dict of frozenset: float) – Maps each cluster to its description length.
  • min_DL (float) – The lowest description length for the whole system yet encountered.
merge(a, b)[source]

Merge two Clusters, build a new Node, update the distances, track system DL.

Parameters:
  • a (frozenset) – the label of a cluster to merge.
  • b (frozenset) – the label of a cluster to merge.
update_distances(new)[source]

Update for compression distances.

Parameters:new (frozenset) – Frozenset of microclass exemplar representing the new cluster.
clustering.distances.DL(messages)[source]

Compute the description length of a list of messages encoded separately.

Parameters:messages (list) – List of lists of symbols. Symbols are str. They are treated as atomic.
class clustering.distances.UPGMAClustersBuilder(*args, **kwargs)[source]

Bases: clustering.distances._DistanceClustersBuilder

Builder for UPGMA hierarchical clusters of inflection classes with hamming distance.

Variables:
  • microclasses (dict of str: list) – Inherited. Mapping of microclass exemplars to microclass inventories.
  • nodes (dict of frozenset: Node) – Inherited. Maps frozensets of microclass exemplars to Nodes representing clusters.
  • preferences (dict) – Inherited. Configuration parameters.
  • attr (str) – Inherited. Always has the value “DL”, as the nodes of the inflection class tree have a “DL” attribute.
  • paradigms (pandas.DataFrame) – Inherited. A dataframe of patterns.
  • distances (dict) – Inherited. The distance matrix between clusters.
update_distances(new)[source]

UPGMA update for distances.

Parameters:new (frozenset) – Frozenset of microclass exemplar representing the new cluster.
clustering.distances.compression_distance(a, b, merged)[source]

Compute the compression distances between description lengths.

Parameters:
  • a (float) – Description length of a cluster.
  • b (float) – Description length of a cluster.
  • merged (float) – Description length of the cluster merging both the clusters from a and b.
clustering.distances.compression_distance_atomic(x, y, table, microclasses, *args, **kwargs)[source]

Compute the compression distances between microclasses x and y from their exemplars.

Parameters:
  • x (str) – A microclass exemplar.
  • y (str) – A microclass exemplar.
  • table (pandas.DataFrame) – a dataframe of patterns.
  • microclasses (dict of str: list) – mapping of microclass exemplars to microclass inventories.
clustering.distances.dist_matrix(table, *args, labels=None, distfun=<function hamming>, half=False, default=inf, **kwargs)[source]

Output a distance matrix between clusters.

Parameters:
  • table (pandas.DataFrame) – a dataframe of patterns.
  • distfun (fun) – distance function.
  • labels (iterable) – the labels between which to compute distance. Defaults to the table’s index.
  • half (bool) – Whether to fill only a half matrix.
  • default (float) – Default distance.
Returns:

the distance matrix.

Return type:

distances (dict)

clustering.distances.hamming(x, y, table, *args, **kwargs)[source]

Compute the hamming distance between x and y in table.

Parameters:
  • x (any iterable) – vector.
  • y (any iterable) – vector.
  • table (pandas.DataFrame) – a dataframe of patterns.
Returns:

the hamming distance between x and y.

Return type:

(int)
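
The underlying computation is simply a count of mismatched positions; a minimal standalone version (ignoring the table argument) could look like:

def hamming(x, y):
    # Count the positions where the two vectors disagree.
    return sum(a != b for a, b in zip(x, y))

print(hamming("abcd", "abxd"))  # 1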

clustering.distances.split_description(descriptions)[source]

Split each description of a list on spaces to obtain symbols.

clustering.distances.table_to_descr(table, exemplars, microclasses)[source]

Create a list of descriptions from a paradigmatic table.

Parameters:
  • table (pandas.DataFrame) – a dataframe of patterns.
  • exemplars (iterable of str) – The microclasses to include in the description.
  • microclasses (dict of str: list) – mapping of microclass exemplars to microclass inventories.
clustering.utils module

Utilities used in clustering.

Author: Sacha Beniamine.

class clustering.utils.Node(labels, children=None, **kwargs)[source]

Bases: object

Represent an inflection class tree.

Variables:
  • labels (list) – labels of all the leaves under this node.
  • children (list) – direct children of this node.
  • attributes (dict) –

    attributes for this node. Currently, the following attributes are expected:
        size (int): size of the group represented by this node.
        DL (float): description length for this node.
        color (str): color of the splines from this node to its children, in a format usable by pyplot. Currently, red (“r”) is used when the node didn’t decrease description length, blue (“b”) otherwise.
        macroclass (bool): is the node in a macroclass?
        macroclass_root (bool): is the node the root of a macroclass?

    The attributes “_x” and “_rank” are reserved, and will be overwritten by the draw function.

__init__(labels, children=None, **kwargs)[source]

Node constructor.

Parameters:
  • labels (iterable) – labels of all the leaves under this node.
  • children (list) – direct children of this node.
  • kwargs – any other keyword argument will be added as node attributes. Note that certain algorithms expect the Node to have (int) “size”, (str) “color”, (bool) “macroclass”, or (float) “DL” attributes. The attributes “_x” and “_rank” are reserved, and will be overwritten by the draw function.
compute_xy(tree_placement=False, pos=None)[source]
draw(horizontal=False, square=False, leavesfunc=<function Node.<lambda>>, nodefunc=None, label_rotation=None, annotateOnlyMacroclasses=False, point=None, edge_attributes=None, interactive=False, lattice=False, pos=None, **kwargs)[source]

Draw the tree as a dendrogram-style pyplot graph.

Example:

                          square=True        square=False

                         │  ┌──┴──┐         │    ╱╲
horizontal=False         │  │   ┌─┴─┐       │   ╱  ╲
                         │  │   │   │       │  ╱   ╱╲
                         │  │   │   │       │ ╱   ╱  ╲
                         │__│___│___│       │╱___╱____╲

                        │─────┐             │⟍
                        │───┐ ├             │  ⟍
horizontal=True         │   ├─┘             │⟍ ⟋
                        │───┘               │⟋
                        │____________       │____________
Parameters:
  • horizontal (bool) – Should the tree be drawn with leaves on the y axis ? (Defaults to False: leaves on x axis).
  • square (bool) – Should the tree splines be squared with 90° angles? (Defaults to False)
  • leavesfunc (fun) – A function that will be applied to leaves before writing them down. Takes a Node, returns a str.
  • nodefunc (fun) – A function that will be applied to nodes to annotate them. Takes a Node, returns a str.
  • keep_above_macroclass (bool) – For macroclass history trees: Should the edges above macroclasses be drawn ? (Defaults to True).
  • annotateOnlyMacroclasses – For macroclass history trees: If True and nodelabel isn’t None, only the macroclasses nodes are annotated.
  • point (fun) – A function that maps a node to point attributes.
  • edge_attributes (fun) –
    A function that maps a pair of nodes to edge attributes.
    By default, use the parent’s color and “-” linestyle for nodes, “--” for leaves.
  • interactive (bool) – Whether this is destined to create an interactive plot.
  • lattice (bool) – Whether this node is a lattice rather than a tree.
  • pos (dict) – A dictionary of node label to x,y positions. Compatible with networkx layout functions. If absent, use networkx’s graphviz layout.
leaves()[source]
macroclasses(parent_is_macroclass=False)[source]

Find all the macroclass nodes in this tree.

to_latex(nodelabel=None, vertical=True, level_dist=50, square=True, leavesfunc=<function Node.<lambda>>, scale=1)[source]

Return a latex string, compatible with tikz-qtree

Parameters:
  • nodelabel – The name of the attribute to write on the nodes.
  • vertical – Should the tree be drawn vertically ?
  • level_dist – Distance between levels.
  • square – Should the arcs have a squared shape ?
  • leavesfunc (fun) – A function that will be applied to leaves before writing them down. Takes a Node, returns a str.
  • scale (int) – defaults to 1. tikzpicture scale argument.
to_networkx()[source]
clustering.utils.find_microclasses(paradigms)[source]

Find microclasses in a paradigm (lexemes whose rows are identical).

This is useful to identify an exemplar of each inflection microclass, and limit further computation to the collection of these exemplars.

Parameters:paradigms (pandas.DataFrame) – a dataframe containing inflectional paradigms. Columns are cells, and rows are lemmas.
Returns:
microclasses (dict).
classes is a dict. Its keys are exemplars, its values are lists of the names of rows identical to the exemplar. Each exemplar represents a microclass.
>>> classes
{"a":["a","A","aa"], "b":["b","B","BBB"]}
clustering.utils.find_min_attribute(tree, attr)[source]

Find the minimum value for an attribute in a tree.

Parameters:
  • tree (Node) – The tree in which to find the minimum attribute.
  • attr (str) – the attribute’s key.
clustering.utils.string_to_node(string, legacy_annotation_name=None)[source]

Parse an inflection tree written as a string.

Example

In the label, fields are separated by “#” as such:

(<labels>#<size>#<DL>#<color> (... ) (... ) )
Returns:The root of the tree
Return type:inflexClass.Node
Module contents
entropy package
Submodules
entropy.distribution module

author: Sacha Beniamine.

Encloses distribution of patterns on paradigms.

class entropy.distribution.PatternDistribution(paradigms, patterns, pat_dic, features=None)[source]

Bases: object

Statistical distribution of patterns.

Variables:
  • paradigms (pandas.DataFrame) – containing forms.
  • patterns (pandas.DataFrame) – containing pairwise patterns of alternation.
  • classes (pandas.DataFrame) – containing a representation of applicable patterns from one cell to another. Indexes are lemmas.
  • entropies (dict of int: pandas.DataFrame) – dict mapping n to a dataframe containing the entropies for the distribution \(P(c_{1}, ..., c_{n} \to c_{n+1})\).
__init__(paradigms, patterns, pat_dic, features=None)[source]

Constructor for PatternDistribution.

Parameters:
  • patterns (pandas.DataFrame) – patterns (columns are pairs of cells, index are lemmas).
  • logfile (TextIOWrapper) – Flow on which to write a log.
add_features(series)[source]
entropy_matrix(silent=False)[source]

Return a pandas.DataFrame with unary entropies, and one with counts of lexemes.

The result contains entropy \(H(c_{1} \to c_{2})\).

Values are computed for all unordered combinations of \((c_{1}, c_{2})\) in the PatternDistribution.paradigms’s columns. Indexes are predictor cells \(c_{1}\) and columns are the predicted cells \(c_{2}\).

Example

For two cells c1, c2, entropy of c1 → c2, noted \(H(c_{1} \to c_{2})\) is:

\[H( patterns_{c1, c2} | classes_{c1, c2} )\]
n_preds_distrib_log(logfile, n, sanity_check=False)[source]

Print a log of the probability distribution for n predictors.

Writes down the distributions:

\[P( patterns_{c1, c3}, \; \; patterns_{c2, c3} \; \; | classes_{c1, c3}, \; \; \; \; classes_{c2, c3}, \; \; patterns_{c1, c2} )\]

for all unordered combinations of two column names in PatternDistribution.paradigms.

Parameters:
  • logfile (io.TextIOWrapper) – Output flow on which to write.
  • n (int) – number of predictors.
  • sanity_check (bool) – Use a slower calculation to check that the results are exact.
n_preds_entropy_matrix(n)[source]

Return a pandas.DataFrame with n-ary entropies, and one with counts of lexemes.

The result contains entropy \(H(c_{1}, ..., c_{n} \to c_{n+1} )\).

Values are computed for all unordered combinations of \((c_{1}, ..., c_{n+1})\) in the PatternDistribution.paradigms’s columns. Indexes are tuples \((c_{1}, ..., c_{n})\) and columns are the predicted cells \(c_{n+1}\).

Example

For three cells c1, c2, c3, (n=2) entropy of c1, c2 → c3, noted \(H(c_{1}, c_{2} \to c_{3})\) is:

\[H( patterns_{c1, c3}, \; \; patterns_{c2, c3}\; \; | classes_{c1, c3}, \; \; \; \; classes_{c2, c3}, \; \; patterns_{c1, c2} )\]
Parameters:n (int) – number of predictors.
one_pred_distrib_log(logfile, sanity_check=False)[source]

Print a log of the probability distribution for one predictor.

Writes down the distributions \(P( patterns_{c1, c2} | classes_{c1, c2} )\) for all unordered combinations of two column names in PatternDistribution.paradigms. Also writes the entropy of the distributions.

Parameters:
  • logfile (io.TextIO) – Output flow on which to write.
  • sanity_check (bool) – Use a slower calculation to check that the results are exact.
read_entropy_from_file(filename)[source]

Read already computed entropies from a file.

Parameters:filename – the file’s path.
value_check(n, logfile=None)[source]

Check that predicting from n predictors isn’t harder than with fewer.

Check that the value of entropy from n predictors c1, …, cn is lower than the entropy from n-1 predictors c1, …, cn-1 (for all computed n-preds entropies).

Parameters:
  • n – number of predictors.
  • logfile (io.TextIOWrapper) – Output flow on which to write the detail of the result (optional).
class entropy.distribution.SplitPatternDistribution(paradigms_list, patterns_list, pat_dic_list, names, logfile=None, features=None)[source]

Bases: entropy.distribution.PatternDistribution

Implicative entropy distribution for split systems

Split system entropy is the joint entropy on both systems.

cond_bipartite_entropy(target=0, known=1)[source]

Conditional entropy between the two systems: H(c1→c2 | c1’→c2’) or H(c1’→c2’ | c1→c2).

mutual_information(normalize=False)[source]

Mutual information between the two systems.

entropy.distribution.dfsum(df, **kwargs)[source]
entropy.distribution.merge_split_df(dfs)[source]
entropy.distribution.value_norm(df)[source]

Rounding at 10 significant digits, avoiding negative 0s

entropy.utils module
entropy.utils.P(x, subset=None)[source]

Return the probability distribution of elements in a pandas.core.series.Series.

Parameters:
  • x (pandas.core.series.Series) – A series of data.
  • subset (iterable) – Only give the distribution for a subset of values.
Returns:

A pandas.core.series.Series whose index consists of x’s elements and whose values are their probabilities in x.

entropy.utils.cond_P(A, B, subset=None)[source]

Return the conditional probability distribution P(A|B) for elements in two pandas.core.series.Series.

Parameters:
  • A (pandas.core.series.Series) – A series of data.
  • B (pandas.core.series.Series) – A series of data.
  • subset (iterable) – Only give the distribution for a subset of values.
Returns:

A pandas.core.series.Series with two index levels. The first level comes from the elements of B, the second from the elements of A. The values are P(A|B).

entropy.utils.cond_entropy(A, B, **kwargs)[source]

Calculate the conditional entropy between two series of data points. Presupposes that values in the series are of the same type, typically tuples.
Parameters:
  • A (pandas.core.series.Series) – A series of data.
  • B (pandas.core.series.Series) – A series of data.
Returns:

H(A|B)

entropy.utils.entropy(A)[source]

Calculate the entropy for a series of probabilities.

Parameters:A (pandas.core.series.Series) – A series of probabilities.
Returns:H(A)
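For orientation, here is a minimal pandas-based sketch of what these three helpers could look like (a simplification, not the actual implementation; the subset handling in particular is an assumption):

import numpy as np
import pandas as pd

def P(x, subset=None):
    # Empirical probability distribution over the values of x.
    if subset is not None:
        x = x[x.isin(subset)]
    return x.value_counts(normalize=True)

def entropy(A):
    # A holds probabilities summing to 1.
    return -(A * np.log2(A)).sum()

def cond_entropy(A, B):
    # H(A|B) = H(A, B) - H(B), over empirical distributions.
    joint = pd.Series(list(zip(A, B)))
    return entropy(P(joint)) - entropy(P(B))

# e.g. cond_entropy(pd.Series(list("aabb")), pd.Series(list("xxyy"))) == 0.0,
# since B fully determines A in that toy example.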
Module contents
lattice package
Submodules
lattice.lattice module
class lattice.lattice.ICLattice(dataframe, leaves, annotate=None, dummy_formatter=None, keep_names=True, comp_prefix=None, col_formatter=None, na_value=None, AOC=False, collections=False, verbose=True)[source]

Bases: object

Inflection Class Lattice.

This is a wrapper around (concepts.Context).

__init__(dataframe, leaves, annotate=None, dummy_formatter=None, keep_names=True, comp_prefix=None, col_formatter=None, na_value=None, AOC=False, collections=False, verbose=True)[source]
Parameters:
  • dataframe (pandas.DataFrame) – A dataframe
  • leaves (dict) – Dictionary of microclasses.
  • annotate (dict) – Extra annotations to add on lattice. Of the form: {<object label>:<annotation>}
  • dummy_formatter (func) – Function to make dummies from the table (defaults to pandas’).
  • keep_names (bool) – whether to keep original column names when dropping duplicate dummy columns.
  • comp_prefix (str) – If there are two sets of properties, the prefix used to distinguish column names.
  • AOC (bool) – Whether to limit ourselves to Attribute or Object Concepts.
  • col_formatter (func) – Function to format columns in the context table.
  • na_value – A value to use as “NA”. Defaults to None.
  • collections (bool) – Whether the table contains representations.patterns.PatternCollection objects.
ancestors(identifier)[source]

Return all ancestors of the node corresponding to the identifier.

draw(filename, title='Lattice', **kwargs)[source]

Draw the lattice using clustering.Node’s drawing function.

parents(identifier)[source]

Return all direct parents of the node corresponding to the identifier.

stats()[source]

Return some stats about the classification’s size and shape. Based on self.nodes, not self.lattice: stats differ depending on whether AOC is used.

to_html(**kwargs)[source]
lattice.lattice.to_dummies(table, **kwargs)[source]

Make a context table from a dataframe.

Parameters:table (pandas.DataFrame) – A dataframe of patterns or strings
Returns:A context table.
Return type:dummies (pandas.DataFrame)
lattice.lattice.to_html_disabled(*args, **kwargs)[source]
Module contents
representations package
Submodules
representations.alignment module

author: Sacha Beniamine.

This module is used to align sequences.

representations.alignment.align_auto(s1, s2, insert_cost, sub_cost, distance_only=False, fillvalue='', **kwargs)[source]

Return all the best alignments of two words according to some edit distance matrix.

Parameters:
  • s1 (str) – first word to align
  • s2 (str) – second word to align
  • insert_cost (func) – A function which takes one value and returns an insertion cost
  • sub_cost (func) – A function which takes two values and returns a substitution cost
  • distance_only (bool) – defaults to False. If True, returns only the best distance. If False, returns an alignment.
  • fillvalue – (optional) the value with which to pad when iterables have varying lengths. Defaults to “”.
Returns:

Either an alignment (a list of list of zipped tuples), or a distance (if distance_only is True).

representations.alignment.align_baseline(*args, **kwargs)[source]

Simple alignment intended as an inflectional baseline (Albright & Hayes 2002).

Assumes a single change, either prefixal, suffixal, or infixal. This doesn’t work well when there is both a prefix and a suffix. Used as a baseline for the evaluation of the auto-aligned patterns.

see “Modeling English Past Tense Intuitions with Minimal Generalization”, Albright, A. & Hayes, B. Proceedings of the ACL-02 Workshop on Morphological and Phonological Learning - Volume 6, Association for Computational Linguistics, 2002, 58-69, page 2 :

“The exact procedure for finding a word-specific rule is as follows: given an input pair (X, Y), the model first finds the maximal left-side substring shared by the two forms (e.g., #mɪs), to create the C term (left side context). The model then exam- ines the remaining material and finds the maximal substring shared on the right side, to create the D term (right side context). The remaining material is the change; the non-shared string from the first form is the A term, and from the second form is the B term.”

Examples

>>> align_baseline("mɪs","mas")
[('m', 'm'), ('ɪ', 'a'), ('s', 's')]
>>> align_baseline("mɪs","mɪst")
[('m', 'm'), ('ɪ', 'ɪ'), ('s', 's'), ('', 't')]
>>> align_baseline("mɪs","amɪs")
[('', 'a'), ('m', 'm'), ('ɪ', 'ɪ'), ('s', 's')]
>>> align_baseline("mɪst","amɪs")
[('m', 'a'), ('ɪ', 'm'), ('s', 'ɪ'), ('t', 's')]
Parameters:
  • *args – any number of iterables >= 2
  • fillvalue – the value with which to pad when iterables have varying lengths. Defaults to “”.
Returns:

a list of zipped tuples.
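A sketch of this procedure, reusing the commonprefix and commonsuffix helpers documented below (a simplified reimplementation, not the library’s own code; the *_sketch name is invented):

from itertools import zip_longest
from representations.alignment import commonprefix, commonsuffix

def align_baseline_sketch(s1, s2, fillvalue=""):
    # C term: maximal shared left-side substring
    pre = commonprefix(s1, s2)
    r1, r2 = s1[len(pre):], s2[len(pre):]
    # D term: maximal shared right-side substring of the remainder
    suf = commonsuffix(r1, r2)
    cut1, cut2 = len(r1) - len(suf), len(r2) - len(suf)
    # A and B terms: the change, padded to equal length
    mid = list(zip_longest(r1[:cut1], r2[:cut2], fillvalue=fillvalue))
    return list(zip(pre, pre)) + mid + list(zip(suf, suf))

This reproduces the four doctests above, including the fallback to plain left alignment when neither edge is shared.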

representations.alignment.align_left(*args, **kwargs)[source]

Align left all arguments (wrapper around zip_longest).

Examples

>>> align_left("mɪs","mas")
[('m', 'm'), ('ɪ', 'a'), ('s', 's')]
>>> align_left("mɪs","mɪst")
[('m', 'm'), ('ɪ', 'ɪ'), ('s', 's'), ('', 't')]
>>> align_left("mɪs","amɪs")
[('m', 'a'), ('ɪ', 'm'), ('s', 'ɪ'), ('', 's')]
>>> align_left("mɪst","amɪs")
[('m', 'a'), ('ɪ', 'm'), ('s', 'ɪ'), ('t', 's')]
Parameters:
  • *args – any number of iterables >= 2
  • fillvalue – the value with which to pad when iterables have varying lengths. Defaults to “”.
Returns:

a list of zipped tuples, left aligned.
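Since this is documented as a wrapper around zip_longest, an equivalent sketch (name invented) is simply:

from itertools import zip_longest

def align_left_sketch(*args, fillvalue=""):
    # Pad the shorter iterables on the right, then zip position by position.
    return list(zip_longest(*args, fillvalue=fillvalue))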

representations.alignment.align_multi(*strings, **kwargs)[source]

Levenshtein-style alignment over arguments, two by two.

representations.alignment.align_right(*iterables, **kwargs)[source]

Align right all arguments. Zip longest with right alignment.

Examples

>>> align_right("mɪs","mas")
[('m', 'm'), ('ɪ', 'a'), ('s', 's')]
>>> align_right("mɪs","mɪst")
[('', 'm'), ('m', 'ɪ'), ('ɪ', 's'), ('s', 't')]
>>> align_right("mɪs","amɪs")
[('', 'a'), ('m', 'm'), ('ɪ', 'ɪ'), ('s', 's')]
>>> align_right("mɪst","amɪs")
[('m', 'a'), ('ɪ', 'm'), ('s', 'ɪ'), ('t', 's')]
Parameters:
  • *iterables – any number of iterables >= 2
  • fillvalue – the value with which to pad when iterables have varying lengths. Defaults to “”.
Returns:

a list of zipped tuples, right aligned.
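A sketch in the same spirit (name invented): reverse each iterable, left-align, then reverse the result, so that padding ends up on the left:

from itertools import zip_longest

def align_right_sketch(*iterables, fillvalue=""):
    rev = [list(reversed(list(it))) for it in iterables]
    return list(reversed(list(zip_longest(*rev, fillvalue=fillvalue))))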

representations.alignment.commonprefix(*args)[source]

Given a list of strings, returns the longest common prefix.

representations.alignment.commonsuffix(*args)[source]

Given a list of strings, returns the longest common suffix.

representations.alignment.levenshtein_ins_cost(*_)[source]
representations.alignment.levenshtein_sub_cost(a, b)[source]
representations.alignment.multi_sub_cost(a, b)[source]
representations.confusables module

author: Sacha Beniamine.

This module is used to get characters similar to other utf8 characters.

representations.confusables.parse(filename)[source]

Parse a file with confusable chars association, return a dict.

representations.contexts module

author: Sacha Beniamine.

This module implements patterns’ contexts, which are series of phonological restrictions.

class representations.contexts.Context(segments)[source]

Bases: object

Context for an alternation pattern

classmethod merge(contexts, debug=False)[source]

Merge contexts to generalize them.

Merge contexts and combine their restrictions into a new context.

Parameters:
  • contexts – iterable of Contexts.
  • debug – whether to print debug strings.
Returns:

a merged context

to_str(mode=2)[source]
representations.generalize module

author: Sacha Beniamine.

This module is used to generalise patterns’ contexts.

representations.generalize.generalize_patterns(pats, debug=False)[source]

Generalize these patterns’ context.

Parameters:
  • pats – an iterable of Patterns.Pattern objects.
  • debug – whether to print debug strings.
Returns:

a new Patterns.Pattern.

representations.generalize.incremental_generalize_patterns(*args)[source]

Merge patterns incrementally as long as the pattern has the same coverage.

Attempt to merge the patterns two by two, refraining from doing so if the merged pattern doesn’t match all the lexemes that led to its inference. Also attempt to merge together patterns that have not been merged with others.

Parameters:*args – the patterns
Returns:a list of patterns, at best of length 1, at worst of the same length as the input.
representations.patterns module

author: Sacha Beniamine.

This module addresses the modeling of inflectional alternation patterns.

class representations.patterns.BinaryPattern(*args, **kwargs)[source]

Bases: representations.patterns.Pattern

Represent the alternation pattern between two forms.

A BinaryPattern is a Patterns.Pattern over just two forms. Applying the pattern to one of the original forms yields the second one.

As an example, we will use the following alternation from a French present-tense verb:

cells                 Forms                   Transcription
prs.1.sg ⇌ prs.2.pl   j’amène ⇌ vous amenez   amEn ⇌ amənE

Example

>>> cells = ("prs.1.sg", "prs.2.pl")
>>> forms = ("amEn", "amənE")
>>> p = Pattern(cells, forms, aligned=False)
>>> type(p)
representations.patterns.BinaryPattern
>>> p
E_ ⇌ ə_E / am_n_ <0>
>>> p.apply("amEn",cells)
'amənE'
applicable(form, cell)[source]

Test if this pattern matches a form, i.e. if the pattern is applicable to the form.

Parameters:
  • form (str) – a form.
  • cell (str) – A cell contained in self.cells.
Returns:

whether the pattern is applicable to the form from that cell.

Return type:

bool

apply(form, names, raiseOnFail=True)[source]

Apply the pattern to a form.

Parameters:
  • form – a form, assumed to belong to the cell names[0].
  • names – apply to a form of cell names[0] to produce a form of cell names[1] (default: self.cells). Since patterns are non-oriented, it is better to pass the names argument explicitly.
  • raiseOnFail (bool) – defaults to True. If true, raise an error when the pattern is not applicable to the form. If False, return None instead.
Returns:

the form belonging to the opposite cell.

exception representations.patterns.NotApplicable[source]

Bases: Exception

Raised when a patterns.Pattern can’t be applied to a form.

class representations.patterns.Pattern(cells, forms, aligned=False, **kwargs)[source]

Bases: object

Represent an alternation pattern and its context.

The pattern can be defined over an arbitrary number of forms. If there are only two forms, a patterns.BinaryPattern will be created.

Variables:
  • cells (tuple) – Cell labels.
  • alternation (dict of str: list of tuple) – Maps the cells’ names to a list of tuples of alternating material.
  • context (tuple of str) – Sequence of (str, Quantifier) pairs or “{}” (which stands for alternating material).
  • score (float) – A score used to choose among patterns.

Example

>>> cells = ("prs.1.sg", "prs.1.pl","prs.2.pl")
>>> forms = ("amEn", "amənõ", "amənE")
>>> p = patterns.Pattern(cells, forms, aligned=False)
>>> p
E_ ⇌ ə_ɔ̃ ⇌ ə_E / am_n_ <0>
__init__(cells, forms, aligned=False, **kwargs)[source]

Constructor for Pattern.

Parameters:
  • cells (iterable) – Cell labels (str), in the same order as the forms.
  • forms (iterable) – Forms (str) to be segmented.
  • aligned (bool) – whether forms are already aligned. Otherwise, left alignment will be performed.
alternation_list(exhaustive_blanks=True, use_gen=False, filler='_')[source]

Return a list of the alternating material, where the context is replaced by a filler.

Parameters:
  • exhaustive_blanks (bool) – Whether initial and final contexts should be marked by a filler.
  • use_gen (bool) – Whether the alternation should be the generalized one.
  • filler (str) – Alternative filler used to join alternation members.
Returns:

a list of str of alternating material, where the context is replaced by a filler.

is_identity()[source]
classmethod new_identity(cells)[source]

Create a new identity pattern for a given set of cells.

to_alt(exhaustive_blanks=True, use_gen=False, **kwargs)[source]

Join the alternating material obtained with alternation_list() into a str.

class representations.patterns.PatternCollection(collection)[source]

Bases: object

Represent a set of patterns.

representations.patterns.are_all_identical(iterable)[source]

Test whether all elements in the iterable are identical.

representations.patterns.find_alternations(paradigms, method, **kwargs)[source]

Find local alternations in a Dataframe of paradigms.

For each pair of forms in the paradigm, keep only the alternating material (words are left-aligned). Return the resulting DataFrame.

Parameters:
  • paradigms (pandas.DataFrame) – a dataframe containing inflectional paradigms. Columns are cells, and rows are lemmas.
  • method (str) – “local” uses pairs of forms, “global” uses entire paradigms.
Returns:

a dataframe with the same indexes as paradigms and as many columns as possible combinations of columns in paradigms, filled with segmented patterns.

Return type:

pandas.DataFrame

representations.patterns.find_applicable(paradigms, pat_dict, disable_tqdm=False, **kwargs)[source]

Find all applicable rules for each form.

We call a set of applicable rules a class. Classes are oriented: we produce two separate columns (a, b) and (b, a) for each pair of columns (a, b) in the paradigm.

Parameters:
  • paradigms (pandas.DataFrame) – paradigms (columns are cells, index are lemmas).
  • pat_dict (dict) – a dict mapping a column name to a list of patterns.
  • disable_tqdm (bool) – if true, do not show a progress bar
Returns:

a DataFrame associating a lemma (index) and an ordered pair of paradigm cells (columns) with a tuple representing a class of applicable patterns.

Return type:

(pandas.DataFrame)

representations.patterns.find_endings(paradigms, *args, disable_tqdm=False, **kwargs)[source]

Find suffixes in a paradigm.

Return a DataFrame of endings, where the prefix common to all of a row’s cells has been removed from each form in that row.

Parameters:
  • paradigms (pandas.DataFrame) – a dataframe containing inflectional paradigms. Columns are cells, and rows are lemmas.
  • disable_tqdm (bool) – if true, do not show a progress bar
Returns:

a dataframe of the same shape filled with segmented endings.

Return type:

pandas.DataFrame

Example

>>> df = pd.DataFrame([["amEn", "amEn", "amEn", "amənõ", "amənE", "amEn"]],
...     columns=["prs.1.sg", "prs.2.sg", "prs.3.sg", "prs.1.pl", "prs.2.pl", "prs.3.pl"],
...     index=["amener"])
>>> df
       prs.1.sg prs.2.sg prs.3.sg prs.1.pl prs.2.pl prs.3.pl
amener     amEn     amEn     amEn    amənõ    amənE     amEn
>>> find_endings(df)
       prs.1.sg prs.2.sg prs.3.sg prs.1.pl prs.2.pl prs.3.pl
amener       En       En       En      ənõ      ənE       En
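Row-wise, this amounts to stripping the longest common prefix of the row’s forms; a sketch (helper name invented), using commonprefix from representations.alignment:

from representations.alignment import commonprefix

def strip_common_prefix(row):
    # row: all the forms of one lexeme, across cells
    prefix = commonprefix(*row)
    return [form[len(prefix):] for form in row]

# e.g. applied row-wise: df.apply(strip_common_prefix, axis=1, result_type="broadcast")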
representations.patterns.find_patterns(paradigms, method, **kwargs)[source]

Find Patterns in a DataFrame according to any general method.

Methods can be:
  • suffix (align left),
  • prefix (align right),
  • baseline (see Albright & Hayes 2002)
  • levenshtein (dynamic alignment using levenshtein scores)
  • similarity (dynamic alignment using segment similarity scores)
Parameters:
  • paradigms (pandas.DataFrame) – paradigms (columns are cells, index are lemmas).
  • method (str) – “suffix”, “prefix”, “baseline”, “levenshtein” or “similarity”
Returns:

patterns, pat_dict: patterns is the created pandas.DataFrame, and pat_dict is a dict mapping a column name to a list of patterns.

Return type:

(tuple)

representations.patterns.from_csv(filename, defective=True, overabundant=True)[source]

Read a Patterns DataFrame from a csv.

representations.patterns.make_pairs(paradigms)[source]

Join columns pairwise with “ ⇌ ”.

The output has one column for each pair of the paradigm’s columns.
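A sketch of this combination step (name invented; assumes forms are plain strings):

import pandas as pd
from itertools import combinations

def make_pairs_sketch(paradigms):
    # One column per unordered pair of paradigm columns, joined by " ⇌ ".
    return pd.DataFrame({(a, b): paradigms[a] + " ⇌ " + paradigms[b]
                         for a, b in combinations(paradigms.columns, 2)})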

representations.patterns.to_csv(dataframe, filename, pretty=False)[source]

Export a Patterns DataFrame to csv.

representations.quantity module

author: Sacha Beniamine.

This module provides Quantity objects to represent quantifiers.

class representations.quantity.Quantity(mini, maxi)[source]

Bases: object

Represents a quantifier as an interval.

This is a flyweight class. The presets are:

description   mini   maxi   regex symbol   variable name
Match one     1      1                     quantity.one
Optional      0      1      ?              quantity.optional
Some          1      inf    +              quantity.some
Any           0      inf    *              quantity.kleenestar
None          0      0
__init__(mini, maxi)[source]
Parameters:
  • mini (int) – the minimum number of elements matched.
  • maxi (int) – the maximum number of elements matched.
representations.quantity.quantity_largest(args)[source]

Reduce on the “&” operator of quantities.

Returns a quantity with the minimum left value and maximum right value.

Example

>>> quantity_largest([Quantity(0,1),Quantity(1,1),Quantity(1,np.inf)])
Quantity(0,np.inf)
Parameters:args – an iterable of quantities.
representations.quantity.quantity_sum(args)[source]

Reduce on the “+” operator of quantities.

Returns a quantity with the minimum left value and the sum of the right values.

Example

>>> quantity_sum([Quantity(0,1),Quantity(1,1),Quantity(0,0)])
Quantity(0,2)
Parameters:args – an iterable of quantities.
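Both reductions can be sketched as follows, assuming Quantity exposes its bounds under the constructor’s parameter names (mini/maxi; an assumption, and the *_sketch names are invented):

from representations.quantity import Quantity

def quantity_largest_sketch(quantities):
    # Smallest lower bound, largest upper bound.
    qs = list(quantities)
    return Quantity(min(q.mini for q in qs), max(q.maxi for q in qs))

def quantity_sum_sketch(quantities):
    # Smallest lower bound, sum of the upper bounds.
    qs = list(quantities)
    return Quantity(min(q.mini for q in qs), sum(q.maxi for q in qs))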
representations.segments module

author: Sacha Beniamine.

This module addresses the modelling of phonological segments.

class representations.segments.Segment(classes, features, alias, chars, shorthand=None)[source]

Bases: object

The Segments.Segment class holds the definition of a single segment.

This is a lightweight class.

Variables:
  • name (str or _CharClass) – Name of the segment.
  • features (frozenset of tuples) – The tuples are of the form (attribute, value) with a positive value, used for set operations.
__init__(classes, features, alias, chars, shorthand=None)[source]

Constructor for Segment.

classmethod get(descriptor)[source]

Get a Segment from an alias.

classmethod get_from_transform(a, transform)[source]

Get a segment from another according to a transformation tuple.

In the following example, the segments have been initialized with French segment definitions.

Parameters:
  • a (str) – Segment alias
  • transform (tuple) – Pair of two strings of segment aliases.

Example

>>> Segment.get_from_transform("d",("bdpt", "fsvz"))
'z'
classmethod get_transform_features(left, right)[source]

Get the features corresponding to a transformation.

Parameters:
  • left (tuple) – string of segment aliases
  • right (tuple) – string of segment aliases

Example

>>> Segment.get_from_transform("bd", "pt")
{'+vois'}, {'-vois'}
classmethod init_dissimilarity_matrix(gap_prop=0.24, **kwargs)[source]

Compute score matrix with dissimilarity scores.

classmethod insert_cost(*_)[source]
classmethod intersect(*args)[source]

Intersect some segments from their names/aliases. This is the “meet” operation on the lattice nodes, and returns the lowest common ancestor.

Returns:a str or _CharClass representing the segment whose classes are the intersection of the inputs’ classes.
classmethod set_max()[source]

Set a variable to the top of the natural classes lattice.

classmethod show_pool(only_single=False)[source]

Return a string description of the whole segment pool.

similarity[source]

Compute phonological similarity (Frisch, 2004)

The function is memoized. Measure from “Similarity avoidance and the OCP”, Frisch, S. A.; Pierrehumbert, J. B. & Broe, M. B., Natural Language & Linguistic Theory, Springer, 2004, 22, 179-228, p. 198.

We compute similarity by comparing the number of shared and unshared natural classes of two consonants, using the equation in (7). This equation is a direct extension of the Pierrehumbert (1993) feature similarity metric to the case of natural classes.

\[Similarity = \frac{\text{Shared natural classes}}{\text{Shared natural classes} + \text{Non-shared natural classes}}\]
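As a sketch of that equation, assuming the natural classes of each segment are available as Python sets (function name invented):

def frisch_similarity(classes_a, classes_b):
    # Shared / (shared + non-shared) natural classes, eq. (7) in Frisch et al.
    shared = len(classes_a & classes_b)
    non_shared = len(classes_a ^ classes_b)
    return shared / (shared + non_shared)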
classmethod sub_cost(a, b)[source]
classmethod transformation(a, b)[source]

Find a transformation between aliases a and b.

The transformation is a pair of two maximal sets of segments related by a bijective phonological function.

This function takes a pair of strings representing segments. It calculates the function which relates these two segments. It then finds the two maximal sets of segments related by this function.

Example

In French, t -> s can be expressed by a phonological function which changes [-cont] and [-rel. ret] to [+cont] and [+rel. ret].

These other segments are related by the same change: d -> z, b -> v, p -> f.

>>> a,b = Segment.transformation("t","s")
>>> print(a,b)
[bdpt] [fsvz]
Parameters:a,b (str) – Segment aliases.
Returns:two charclasses.
representations.segments.initialize(filename, sep='\t', verbose=False)[source]
representations.segments.make_aliases(ipa)[source]

Associate a single symbol to each multi-character segment. Return a restoration map.

This function takes a segments table and replaces the entries of the “Segs.” column with a unique character for each multi-character cell. A dict is returned that allows restoring the original segment names.

Input Segs.   Output Segs.
ɑ̃             â
a             a

The table can have an optional UNICODE column. It will be dropped at the end of the process.

Parameters:ipa (pandas.DataFrame) – Dataframe of segments. Columns are features and indexes are segments. An optional UNICODE column can specify alternate characters.
Returns:a map from the simplified names to the original segment names.
Return type:alias_map (dict)
representations.segments.normalize(ipa, features)[source]

Assign a normalized segment to groups of segments with identical rows.

This function takes a segments table and adds in place a “Normalized” column. This column contains a common value for each segment with identical boolean values. The function also returns a translation table mapping indexes to normalized segments.

Note: the indexes are expected to be one character long.

Index   ..features..   Normalized
ɛ       […]            E
e       […]            E
Parameters:
  • ipa (pandas.DataFrame) – Dataframe of segments. Columns are features, UNICODE code point representation and segment names, indexes are segments.
  • features (list) – Feature columns’ names.
Returns:

a translation table from each segment’s name to its normalized name.

Return type:

norm_map (dict)

representations.segments.restore(char)[source]

Restore the original string from an alias.

representations.segments.restore_segment_shortest(segment)[source]

Restore segment to the shortest of either the original character or its feature list.

representations.segments.restore_string(string)[source]

Restore the original string from a string of aliases.

representations.utils module

author: Sacha Beniamine.

Utility functions for representations.

representations.utils.create_features(data_file_name)[source]

Read features and preprocess them to be coindexed with the paradigms.

representations.utils.create_paradigms(data_file_name, cols=None, verbose=False, fillna=True, segcheck=False, merge_duplicates=False, defective=False, overabundant=False, merge_cols=False)[source]

Read paradigms data, and prepare it according to a Segment class pool.

Parameters:
  • data_file_name (str) – path to the paradigm csv file. All characters occurring in the paradigms (except the first column) should be inventoried in the Segment class pool.
  • cols (list of str) – a subset of columns to use from the paradigm file.
  • verbose (bool) – verbosity switch.
  • merge_duplicates (bool) – should identical columns be merged?
  • fillna (bool) – Defaults to True. Should #DEF# be replaced by np.NaN? Otherwise they are filled with empty strings (“”).
  • segcheck (bool) – Defaults to False. Should we check that all the phonological segments in the table are defined in the segments table?
  • defective (bool) – Defaults to False. Should rows with defective forms be kept?
  • overabundant (bool) – Defaults to False. Should rows with overabundant forms be kept?
  • merge_cols (bool) – Defaults to False. Should identical (fully syncretic) columns be merged?
Returns:

paradigms table (columns are cells, index are lemmas).

Return type:

paradigms (pandas.DataFrame)

representations.utils.normalize_dataframe(paradigms, aliases, normalization, verbose=False)[source]

Normalize and simplify a dataframe.

Aliases: every sequence of n characters representing a segment is replaced with a unique character representing this segment.

Normalization: all groups of characters representing the same feature set are translated to one unique character.

Note: a .translate strategy works for normalization but not for aliases, since it only maps single characters to single characters. The order of operations matters, since .translate assumes a 1:1 character mapping.
Parameters:
  • paradigms (pandas.DataFrame) – paradigms table (columns are cells, index are lemmas).
  • aliases (dict) – dictionary mapping segments (as found in the paradigms) to their aliased versions (one character long)
  • normalization (dict) – dictionary mapping one aliased character to another, to replace segments which have the same feature set.
  • verbose (bool) – verbosity switch.
Returns:

The same dataframe, normalized and simplified.

Return type:

new_df (pandas.DataFrame)
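A sketch of the two steps in the required order (a simplification, not the actual implementation; the function name is invented):

def normalize_dataframe_sketch(paradigms, aliases, normalization):
    # Step 1: replace multi-character segments with single-character aliases.
    def alias(form):
        for segment, alias_char in aliases.items():
            form = form.replace(segment, alias_char)
        return form
    # Step 2: 1-to-1 replacement of aliased characters, via str.translate.
    table = str.maketrans(normalization)
    return paradigms.applymap(lambda form: alias(form).translate(table))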

representations.utils.unique_lexemes(series)[source]

Rename duplicates in a series of strings.

Take a pandas series of strings and output another series where all originally duplicated strings are numbered, so that each cell contains a unique string.
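A sketch of one way to do this with pandas (name invented; the exact numbering scheme is an assumption):

import pandas as pd

def unique_lexemes_sketch(series):
    # Number every string that occurs more than once:
    # "a", "b", "a" becomes "a_0", "b", "a_1".
    counter = series.groupby(series).cumcount().astype(str)
    duplicated = series.duplicated(keep=False)
    return series.where(~duplicated, series + "_" + counter)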

Module contents
utils package
Module contents
utils.get_repository_version()[source]

Return an ID for the current git or svn revision.

If the directory isn’t under git or svn, the function returns an empty str.

Returns:
(str): svn/git version or ‘’.
utils.merge_duplicate_columns(df, sep=';', keep_names=True)[source]

Merge duplicate columns and return new DataFrame.

Parameters:
  • df (pandas.DataFrame) – A dataframe
  • sep (str) – separator to use when joining columns names.
  • keep_names (bool) – Whether to keep the names of the original duplicated columns by merging them onto the columns we keep.