qumin.representations package¶

Submodules¶

qumin.representations.alignment module¶

author: Sacha Beniamine.

This module is used to align sequences.

qumin.representations.alignment.align_auto(s1, s2, insert_cost, sub_cost, distance_only=False, fillvalue='', **kwargs)[source]¶

Return all the best alignments of two words according to some edit distance matrix.

Parameters:

s1 (str) – first word to align
s2 (str) – second word to align
insert_cost (Callable) – A function which takes one value and returns an insertion cost
sub_cost (Callable) – A function which takes two values and returns a substitution cost
distance_only (bool) – defaults to False. If True, returns only the best distance. If False, returns an alignment.
fillvalue – (optional) the value with which to pad when iterables have varying lengths. Default: “”.

Returns:

Either an alignment (a list of list of zipped tuples), or a distance (if distance_only is True).
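
The role of the two cost callables can be illustrated with a minimal dynamic-programming sketch. This is not Qumin's implementation: edit_distance, unit_insert_cost and unit_sub_cost are hypothetical names, and the sketch only covers the distance_only=True case (it returns a distance, not the alignments).

```python
def edit_distance(s1, s2, insert_cost, sub_cost):
    """Weighted edit distance between two sequences (illustrative sketch)."""
    rows, cols = len(s1) + 1, len(s2) + 1
    dist = [[0] * cols for _ in range(rows)]
    # First column and first row: only insertions/deletions are possible.
    for i in range(1, rows):
        dist[i][0] = dist[i - 1][0] + insert_cost(s1[i - 1])
    for j in range(1, cols):
        dist[0][j] = dist[0][j - 1] + insert_cost(s2[j - 1])
    for i in range(1, rows):
        for j in range(1, cols):
            dist[i][j] = min(
                dist[i - 1][j] + insert_cost(s1[i - 1]),  # gap in s2
                dist[i][j - 1] + insert_cost(s2[j - 1]),  # gap in s1
                dist[i - 1][j - 1] + sub_cost(s1[i - 1], s2[j - 1]),
            )
    return dist[-1][-1]


def unit_insert_cost(_segment):
    """Assumed cost: every insertion or deletion costs 1."""
    return 1


def unit_sub_cost(a, b):
    """Assumed cost: identical segments are free, any substitution costs 1."""
    return 0 if a == b else 1


print(edit_distance("mɪs", "mɪst", unit_insert_cost, unit_sub_cost))  # prints 1
```

Swapping in a substitution cost based on segment similarity changes which alignments are optimal without touching the recurrence.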

qumin.representations.alignment.align_baseline(*args, **kwargs)[source]¶

Simple alignment intended as an inflectional baseline. (Albright & Hayes 2002)

Assumes a single change, either prefixal, suffixal, or infixal. This doesn’t work well when there is both a prefix and a suffix. Used as a baseline for evaluation of the auto-aligned patterns.

see “Modeling English Past Tense Intuitions with Minimal Generalization”, Albright, A. & Hayes, B. Proceedings of the ACL-02 Workshop on Morphological and Phonological Learning - Volume 6, Association for Computational Linguistics, 2002, 58-69, page 2 :

“The exact procedure for finding a word-specific rule is as follows: given an input pair (X, Y), the model first finds the maximal left-side substring shared by the two forms (e.g., #mɪs), to create the C term (left side context). The model then examines the remaining material and finds the maximal substring shared on the right side, to create the D term (right side context). The remaining material is the change; the non-shared string from the first form is the A term, and from the second form is the B term.”

Examples

>>> align_baseline("mɪs","mas")
[('m', 'm'), ('ɪ', 'a'), ('s', 's')]
>>> align_baseline("mɪs","mɪst")
[('m', 'm'), ('ɪ', 'ɪ'), ('s', 's'), ('', 't')]
>>> align_baseline("mɪs","amɪs")
[('', 'a'), ('m', 'm'), ('ɪ', 'ɪ'), ('s', 's')]
>>> align_baseline("mɪst","amɪs")
[('m', 'a'), ('ɪ', 'm'), ('s', 'ɪ'), ('t', 's')]

Parameters:

*args – any number of iterables >= 2
fillvalue – the value with which to pad when iterables have varying lengths. Default: “”.

Returns:

a list of zipped tuples.
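
The quoted procedure (shared left context, then shared right context, then the change) can be sketched for two strings as follows. This is a reading of the description above, not Qumin's code: baseline_alignment is a hypothetical name, it handles exactly two strings, and it left-aligns the non-shared material.

```python
from itertools import zip_longest
from os.path import commonprefix


def baseline_alignment(a, b, fillvalue=""):
    """Align two strings around a single change (illustrative sketch)."""
    # C term: maximal shared left-side substring.
    prefix = commonprefix([a, b])
    rest_a, rest_b = a[len(prefix):], b[len(prefix):]
    # D term: maximal shared right-side substring of what remains.
    suffix = commonprefix([rest_a[::-1], rest_b[::-1]])[::-1]
    # A and B terms: the non-shared material in each form.
    change_a = rest_a[:len(rest_a) - len(suffix)]
    change_b = rest_b[:len(rest_b) - len(suffix)]
    return (
        list(zip(prefix, prefix))
        + list(zip_longest(change_a, change_b, fillvalue=fillvalue))
        + list(zip(suffix, suffix))
    )


print(baseline_alignment("mɪs", "amɪs"))
# [('', 'a'), ('m', 'm'), ('ɪ', 'ɪ'), ('s', 's')]
```

On the four input pairs used in the examples above, this sketch yields the same lists as the documented output.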

qumin.representations.alignment.align_left(*args, **kwargs)[source]¶

Align left all arguments (wrapper around zip_longest).

Examples

>>> align_left("mɪs","mas")
[('m', 'm'), ('ɪ', 'a'), ('s', 's')]
>>> align_left("mɪs","mɪst")
[('m', 'm'), ('ɪ', 'ɪ'), ('s', 's'), ('', 't')]
>>> align_left("mɪs","amɪs")
[('m', 'a'), ('ɪ', 'm'), ('s', 'ɪ'), ('', 's')]
>>> align_left("mɪst","amɪs")
[('m', 'a'), ('ɪ', 'm'), ('s', 'ɪ'), ('t', 's')]

Parameters:

*args – any number of iterables >= 2
fillvalue – the value with which to pad when iterables have varying lengths. Default: “”.

Returns:

a list of zipped tuples, left aligned.

qumin.representations.alignment.align_multi(*strings, **kwargs)[source]¶: Levenshtein-style alignment over arguments, two by two.

qumin.representations.alignment.align_right(*iterables, **kwargs)[source]¶

Align right all arguments. Zip longest with right alignment.

Examples

>>> align_right("mɪs","mas")
[('m', 'm'), ('ɪ', 'a'), ('s', 's')]
>>> align_right("mɪs","mɪst")
[('', 'm'), ('m', 'ɪ'), ('ɪ', 's'), ('s', 't')]
>>> align_right("mɪs","amɪs")
[('', 'a'), ('m', 'm'), ('ɪ', 'ɪ'), ('s', 's')]
>>> align_right("mɪst","amɪs")
[('m', 'a'), ('ɪ', 'm'), ('s', 'ɪ'), ('t', 's')]

Parameters:

*iterables – any number of iterables >= 2
fillvalue – the value with which to pad when iterables have varying lengths. Default: “”.

Returns:

a list of zipped tuples, right aligned.
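
Right alignment can be obtained from itertools.zip_longest by reversing the inputs, zipping, and reversing the result. The sketch below shows that idea; right_align is a hypothetical name and the body is not taken from Qumin's source.

```python
from itertools import zip_longest


def right_align(*iterables, fillvalue=""):
    """Zip iterables so that their last elements line up (illustrative sketch)."""
    reversed_args = [list(it)[::-1] for it in iterables]
    # Padding is added at the end of the reversed sequences,
    # i.e. at the beginning of the original ones.
    zipped = list(zip_longest(*reversed_args, fillvalue=fillvalue))
    return zipped[::-1]


print(right_align("mɪs", "mɪst"))
# [('', 'm'), ('m', 'ɪ'), ('ɪ', 's'), ('s', 't')]
```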

qumin.representations.alignment.commonprefix(*args)[source]¶: Given a list of strings, returns the longest common prefix

qumin.representations.alignment.commonsuffix(*args)[source]¶: Given a list of strings, returns the longest common suffix

qumin.representations.alignment.edits_ins_cost(*_)[source]¶

qumin.representations.alignment.edits_sub_cost(a, b)[source]¶

qumin.representations.alignment.multi_sub_cost(a, b)[source]¶

qumin.representations.contexts module¶

author: Sacha Beniamine.

This module implements patterns’ contexts, which are series of phonological restrictions.

class qumin.representations.contexts.Context(segments, inv)[source]¶

Bases: object

Context for an alternation pattern

feat_str(inv)[source]¶

classmethod merge(contexts, inv)[source]¶

Merge contexts to generalize them.

Merge contexts and combine their restrictions into a new context.

Parameters:

contexts – iterable of Contexts.
inv – Inventory instance

Returns:

a merged context

to_str(inv, mode=2)[source]¶

qumin.representations.frequencies module¶

author: Jules Bouton.

Class for frequency management.

class qumin.representations.frequencies.Frequencies(package, *args, source=False, **kwargs)[source]¶

Bases: object

Frequency management for a Paralex dataset. Frequencies are built for forms, lexemes and cells.

The parsed frequency columns or tables should conform to the Paralex principles:

An empty value means that there is no measure available
A zero value means that there is a measure, which is zero

When aggregating across rows, any empty cell yields a uniform distribution for the whole set of rows, whereas zeros are taken into account. This behaviour can be disabled for some functions by passing skipna=True.

Examples

>>> p = fl.Package('tests/data/TestPackage/test.package.json')
>>> f = Frequencies(p)
>>> print(f.info().to_markdown())
| Table   | Source      |   Records |   Sum(f) |   Mean(f) |
|:--------|:------------|----------:|---------:|----------:|
| forms   | forms_table |        22 |      519 |   27.3158 |
| lexemes | forms_table |         4 |      519 |  129.75   |
| cells   | forms_table |         4 |      519 |  129.75   |

Variables:

p (frictionless.Package) – package to analyze
source (Dict[str, str]) – source used by default for each table. Contains either a value for the source field of a Paralex frequency table, or the name of the table used to extract the frequency.
forms (pandas.DataFrame) – Table of frequency values associated to a form_id.
lexemes (pandas.DataFrame) – Table of frequency values associated to a lexeme_id.
cells (pandas.DataFrame) – Table of frequency values associated to a cell_id.

__init__(package, *args, source=False, **kwargs)[source]¶

Constructor for Frequencies. We gather and store frequencies for forms, lexemes and cells. Behaviour is the following:

If force_uniform is True, we use the paradigms table to generate a Uniform distribution.
If not, we try to get a frequency column from the tables: form, lexemes, cell
If any of those is missing, we use the frequencies table.
If we can’t use the frequency table, we fall back to a uniform.
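
The fallback order above can be summarized as a small decision function. This is only a schematic reading of the list: pick_frequency_source and its three boolean arguments are hypothetical and do not correspond to Qumin's actual signature.

```python
def pick_frequency_source(force_uniform, has_frequency_column, has_frequency_table):
    """Return which source would be used for one table (illustrative sketch)."""
    if force_uniform:
        return "uniform"
    if has_frequency_column:
        # A frequency column is available in the forms, lexemes or cells table.
        return "column"
    if has_frequency_table:
        # Otherwise, fall back to the dedicated frequencies table.
        return "frequencies_table"
    # Nothing usable: fall back to a uniform distribution.
    return "uniform"


print(pick_frequency_source(False, False, True))  # prints frequencies_table
```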

Parameters:

package (frictionless.Package) – package to analyze
source (Dict[str, str]) – name of the source to use when several are available.
**kwargs – keyword arguments for frequency reading methods.

col_names = ['lexeme', 'cell', 'form']¶

drop_unused(paradigms)[source]¶: If the paradigms table implied some sampling / filtering, make sure that the frequencies are also sampled.

get_absolute_freq(mean=False, group_on=False, skipna=False, **kwargs)[source]¶

Return the frequency of an item for a given source

The frequency of an item is defined as the sum of the frequencies of this item across all rows.

Examples

>>> p = fl.Package('tests/data/TestPackage/test.package.json')
>>> f = Frequencies(p)
>>> f.get_absolute_freq(filters={'lexeme':'q'}, group_on="index", skipna=True)
form
11    12.0
12     6.0
14    20.0
18     NaN
23    20.0
Name: value, dtype: float64
>>> float(f.get_absolute_freq(filters={'lexeme':'q'}))
nan
>>> float(f.get_absolute_freq(filters={'cell':'third'}, mean=True, skipna=True))
20.0
>>> f.get_absolute_freq(group_on=['lexeme'])
lexeme
k    203.0
p      NaN
q      NaN
s     63.0
Name: value, dtype: float64

Parameters:

group_on (List[str]) – columns for which absolute frequencies should be computed. If False, aggregates across all records.
mean (bool) – Defaults to False. If True, returns a mean instead of a sum.
skipna (bool) – Defaults to False. Skip nan values for sums or means.

Returns:

a Series which contains the output values. The index is either the original one, or the grouping columns.

Return type:

pandas.Series
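
The effect of mean and skipna on empty versus zero values can be sketched in plain Python, with math.nan standing in for an empty cell. The aggregate function is a hypothetical helper, not the pandas pipeline used by Qumin, and the values are taken from the first example above.

```python
import math


def aggregate(values, mean=False, skipna=False):
    """Sum or average frequencies, with optional NaN skipping (illustrative sketch)."""
    if skipna:
        values = [v for v in values if not math.isnan(v)]
    elif any(math.isnan(v) for v in values):
        # An empty measure makes the whole aggregate unknown.
        return math.nan
    if not values:
        # Choice made for this sketch: nothing left to aggregate.
        return math.nan
    total = sum(values)
    return total / len(values) if mean else total


freqs = [12.0, 6.0, 20.0, math.nan, 20.0]
print(aggregate(freqs))               # prints nan
print(aggregate(freqs, skipna=True))  # prints 58.0
```

A zero is an ordinary measure here: aggregate([0.0, 4.0], mean=True) averages over both values.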

get_relative_freq(group_on=False, uniform_duplicates=False, **kwargs)[source]¶

Returns the relative frequencies of a set of rows according to a set of grouping columns. If any of the values is empty, we generate a Uniform distribution for this group.

Note

To avoid long computations, we use C implementations. Unfortunately, skipna is not yet implemented in GroupBy.sum. For this reason, we use a more complex pipeline of C functions.

Examples

>>> p = fl.Package('tests/data/TestPackage/test.package.json')
>>> f = Frequencies(p)
>>> f.get_relative_freq(filters={'lexeme': 'p', 'cell':'first'}, group_on=["lexeme"])['result'].values
array([0.05882353, 0.94117647])
>>> f.get_relative_freq(filters={'lexeme': 's', 'cell':'second'}, group_on=["lexeme"])['result'].values
array([0., 1.])
>>> f.get_relative_freq(filters={'cell':"third"}, group_on=["cell"])['result'].values
array([0.25, 0.25, 0.25, 0.25])
>>> f.get_relative_freq(filters={'lexeme':'p'}, group_on=["lexeme", "cell"])['result'].values
array([0.05882353, 0.94117647, 1.        , 1.        , 1.        ])
>>> f.get_relative_freq(filters={'lexeme':'s', 'cell': 'first'}, group_on=["lexeme", "cell"]).result.values
array([0.33333333, 0.33333333, 0.33333333])

Parameters:

group_on (List[str]) – column on which relative frequencies should be computed
uniform_duplicates (bool) – Whether to give a uniform weight to duplicate items or a relative weight based on tokens.

Returns:

a DataFrame which contains a result column with the output value. The index is the original one. The grouping columns are also provided.

Return type:

pandas.DataFrame
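
For a single group, the rule "relative frequency, or uniform if any value is empty" can be sketched as below. relative_frequencies is a hypothetical helper operating on one group of values; the input numbers are illustrative and not read from the test package, and treating an all-zero group as uniform is an assumption of this sketch.

```python
import math


def relative_frequencies(values):
    """Relative weights within one group (illustrative sketch)."""
    n = len(values)
    if n == 0:
        return []
    has_empty = any(math.isnan(v) for v in values)
    total = sum(v for v in values if not math.isnan(v))
    if has_empty or total == 0:
        # Empty measure (or, by assumption, no mass at all): uniform distribution.
        return [1 / n] * n
    return [v / total for v in values]


print(relative_frequencies([0.0, 5.0]))                    # prints [0.0, 1.0]
print(relative_frequencies([math.nan, 20.0, 20.0, 20.0]))  # prints [0.25, 0.25, 0.25, 0.25]
```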

has_frequencies(table='forms')[source]¶

Returns True if the requested table contains real frequencies.

Parameters:: table (str) – name of the table to test.

info()[source]¶

Returns a convenient DataFrame with summary statistics.

Returns:: A summary of statistics about this Frequencies handler.
Return type:: pandas.DataFrame

p = None¶

source = {'cells': None, 'forms': None, 'lexemes': None}¶

qumin.representations.generalize module¶

author: Sacha Beniamine.

This module is used to generalize patterns’ contexts.

qumin.representations.generalize.generalize_alt(patterns, inv)[source]¶: Use the generalized alternation, using features when possible rather than segments.

qumin.representations.generalize.generalize_patterns(pats, inv)[source]¶

Generalize these patterns’ context.

Parameters:

pats (Iterable[Pattern]) – the patterns to generalize
inv – an Inventory instance

Returns:

a new pattern

Return type:

Pattern

qumin.representations.generalize.incremental_generalize_patterns(pats, inv)[source]¶

Merge patterns incrementally as long as the pattern has the same coverage.

Attempt to merge patterns two by two, and refrain from doing so if the merged pattern doesn’t match all the lexemes that led to its inference. Also attempt to merge together patterns that have not been merged with others.

Parameters:

pats – the patterns
inv – Inventory instance

Returns:

a list of patterns, at best of length 1, at worst of the same length as the input.

Return type:

List[Pattern]

qumin.representations.paradigms module¶

author: Sacha Beniamine and Jules Bouton.

Paradigms class to represent paralex paradigms.

class qumin.representations.paradigms.Paradigms(dataset, features_forms=None, **kwargs)[source]¶

Bases: object

Paradigms with methods to normalize them, merge and restore columns, etc.

__init__(dataset, features_forms=None, **kwargs)[source]¶

Read paradigms data, and prepare it according to a Segment class pool.

Parameters:

dataset (frictionless.Package) – paralex frictionless Package. All characters occurring in the paradigms except the first column should be inventoried in this class.
features_forms (List[str]) – list of form-level feature columns from the paralex forms table.
kwargs – additional arguments passed to Package.preprocess()

Returns:

paradigms table: (rows contain forms, lemmas, cells).

Return type:

paradigms (pandas.DataFrame)

cells = None¶

cells_dedup = None¶

data = None¶

default_cols = ('lexeme', 'cell', 'phon_form')¶

features = []¶

find_cell_duplicates()[source]¶: Identify duplicate cells (same forms everywhere).

get_empty_pattern_df(a, b)[source]¶

Returns an oriented dataframe to store patterns for two cells.

Parameters:

a (str) – cell A name
b (str) – cell B name

get_features()[source]¶: Return the features columns for the current paradigms.

preprocess(fillna=True, segcheck=True, defective=False, overabundant=False, cells=None, sample_lexemes=None, sample_cells=None, sample_kws=None, pos=None, resegment=False, lexemes_list=None, features_lexemes=None, **kwargs)[source]¶

Preprocess a Paralex paradigms table to meet the requirements of Qumin:

Filter by POS and by cells
Filter by frequency, sample
Filter overabundance and defectivity
Merge identical columns
Check segments and create Form() objects

Parameters:

fillna (bool) – Defaults to True. Should #DEF# be replaced by np.NaN ? Otherwise they are filled with empty strings (“”).
segcheck (bool) – Defaults to True. Should I check that all the phonological segments in the table are defined in the segments table?
defective (bool) – Defaults to False. Should I keep rows with defective forms?
overabundant (bool) – Defaults to False. Should I keep rows with overabundant forms?
features_lexemes (List[str]) – Lexeme level features to store with forms.
cells (List[str]) – List of cell names to consider. Defaults to all.
pos (List[str]) – List of parts of speech to consider. Defaults to all.
lexemes_list (path) – Path to a file containing one lexeme per row.
sample_lexemes (int) – Defaults to None. Should I sample n lexemes (for debug purposes)?
sample_cells (int) – Defaults to None. Should I sample n cells (for debug purposes)?
sample_kws (dict) – Dict of keywords passed to _sample_paradigms().
resegment (bool) – Defaults to False. Should I resegment the paradigms?

qumin.representations.patterns module¶

author: Sacha Beniamine.

This module addresses the modeling of inflectional alternation patterns.

exception qumin.representations.patterns.NotApplicable[source]¶

Bases: Exception

Raised when a Pattern can’t be applied to a form.

class qumin.representations.patterns.Pattern(alternation, context, inv)[source]¶

Bases: object

Represent the alternation pattern between two forms.

Applying the pattern to one of the original forms yields the second one.

As an example, we will use the following alternation in a present-tense verb of French:

cells	Forms	Transcription
prs.1.sg ⇌ prs.2.pl	j’amène ⇌ vous amenez	amEn ⇌ amənE

Example

>>> inv = Inventory.from_file("tests/data/frenchipa.csv")
>>> cells = ("prs.1.sg", "prs.2.pl")
>>> forms = (Form("a m E n"), Form("a m Ø n E"))
>>> p = Pattern.from_forms(cells, forms, inv)
>>> type(p)
<class 'qumin.representations.patterns.Pattern'>
>>> p
E_ ⇌ Ø_E / am_n_ <0>
>>> p.apply(Form("a m E n"), cells, inv)
Form(a m Ø n E)

__init__(alternation, context, inv)[source]¶

Constructor for Patterns.

Parameters:

cells (Iterable) – Cells labels (str), in the same order.
alternation (dict) – Dictionary of cells to alternating material (list of tuples)
context (Context) – a Context instance
inv – sounds Inventory

applicable(form, cell)[source]¶

Test if this pattern matches a form, i.e. if the pattern is applicable to the form.

Parameters:

form (str) – a form.
cell (str) – A cell contained in self.cells.

Returns:

whether the pattern is applicable to the form from that cell.

Return type:

bool

apply(form, names, inv, raiseOnFail=True)[source]¶

Apply the pattern to a form.

Parameters:

form – a form, assumed to belong to the cell names[0].
names – apply to a form of cell names[0] to produce a form of cell names[1] (default:self.cells). Patterns being non-oriented, it is better to use the names argument.
inv (segments.Inventory) – sound inventory
raiseOnFail (bool) – defaults to True. If true, raise an error when the pattern is not applicable to the form. If False, return None instead.

Returns:

a form belonging to the opposite cell.
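
One way to picture applicability and application is as a regular-expression match followed by a rewrite. The sketch below reuses the template string printed for p.context in the from_forms example, with one {} slot per alternating position; how Qumin uses that template internally is an assumption here, and apply_alternation and NotApplicableSketch are hypothetical names.

```python
import re


class NotApplicableSketch(Exception):
    """Stand-in for the NotApplicable exception (hypothetical name)."""


def apply_alternation(form, context, source, target, raise_on_fail=True):
    """Rewrite a space-separated form by swapping source material for target."""
    # Fill each {} slot with the source material; groups capture the context.
    regex = re.compile("^" + context.format(*(re.escape(s) for s in source)) + "$")
    match = regex.match(form + " ")  # every segment is followed by a space
    if match is None:
        if raise_on_fail:
            raise NotApplicableSketch(form)
        return None
    # Interleave the captured context with the target material.
    pieces = [kept + changed for kept, changed in zip(match.groups(), target)]
    return "".join(pieces).strip()


CONTEXT = "((?:a )(?:m )){}((?:n )){}"
print(apply_alternation("a m E n", CONTEXT, ("E ", ""), ("Ø ", "E ")))
# a m Ø n E
```

With raise_on_fail=False, a form that does not match the context yields None instead of raising.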

classmethod from_aligned(cells, alignment, inv)[source]¶

Create a pattern from aligned forms (aligns them left)

Parameters:

cells (Iterable) – Cells labels (str), in the same order.
alignment (Iterable) – Aligned forms (str) to be segmented.

classmethod from_forms(cells, forms, inv)[source]¶

Create a pattern from unaligned forms (aligns them left)

Example

>>> inv = Inventory.from_file("tests/data/frenchipa.csv")
>>> cells = ("prs.1.sg", "prs.2.pl")
>>> forms = (Form("a m E n"), Form("a m Ø n E"))
>>> p = Pattern.from_forms(cells, forms, inv)
>>> p
E_ ⇌ Ø_E / am_n_ <0>
>>> p.score # is zero at initialization
0
>>> p.lexemes # is empty at initialization
set()
>>> p.alternation
{'prs.1.sg': [('E',), ('',)], 'prs.2.pl': [('Ø',), ('E',)]}
>>> p.context # this is a Context
((?:a )(?:m )){}((?:n )){}
>>> p.cells
('prs.1.sg', 'prs.2.pl')
>>> p._repr
'E_ ⇌ Ø_E / am_n_'
>>> p._feat_str
'E_ ⇌ Ø_E / am_n_'
>>> p._gen_alt == {'prs.1.sg': ((frozenset({'ɑ̃', 'ɛ̃', 'i', 'j', 'E'}),), ('',)),
...                'prs.2.pl': ((frozenset({'ɥ', 'ɔ̃', 'y', 'Ø', 'œ̃'}),), ('E',))}
True

classmethod from_str(cells, string, inv)[source]¶

Parse an exported pattern.

To be parsed back, patterns need to be exported by repr(), not str().

Note: Phonemes in context classes are now separated by “,”

Parameters:

cells (tuple of str) – Cells labels (str).
string (str) – pattern given as a string.
inv (Inventory) – Sound inventory.

Returns:

a parsed Pattern object.

Return type:

Pattern

Example

>>> inv = Inventory.from_file("tests/data/frenchipa.csv")
>>> p = Pattern.from_str(('A', 'B'), "ɥ ⇌ yj / {E,O,a,b,d,f,g,i,j,k,l,m,n,p,s,t,u,v,w,y,z,Ø,ŋ,œ̃,ɑ̃,ɔ̃,ɛ̃,ɥ,ɲ,ʁ,ʃ,ʒ}*{b,d,f,g,k,l,m,n,p,s,t,v,z,ŋ,ɲ,ʁ,ʃ,ʒ}_E <58>", inv)
>>> type(p) is Pattern
True
>>> str(p)
'ɥ ⇌ yj / X*C_E'
>>> p
ɥ ⇌ yj / {E,O,a,b,d,f,g,i,j,k,l,m,n,p,s,t,u,v,w,y,z,Ø,ŋ,œ̃,ɑ̃,ɔ̃,ɛ̃,ɥ,ɲ,ʁ,ʃ,ʒ}*{b,d,f,g,k,l,m,n,p,s,t,v,z,ŋ,ɲ,ʁ,ʃ,ʒ}_E <58.0>
>>> p = Pattern.from_str(('A','B'), "E_ ⇌ Ø_E / am_n_ <0>", inv)
>>> type(p) is Pattern
True
>>> p
E_ ⇌ Ø_E / am_n_ <0.0>

is_identity()[source]¶

Checks whether this pattern is an identity pattern.

Example

>>> inv = Inventory.from_file("tests/data/frenchipa.csv")
>>> p = Pattern.new_identity(("A", "B"), inv)
>>> p.is_identity()
True

classmethod new_identity(cells, inv)[source]¶

Identity pattern factory.

The alternation is empty, and the context is a sequence of any number of allowed segments.

Parameters:

cells – Pair of cells for this pattern.
inv (Inventory) – Sound Inventory.

Returns:

a new identity pattern.

Return type:

Pattern

Example

>>> inv = Inventory.from_file("tests/data/frenchipa.csv")
>>> print(Pattern.new_identity(('A','B'), inv))
 ⇌  / X*

to_alt(inv, exhaustive_blanks=True, use_gen=False, **kwargs)[source]¶

Build a string representing the alternation

Example

>>> inv = Inventory.from_file("tests/data/frenchipa.csv")
>>> cells = ("prs.1.sg", "prs.2.pl")
>>> forms = (Form("a m E n"), Form("a m Ø n E"))
>>> p = Pattern.from_forms(cells, forms, inv)
>>> p.alternation
{'prs.1.sg': [('E',), ('',)], 'prs.2.pl': [('Ø',), ('E',)]}
>>> p.to_alt(inv)
'_E_ ⇌ _Ø_E'
>>> p.to_alt(inv, exhaustive_blanks=False)
'E_ ⇌ Ø_E'
>>> p.to_alt(inv, use_gen=True)
'_[-arro]_ ⇌ _[+arro]_E'

Parameters:

exhaustive_blanks (bool) – Whether initial and final contexts should be marked by a filler.
use_gen (bool) – Whether the alternation should use phonological generalizations (when available).

Returns:

A string representing the alternation, with contexts positions replaced by the filler “_”.

qumin.representations.patterns.are_all_identical(iterable)[source]¶: Test whether all elements in the iterable are identical.

qumin.representations.patternstore module¶

class qumin.representations.patternstore.PatternStore(*args, **kwargs)[source]¶

Bases: dict

This class stores all alternation patterns computed for a paradigm.

Variables:: cells (list[str]) – cells for which patterns are registered.

add_frequencies(frequencies, token_patterns=False, token_predictors=False, token_oa=False)[source]¶

Retrieves frequencies for pairs of forms and patterns. Frequencies are added to the following columns:

f_pred: frequency of the predictor.

f_pair: relative frequency of the pattern within the lexeme.

Parameters:

token_patterns (bool) – Whether to use token frequencies to compute pattern probabilities.
token_predictors (bool) – Whether to use token frequencies to weight the predictors.
token_oa (bool) – Whether to use token frequencies to compute the relative weight of overabundant forms.

algorithm = 'unset'¶

cells = []¶

export(md, optim_mem=False, **kwargs)[source]¶

Export dataframes to a folder for later use.

Parameters:: optim_mem (bool) – Whether to not export human readable patterns too. Defaults to False.

find_applicable(cpus=1, **kwargs)[source]¶

Find all applicable rules for each form.

We name sets of applicable rules classes. Classes are oriented: we produce two separate columns (a, b) and (b, a) for each pair of columns (a, b) in the paradigm.

Returns:: a table associating a lemma (index) and an ordered pair of paradigm cells (columns) with a tuple representing a class of applicable patterns.
Return type:: pandas.DataFrame

find_patterns(paradigms, *args, algorithm='edits', disable_tqdm=False, cpus=1, optim_mem=False, **kwargs)[source]¶

Find Patterns in a DataFrame.

Algorithms can be:

edits (dynamic alignment using levenshtein scores)
phon (dynamic alignment using segment similarity scores)

Patterns are chosen according to their coverage and accuracy among competing patterns, and they are merged as much as possible. Their alternation can be generalized.

This method updates the internal dict and does not return anything.

The internal dict is of shape dict of tuples to pd.DataFrame, where the tuples are pairs of cells and the dataframes hold patterns for pairs of cells.

Parameters:

paradigms (qumin.representations.paradigms.Paradigms) – Paradigms object.
algorithm (str) – method for scoring the best pairwise alignments. Can be “edits” or “phon”.
disable_tqdm (bool) – if true, do not show progressbar
cpus (int) – number of CPUs to use for parallelisation (defaults to 1)
optim_mem (bool) – whether to convert patterns to str to use less memory (defaults to False)

from_csv(path, pair, patterns_map, collection, paradigms, defective=True, overabundant=True, cells=None)[source]¶

Read a patterns dataframe for a specific pair of cells

Parameters:

paradigms (qumin.representations.paradigms.Paradigms) – Paradigms representation.
defective (bool) – whether to consider defective lexemes.
overabundant (bool) – whether to consider overabundance.
collection (defaultdict) – a defaultdict to avoid recomputing patterns from strings.
pair (tuple)
patterns_map (pandas.DataFrame) – a DataFrame of patterns and pattern ids.

from_file(patterns_md, paradigms, force=False, **kwargs)[source]¶

Read pattern data from a previous export.

Parameters:

patterns_md (qumin.utils.Metadata) – metadata handler from a previous run.
paradigms (qumin.representations.paradigms.Paradigms) – Paradigms representation.

get_pairs()[source]¶

has_applicable = False¶

info()[source]¶

pairs = []¶

to_csv(md, rel_path='patterns/machine_readable/')[source]¶

Export a Patterns DataFrame to csv.

Parameters:

md (qumin.utils.Metadata) – Metadata handler.
rel_path (str) – Relative path to the folder.

to_incidence_table(weighted=False, lexemes=None, flatten_columns=False)[source]¶

By default, creates a onehot encoding of the lexemes (the features are the patterns).

If weighted=True, the produced table weights overabundant patterns with their relative frequency (it is no longer a onehot encoding, but it can be useful in various situations).

If you use the weighting option, please ensure you previously added frequencies to your patterns with add_frequencies(). This is needed even if using a uniform distribution, to initialize it.

Parameters:

lexemes (iterable) – lexemes to consider, None defaults to all.
weighted (bool) – Whether to weight overabundant patterns by their frequency.
flatten_columns (bool) – Whether to flatten the multiindex.

Example

Lexeme	(A, B), pat1	(A,B), pat2	(A,C), pat3	…
lexeme_1	1	0	1	…
lexeme_2	0	1	1	…

Returns:: a table associating lexemes (rows) with patterns (columns)
Return type:: pandas.DataFrame
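
The unweighted table in the example above can be rebuilt in plain Python from a mapping of lexemes to their (cell pair, pattern) items. This sketch uses dictionaries and lists rather than a pandas DataFrame; incidence_table and the input mapping are hypothetical.

```python
def incidence_table(patterns_by_lexeme):
    """Build a 0/1 table: one row per lexeme, one column per (pair, pattern)."""
    columns = sorted({col for cols in patterns_by_lexeme.values() for col in cols})
    table = {
        lexeme: [1 if col in cols else 0 for col in columns]
        for lexeme, cols in patterns_by_lexeme.items()
    }
    return columns, table


patterns_by_lexeme = {
    "lexeme_1": {(("A", "B"), "pat1"), (("A", "C"), "pat3")},
    "lexeme_2": {(("A", "B"), "pat2"), (("A", "C"), "pat3")},
}
columns, table = incidence_table(patterns_by_lexeme)
print(table["lexeme_1"])  # prints [1, 0, 1]
print(table["lexeme_2"])  # prints [0, 1, 1]
```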

to_lattice(frequencies=None, **kwargs)[source]¶

Creates a Lattice from a patterns object.

Parameters:

patterns (qumin.representations.patternstore.PatternStore) – the patterns computed for the paradigms.
frequencies (qumin.representations.frequencies.Frequencies) – Will be used to sort microclasses.
kwargs (dict) – Optional keyword arguments will be passed to ICLattice.__init__()

Results:: qumin.lattice.lattice.ICLattice: An inflection class lattice.

to_md(md, pair, rel_path, abs_path)[source]¶

Export a Patterns DataFrame as a pretty markdown file

Parameters:

md (qumin.utils.Metadata) – Metadata handler.
pair (Tuple[str, str]) – pair of cells for which a pattern file should be exported.
rel_path (str) – Relative path to the folder.
abs_path (str) – Absolute path to the folder.

to_microclasses(incidence_table, frequencies=None)[source]¶

Computes microclasses from an incidence table.

Parameters:

onehot_encoding (pd.DataFrame) – a onehot_encoding or an incidence table.
frequencies (qumin.representations.frequencies.Frequencies) – an optional frequencies object (typically accessed through paradigms.frequencies) to sort the lexemes.

Results:: dict[Str, List]: a dictionary mapping a label to all lexemes of the microclass.
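
Conceptually, a microclass groups the lexemes whose rows in the incidence table are identical. A minimal sketch of that grouping, assuming rows are given as lists and that each class is labelled by its first member (Qumin may pick the label differently, for instance using frequencies); microclasses_sketch is a hypothetical name:

```python
from collections import defaultdict


def microclasses_sketch(table):
    """Group lexemes with identical incidence rows (illustrative sketch)."""
    groups = defaultdict(list)
    for lexeme, row in table.items():
        groups[tuple(row)].append(lexeme)
    # Assumption: the first lexeme encountered serves as the class label.
    return {members[0]: members for members in groups.values()}


rows = {"l1": [1, 0, 1], "l2": [0, 1, 1], "l3": [1, 0, 1]}
print(microclasses_sketch(rows))
# {'l1': ['l1', 'l3'], 'l2': ['l2']}
```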

unmerge_columns(paradigms)[source]¶

Recreates merged columns

Parameters:: paradigms (qumin.representations.Paradigms) – a Paradigms object.

qumin.representations.patternstore.patterns_job(*args)[source]¶

qumin.representations.quantity module¶

author: Sacha Beniamine.

This module provides Quantity objects to represent quantifiers.

class qumin.representations.quantity.Quantity(mini, maxi)[source]¶

Bases: object

Represents a quantifier as an interval.

This is a flyweight class and the presets are :

description	mini	maxi	regex symbol	variable name
Match one	1	1		quantity.one
Optional	0	1	?	quantity.optional
Some	1	inf	+	quantity.some
Any	0	inf	*	quantity.kleenestar
None	0	0

__init__(mini, maxi)[source]¶

Parameters:

mini (int) – the minimum number of elements matched.
maxi (int) – the maximum number of elements matched.

qumin.representations.quantity.quantity_largest(args)[source]¶

Reduce on the “&” operator of quantities.

Returns a quantity with the minimum left value and maximum right value.

Example

>>> quantity_largest([Quantity(0,1),Quantity(1,1),Quantity(1,np.inf)])
Quantity(0,inf)

Argument:: args: an iterable of quantities.

qumin.representations.quantity.quantity_sum(args)[source]¶

Reduce on the “+” operator of quantities.

Returns a quantity with the minimum left value and the sum of the right values.

Example

>>> quantity_largest([Quantity(0,1),Quantity(1,1),Quantity(0,0)])
Quantity(0,1)

Argument:: args: an iterable of quantities.
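
The two reductions can be sketched on plain (mini, maxi) tuples, following the prose descriptions of the “&” and “+” operators given above. This sketch does not model the flyweight Quantity class or its presets, so Qumin may return a different object for the same input; largest and summed are hypothetical names.

```python
import math
from functools import reduce


def largest(quantities):
    """Minimum of the left values, maximum of the right values."""
    return reduce(lambda a, b: (min(a[0], b[0]), max(a[1], b[1])), quantities)


def summed(quantities):
    """Minimum of the left values, sum of the right values."""
    return reduce(lambda a, b: (min(a[0], b[0]), a[1] + b[1]), quantities)


print(largest([(0, 1), (1, 1), (1, math.inf)]))  # prints (0, inf)
print(summed([(0, 1), (1, 1), (0, 0)]))          # prints (0, 2)
```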

qumin.representations.segments module¶

author: Sacha Beniamine.

This module addresses the modeling of phonological segments.

class qumin.representations.segments.Form(contents, form_id=None)[source]¶

Bases: str

A form is a string of sounds, separated by spaces. If a form is provided as defective, this information is still stored as a Form object with empty content. Defectiveness can be tested with:

>>> inv = Inventory.from_file("tests/data/frenchipa.csv")
>>> Form('').is_defective()
True

By default, we segment by cutting on spaces. If resegment=True, we remove spaces, and segment using the sound inventory’s list of valid phonemes.

Sounds might be more than one character long. Forms are strings, they are segmented at the object creation.

Variables:

tokens (Tuple) – Tuple of phonemes contained in this form. For defective entries, tokens are an empty tuple.
id (str) – form_id of the corresponding form according to the Paralex package. If unknown, None will be assigned.

__init__(string, form_id=None)[source]¶: The constructor assumes everything is already clean and normalized

classmethod from_raw(string, inventory, form_id=None, resegment=False)[source]¶

Use inventory to build a cleaned and normalized Form.

Parameters:

string – raw string for this form
inventory (segments.Inventory) – sound inventory
form_id – form identifier
resegment (bool) – defaults to False. Whether to re-segment phoneme tokens.

Returns: a formatted Form

is_defective()[source]¶

class qumin.representations.segments.Inventory(table, shorthands_table, normalization)[source]¶

Bases: object

The static segments.Inventory class describes a sound inventory.

>>> inv = Inventory.from_file("tests/data/frenchipa.csv")

Each sound class in the inventory is a concept in a FCA lattice. Sound class identifiers are either strings (for phonemes) or frozensets (for sound classes). Phonemes are the leaves of the hierarchy.

Sound classes can be seen as under-determined phonemes, and both phonemes and sound classes are handled in the same way. For this reason, we call both “sound”.

Variables:

context – the FCA context underlying the feature space
_score_matrix (dict) – a dictionary of sound tuples to alignment score
_gap_score (float) – a score for insertions
_normalization (dict) – a dictionary of sounds to their normalized counterparts
_segmenter (re.Pattern) – a compiled regex to segment words into phonemes
_legal_str (re.Pattern) – a compiled regex to recognize words made of known phonemes
_max (frozenset) – the identifier of the supremum in the lattice
_regexes (dict) – a dictionary of sound IDs to regex strings
_pretty_str (dict) – a dictionary of sound IDs to pretty formatted strings
_features (dict) – a dictionary of sound IDs to set of features
_features_str (dict) – a dictionary of sound IDs to a string representing features
_classes (dict) – a dictionary of sound IDs to a list of classes (ancestors)

classmethod calc_shorthand(lattice, shorthands)[source]¶

Calculate shorthand names for some lattice nodes.

Ex: ##C## in the sounds table might be a shorthand “C” for all consonants.

Parameters:

lattice – concept lattice
shorthands – table of shorthand definitions

Returns:

a dictionary of intents to their shorthand names

check_validity(lattice, table)[source]¶

Check validity of this sound inventory for Qumin.

Identifies when some segments are ancestors of others (some segments only differ from others through underspecification)

Parameters:

lattice – concept lattice
table – table of sounds

Raises: Exception

features(sound, **kwargs)[source]¶

Returns a set of features representing a sound.

Parameters:: sound – identifier of a sound
Returns:: features
Return type:: (set)

features_str(sound, **kwargs)[source]¶

Returns a string which describes the features of a sound.

Parameters:: sound – identifier of a sound
Returns:: features string
Return type:: (str)

classmethod from_file(filename)[source]¶

Initializes the inventory

Parameters:: filename – path to a csv or tsv file with distinctive features

get(descriptor)[source]¶

Get a sound using the lattice.

Parameters:: descriptor – iterable of phonemes OR iterable of features

Returns: (str or frozenset) sound identifier

get_from_transform(a, transform)[source]¶

Get a segment from another according to a transformation tuple.

Parameters:

a (str) – Segment alias
transform (tuple) – Pair of segment IDs

Example

>>> inv = Inventory.from_file("tests/data/frenchipa.csv")
>>> inv.get_from_transform("d",
...                                     (frozenset({"d","t"}),
...                                     frozenset({"s","z"})))
'z'

get_transform_features(left, right)[source]¶

Get the features corresponding to a transformation.

Parameters:

left (frozenset) – set of phonemes
right (frozenset) – set of phonemes

Example

>>> inv = Inventory.from_file("tests/data/frenchipa.csv")
>>> inv.get_transform_features({"b","d"}, {"p","t"})
(frozenset({'+voi'}), frozenset({'-voi'}))
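
One plausible reading of the example above can be sketched with a hypothetical toy feature table (not the real `frenchipa.csv` data, and not necessarily how Qumin computes it): collect the features shared by every phoneme on each side, then keep only those that differ between the two sides. The names `TOY_FEATURES`, `shared_features` and `transform_features` are illustrative assumptions.

```python
from functools import reduce

# Hypothetical toy feature table; the real inventory is read from a sounds file.
TOY_FEATURES = {
    "b": {"+voi", "+lab", "-cont"},
    "d": {"+voi", "-lab", "-cont"},
    "p": {"-voi", "+lab", "-cont"},
    "t": {"-voi", "-lab", "-cont"},
}


def shared_features(phonemes, table=TOY_FEATURES):
    """Return the features common to every phoneme in the set."""
    return frozenset(reduce(set.intersection, (set(table[p]) for p in phonemes)))


def transform_features(left, right, table=TOY_FEATURES):
    """Return the features specific to each side of a transformation."""
    left_common = shared_features(left, table)
    right_common = shared_features(right, table)
    # Features present on both sides (here "-cont") are not part of the change.
    return left_common - right_common, right_common - left_common


result = transform_features({"b", "d"}, {"p", "t"})
```

With this toy table, `result` pairs the voiced feature of the left set with the voiceless feature of the right set, mirroring the documented output.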

id_to_frozenset(sound_id)[source]¶

inf(a, b)[source]¶

Checks if a is a descendant of b.

a < b iff b has children and either a is a string which is part of b, or a is a subset of b.
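
The ordering described above can be sketched as follows. This is a minimal illustration, assuming that leaves (phonemes) are identified by strings and sound classes by frozensets of phonemes, and reading "subset" as a proper subset; `is_descendant` is a hypothetical name, not the library's method.

```python
def is_descendant(a, b):
    """Sketch of the ordering: leaves are str, classes are frozensets (assumed)."""
    if not isinstance(b, frozenset):
        # A leaf (single phoneme) has no children, so nothing is below it.
        return False
    if isinstance(a, str):
        # A phoneme is below every class that contains it.
        return a in b
    # Assumption: "subset" means proper subset, so a class is not below itself.
    return a < b
```

Under these assumptions, "t" is below the class {"t", "d"}, while "t" is never below another phoneme such as "d".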

infos(sound)[source]¶

String giving all useful information on a sound.

Parameters: sound – identifier of a sound
Returns: pretty string and features of a sound.

init_dissimilarity_matrix(gap_prop=0.5, **kwargs)[source]¶: Computes score matrix with dissimilarity scores.

insert_cost(*_)[source]¶: Returns the constant insertion/deletion cost

static is_leaf(sound)[source]¶

Returns whether this sound is a leaf (a phoneme, rather than a sound class)

Parameters: sound – identifier of a sound

Returns:

meet(*args)[source]¶

Finds the lowest common ancestor of segments from their identifiers.

Args: several sound identifiers

Returns: lowest common ancestor identifier

pretty_str(sound, **kwargs)[source]¶

Returns a pretty string representing a sound.

Parameters: sound – identifier of a sound
Returns: pretty string
Return type: (str)

classmethod read_sounds_file(filename)[source]¶

Read a sounds table from a file.

Parameters: filename – path to the file

Returns:

(table, shorthands, normalization):
table (pd.DataFrame) – normalized sound table
shorthands (pd.DataFrame) – table holding shorthands for some concepts
normalization (dict) – dictionary of normalizations for identical rows

regex(sound)[source]¶

Returns a regex representing a sound.

Parameters: sound – identifier of a sound
Returns: regex string
Return type: (str)

segment_form(wordform, resegment=False)[source]¶

Segment a form into phonemes (either following spaces, or using sound inventory)

Parameters:

wordform (str) – phonemic form
resegment (bool) – Whether to ignore spaces in the form and recompute the phonemic segmentation

Returns:

list of phonemes
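
The two segmentation strategies can be illustrated with a small standalone sketch. It assumes a hypothetical phoneme list and a hand-built regex; Qumin derives its own segmenter from the sound inventory, and the exact rule for choosing between the strategies may differ.

```python
import re

# Hypothetical phoneme list; Qumin builds its segmenter from the sound inventory.
PHONEMES = ["t", "ʃ", "tʃ", "a"]

# Longest alternatives first, so that "tʃ" is preferred over "t" followed by "ʃ".
SEGMENTER = re.compile(
    "|".join(re.escape(p) for p in sorted(PHONEMES, key=len, reverse=True))
)


def segment(wordform, resegment=False):
    """Split on spaces when present, otherwise (or on request) use the regex."""
    if " " in wordform and not resegment:
        return wordform.split()
    # Drop any spaces, then let the regex find the longest known phonemes.
    return SEGMENTER.findall(wordform.replace(" ", ""))
```

In this sketch, `segment("t a tʃ a")` follows the spaces, while `segment("t a t ʃ a", resegment=True)` discards them and regroups "t" and "ʃ" into the affricate "tʃ".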

shortest(sound, **kwargs)[source]¶

Returns a string which describes the sound in as few characters as possible.

Parameters: sound – identifier of a sound
Returns: short string
Return type: (str)

show_pool()[source]¶: Return a string description of the whole segment pool.

similarity(a, b)[source]¶

Computes phonological similarity (Frisch et al., 2004)

Measure from “Similarity avoidance and the OCP”, Frisch, S. A.; Pierrehumbert, J. B. & Broe, M. B. Natural Language & Linguistic Theory, Springer, 2004, 22, 179-228, p. 198:

“We compute similarity by comparing the number of shared and unshared natural classes of two consonants, using the equation in (7). This equation is a direct extension of the Pierrehumbert (1993) feature similarity metric to the case of natural classes.”

\(Similarity = \frac{\text{Shared natural classes}}{\text{Shared natural classes } + \text{Non-shared natural classes}}\)
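
As a worked illustration of the equation, the following sketch computes the ratio from two sets of natural classes. The function name and the class labels are hypothetical; in the inventory, the classes of a sound would be its ancestors in the lattice.

```python
def natural_class_similarity(classes_a, classes_b):
    """Shared natural classes over shared plus non-shared natural classes."""
    shared = len(classes_a & classes_b)
    # Classes that contain one of the two sounds but not the other.
    non_shared = len(classes_a ^ classes_b)
    if shared + non_shared == 0:
        raise ValueError("both sounds need at least one natural class")
    return shared / (shared + non_shared)


# Hypothetical class labels, for illustration only.
classes_t = {"obstruent", "stop", "coronal stop", "voiceless stop"}
classes_d = {"obstruent", "stop", "coronal stop", "voiced stop"}
sim_td = natural_class_similarity(classes_t, classes_d)
```

Here the two sounds share 3 classes and differ on 2, so the similarity is 3 / (3 + 2) = 0.6; a sound compared with itself scores 1.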

sub_cost(a, b)[source]¶

Returns the cost of aligning sounds a and b

Parameters:

a – sound identifier
b – sound identifier

Returns: (float): substitution cost

transformation(a, b)[source]¶

Find a transformation between a and b.

The transformation is a pair of two maximal sets of segments related by a bijective phonological function.

This function takes a pair of sound identifiers and calculates the function which relates these two segments. It then finds and returns the two maximal sets of segments related by this function.

Example

In French, t -> s can be expressed by a phonological function which changes [-cont] and [-rel. ret] to [+cont] and [+rel. ret].

These other segments are related by the same change: d -> z, b -> v, p -> f.

>>> inv = Inventory.from_file("tests/data/frenchipa.csv")
>>> a,b = inv.transformation("t","s")
>>> a == frozenset({'d', 't', 'b', 'p'})
True
>>> b == frozenset({'s', 'z', 'f', 'v'})
True

Parameters:

a (str) – Segment identifier.
b (str) – Segment identifier.

Returns:

two sets of sounds.

Return type:

tuple of frozenset

qumin.representations.segments.normalize(ipa, features)[source]¶

Assign a normalized segment to groups of segments with identical rows.

This function takes a segments table and adds a “Normalized” column in place. This column contains a common value for all segments with identical boolean values. The function also returns a translation table mapping indexes to normalized segments.

Note: index values are expected to be one character long.

Index	..features..	Normalized
ɛ	[…]	E
e	[…]	E

Parameters:

ipa (pandas.DataFrame) – Dataframe of segments. Columns are features, Unicode code point representation and segment names; indexes are segments.
features (list) – Feature columns’ names.

Returns:

translation table from each segment’s name to its normalized name.

Return type:

dict
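
The grouping idea behind the translation table can be sketched without pandas. This is a simplified illustration on plain dictionaries, not the function's implementation: `normalize_rows` is a hypothetical name, and picking the first segment in sorted order as the normalized name is an assumption (the table above uses "E" instead).

```python
from collections import defaultdict


def normalize_rows(rows):
    """Map each segment to one representative per group of identical rows.

    `rows` maps a segment to a tuple of feature values. Assumption: the
    representative is the first segment of each group in sorted order; the
    naming scheme Qumin actually uses may differ.
    """
    groups = defaultdict(list)
    for segment, values in rows.items():
        groups[tuple(values)].append(segment)
    return {
        segment: sorted(members)[0]
        for members in groups.values()
        for segment in members
    }


# Hypothetical rows: ɛ and e carry identical values, so they share a name.
toy_rows = {"ɛ": (1, 0, 0), "e": (1, 0, 0), "a": (1, 1, 0)}
translation = normalize_rows(toy_rows)
```

In this sketch, "ɛ" and "e" both map to the same normalized name, while "a" keeps its own.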

qumin.representations.segments.shorten_feature_names(table)[source]¶

qumin.representations.segments.sound_lattice_context(dataframe)[source]¶

Create a Context from a dataframe of properties.

Parameters: dataframe (pandas.DataFrame) – A dataframe of sound/feature incidence: -1 means not applicable, 1 means +feature, 0 means -feature.
Returns: the created Context
Return type: concepts.Context
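
The value coding can be made concrete with a small sketch that turns one row of such a table into a set of signed properties. It works on a plain dictionary and does not build a `concepts.Context`; `row_to_properties` and the feature names are hypothetical.

```python
def row_to_properties(row):
    """Convert one sound's feature values into signed property names."""
    properties = set()
    for feature, value in row.items():
        if value == 1:
            properties.add("+" + feature)
        elif value == 0:
            properties.add("-" + feature)
        # -1 means the feature is not applicable: no property is added.
    return frozenset(properties)


# Hypothetical row for a voiced stop with an inapplicable "nas" feature.
props = row_to_properties({"voi": 1, "cont": 0, "nas": -1})
```

Leaving inapplicable features out of the property set is what allows underspecified sounds to sit above fully specified ones in the resulting lattice.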

Module contents¶

author: Sacha Beniamine.

Utility functions for representations.