qumin.representations package

Submodules

qumin.representations.alignment module

author: Sacha Beniamine.

This module is used to align sequences.

qumin.representations.alignment.align_auto(s1, s2, insert_cost, sub_cost, distance_only=False, fillvalue='', **kwargs)[source]

Return all the best alignments of two words according to some edit distance matrix.

Parameters:
  • s1 (str) – first word to align

  • s2 (str) – second word to align

  • insert_cost (Callable) – A function which takes one value and returns an insertion cost

  • sub_cost (Callable) – A function which takes two values and returns a substitution cost

  • distance_only (bool) – defaults to False. If True, returns only the best distance. If False, returns an alignment.

  • fillvalue – (optional) the value with which to pad when iterables have varying lengths. Default: “”.

Returns:

Either an alignment (a list of lists of zipped tuples), or a distance (if distance_only is True).
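A minimal sketch of the idea behind this function, not Qumin's implementation: a dynamic-programming edit distance driven by the two cost callables. The names `edit_distance`, `unit_insert` and `unit_sub` are hypothetical, and the sketch only covers the distance (the `distance_only=True` case), not the enumeration of the best alignments.

```python
from typing import Callable, Sequence


def edit_distance(s1: Sequence, s2: Sequence,
                  insert_cost: Callable, sub_cost: Callable) -> float:
    """Weighted edit distance between two sequences (hypothetical helper)."""
    n, m = len(s1), len(s2)
    # dist[i][j] is the cheapest way to turn s1[:i] into s2[:j].
    dist = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = dist[i - 1][0] + insert_cost(s1[i - 1])
    for j in range(1, m + 1):
        dist[0][j] = dist[0][j - 1] + insert_cost(s2[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dist[i][j] = min(
                dist[i - 1][j] + insert_cost(s1[i - 1]),  # deletion
                dist[i][j - 1] + insert_cost(s2[j - 1]),  # insertion
                dist[i - 1][j - 1] + sub_cost(s1[i - 1], s2[j - 1]),
            )
    return dist[n][m]


def unit_insert(_):
    """Assumed constant insertion/deletion cost."""
    return 1


def unit_sub(a, b):
    """Assumed substitution cost: free for identical symbols."""
    return 0 if a == b else 1


print(edit_distance("mɪs", "mɪst", unit_insert, unit_sub))  # prints 1.0
```

With these unit costs the result is the plain Levenshtein distance; passing phonologically informed callables changes which alignments are cheapest without changing the recurrence.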

qumin.representations.alignment.align_baseline(*args, **kwargs)[source]

Simple alignment intended as an inflectional baseline. (Albright & Hayes 2002)

Assumes a single change, either prefixal, suffixal, or infixal. This doesn’t work well when there is both a prefix and a suffix. Used as a baseline for evaluation of the auto-aligned patterns.

see “Modeling English Past Tense Intuitions with Minimal Generalization”, Albright, A. & Hayes, B. Proceedings of the ACL-02 Workshop on Morphological and Phonological Learning - Volume 6, Association for Computational Linguistics, 2002, 58-69, page 2 :

“The exact procedure for finding a word-specific rule is as follows: given an input pair (X, Y), the model first finds the maximal left-side substring shared by the two forms (e.g., #mɪs), to create the C term (left side context). The model then examines the remaining material and finds the maximal substring shared on the right side, to create the D term (right side context). The remaining material is the change; the non-shared string from the first form is the A term, and from the second form is the B term.”

Examples

>>> align_baseline("mɪs","mas")
[('m', 'm'), ('ɪ', 'a'), ('s', 's')]
>>> align_baseline("mɪs","mɪst")
[('m', 'm'), ('ɪ', 'ɪ'), ('s', 's'), ('', 't')]
>>> align_baseline("mɪs","amɪs")
[('', 'a'), ('m', 'm'), ('ɪ', 'ɪ'), ('s', 's')]
>>> align_baseline("mɪst","amɪs")
[('m', 'a'), ('ɪ', 'm'), ('s', 'ɪ'), ('t', 's')]
Parameters:
  • *args – any number of iterables >= 2

  • fillvalue – the value with which to pad when iterables have varying lengths. Default: “”.

Returns:

a list of zipped tuples.
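The quoted procedure can be sketched as follows. This is an illustration under assumptions, not the code behind `align_baseline`: the helper name `baseline_split` is hypothetical, and it returns the four terms (C, A, B, D) rather than a list of zipped tuples.

```python
from os.path import commonprefix


def baseline_split(x: str, y: str):
    """Split (x, y) into C (left context), A and B (change), D (right context)."""
    # C: maximal substring shared on the left.
    c = commonprefix([x, y])
    rest_x, rest_y = x[len(c):], y[len(c):]
    # D: maximal substring shared on the right of the remaining material,
    # found as the common prefix of the reversed remainders.
    d = commonprefix([rest_x[::-1], rest_y[::-1]])[::-1]
    # A and B: whatever is left once both contexts are removed.
    a = rest_x[:len(rest_x) - len(d)]
    b = rest_y[:len(rest_y) - len(d)]
    return c, a, b, d


print(baseline_split("mɪs", "mas"))    # ('m', 'ɪ', 'a', 's')
print(baseline_split("mɪs", "mɪst"))   # ('mɪs', '', 't', '')
print(baseline_split("mɪst", "amɪs"))  # ('', 'mɪst', 'amɪs', '')
```

The last call shows the limitation noted above: with both a prefix and a suffix changing, no context is shared at either edge and the whole form ends up in the change.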

qumin.representations.alignment.align_left(*args, **kwargs)[source]

Align left all arguments (wrapper around zip_longest).

Examples

>>> align_left("mɪs","mas")
[('m', 'm'), ('ɪ', 'a'), ('s', 's')]
>>> align_left("mɪs","mɪst")
[('m', 'm'), ('ɪ', 'ɪ'), ('s', 's'), ('', 't')]
>>> align_left("mɪs","amɪs")
[('m', 'a'), ('ɪ', 'm'), ('s', 'ɪ'), ('', 's')]
>>> align_left("mɪst","amɪs")
[('m', 'a'), ('ɪ', 'm'), ('s', 'ɪ'), ('t', 's')]
Parameters:
  • *args – any number of iterables >= 2

  • fillvalue – the value with which to pad when iterables have varying lengths. Default: “”.

Returns:

a list of zipped tuples, left aligned.

qumin.representations.alignment.align_multi(*strings, **kwargs)[source]

Levenshtein-style alignment over arguments, two by two.

qumin.representations.alignment.align_right(*iterables, **kwargs)[source]

Align right all arguments. Zip longest with right alignment.

Examples

>>> align_right("mɪs","mas")
[('m', 'm'), ('ɪ', 'a'), ('s', 's')]
>>> align_right("mɪs","mɪst")
[('', 'm'), ('m', 'ɪ'), ('ɪ', 's'), ('s', 't')]
>>> align_right("mɪs","amɪs")
[('', 'a'), ('m', 'm'), ('ɪ', 'ɪ'), ('s', 's')]
>>> align_right("mɪst","amɪs")
[('m', 'a'), ('ɪ', 'm'), ('s', 'ɪ'), ('t', 's')]
Parameters:
  • *iterables – any number of iterables >= 2

  • fillvalue – the value with which to pad when iterables have varying lengths. Default: “”.

Returns:

a list of zipped tuples, right aligned.
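Left and right alignment can both be expressed with `itertools.zip_longest`. The sketch below is one plausible way to do it, assuming padding with a `fillvalue`; the names `left_align` and `right_align` are hypothetical stand-ins, not the module's functions.

```python
from itertools import zip_longest


def left_align(*iterables, fillvalue=""):
    """Pad shorter inputs on the right."""
    return list(zip_longest(*iterables, fillvalue=fillvalue))


def right_align(*iterables, fillvalue=""):
    """Pad shorter inputs on the left."""
    # Reverse each input, pad at the end, then restore the original order.
    reversed_inputs = [list(it)[::-1] for it in iterables]
    return list(zip_longest(*reversed_inputs, fillvalue=fillvalue))[::-1]


print(left_align("mɪs", "mɪst"))
# [('m', 'm'), ('ɪ', 'ɪ'), ('s', 's'), ('', 't')]
print(right_align("mɪs", "mɪst"))
# [('', 'm'), ('m', 'ɪ'), ('ɪ', 's'), ('s', 't')]
```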

qumin.representations.alignment.commonprefix(*args)[source]

Given a list of strings, returns the longest common prefix

qumin.representations.alignment.commonsuffix(*args)[source]

Given a list of strings, returns the longest common suffix

qumin.representations.alignment.edits_ins_cost(*_)[source]
qumin.representations.alignment.edits_sub_cost(a, b)[source]
qumin.representations.alignment.multi_sub_cost(a, b)[source]

qumin.representations.contexts module

author: Sacha Beniamine.

This module implements patterns’ contexts, which are series of phonological restrictions.

class qumin.representations.contexts.Context(segments, inv)[source]

Bases: object

Context for an alternation pattern

feat_str(inv)[source]
classmethod merge(contexts, inv)[source]

Merge contexts to generalize them.

Merge contexts and combine their restrictions into a new context.

Parameters:
  • contexts – iterable of Contexts.

  • inv – Inventory instance

Returns:

a merged context

to_str(inv, mode=2)[source]

qumin.representations.frequencies module

author: Jules Bouton.

Class for frequency management.

class qumin.representations.frequencies.Frequencies(package, *args, source=False, **kwargs)[source]

Bases: object

Frequency management for a Paralex dataset. Frequencies are built for forms, lexemes and cells.

The parsed frequency columns or tables should conform to the Paralex principles:
  • An empty value means that there is no measure available

  • A zero value means that there is a measure, which is zero

When aggregating across rows, any empty cell yields a uniform distribution for the whole set of rows, whereas zeros are taken into account. This behaviour can be disabled for some functions by passing skipna=True.

Examples

>>> p = fl.Package('tests/data/TestPackage/test.package.json')
>>> f = Frequencies(p)
>>> print(f.info().to_markdown())
| Table   | Source      |   Records |   Sum(f) |   Mean(f) |
|:--------|:------------|----------:|---------:|----------:|
| forms   | forms_table |        22 |      519 |   27.3158 |
| lexemes | forms_table |         4 |      519 |  129.75   |
| cells   | forms_table |         4 |      519 |  129.75   |
Variables:
  • p (frictionless.Package) – package to analyze

  • source (Dict[str, str]) – source used by default for each table. Contains either a value for the source field of a Paralex frequency table, or the name of the table used to extract the frequency.

  • forms (pandas.DataFrame) – Table of frequency values associated to a form_id.

  • lexemes (pandas.DataFrame) – Table of frequency values associated to a lexeme_id.

  • cells (pandas.DataFrame) – Table of frequency values associated to a cell_id.

__init__(package, *args, source=False, **kwargs)[source]

Constructor for Frequencies. We gather and store frequencies for forms, lexemes and cells. Behaviour is the following:

  • If force_uniform is True, we use the paradigms table to generate a Uniform distribution.

  • If not, we try to get a frequency column from the tables: forms, lexemes, cells

  • If any of those is missing, we use the frequencies table.

  • If we can’t use the frequency table, we fall back to a uniform distribution.

Parameters:
  • package (frictionless.Package) – package to analyze

  • source (Dict[str, str]) – name of the source to use when several are available.

  • **kwargs – keyword arguments for frequency reading methods.

col_names = ['lexeme', 'cell', 'form']
drop_unused(paradigms)[source]

If the paradigms table implied some sampling / filtering, make sure that the frequencies are also sampled.

get_absolute_freq(mean=False, group_on=False, skipna=False, **kwargs)[source]

Return the frequency of an item for a given source

The frequency of an item is defined as the sum of the frequencies of this item across all rows.

Examples

>>> p = fl.Package('tests/data/TestPackage/test.package.json')
>>> f = Frequencies(p)
>>> f.get_absolute_freq(filters={'lexeme':'q'}, group_on="index", skipna=True)
form
11    12.0
12     6.0
14    20.0
18     NaN
23    20.0
Name: value, dtype: float64
>>> float(f.get_absolute_freq(filters={'lexeme':'q'}))
nan
>>> float(f.get_absolute_freq(filters={'cell':'third'}, mean=True, skipna=True))
20.0
>>> f.get_absolute_freq(group_on=['lexeme'])
lexeme
k    203.0
p      NaN
q      NaN
s     63.0
Name: value, dtype: float64
Parameters:
  • group_on (List[str]) – columns for which absolute frequencies should be computed. If False, aggregates across all records.

  • mean (bool) – Defaults to False. If True, returns a mean instead of a sum.

  • skipna (bool) – Defaults to False. Skip nan values for sums or means.

Returns:

a Series which contains the output values.

The index is either the original one, or the grouping columns.

Return type:

pandas.Series
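The distinction between empty values and zeros, and the effect of `skipna`, can be illustrated without pandas. This is a simplified sketch, not the method's implementation: the helper `absolute_freq` is hypothetical and uses `None` to stand for an empty value.

```python
import math


def absolute_freq(values, mean=False, skipna=False):
    """Aggregate frequency measures; None marks an empty (unmeasured) value."""
    known = [v for v in values if v is not None]
    # Without skipna, a single missing measure makes the aggregate undefined.
    # With skipna, the result is still undefined when nothing was measured.
    if not known or (len(known) < len(values) and not skipna):
        return math.nan
    total = float(sum(known))
    return total / len(known) if mean else total


print(absolute_freq([12, 6, None]))               # nan
print(absolute_freq([12, 6, None], skipna=True))  # 18.0
print(absolute_freq([0, 5]))                      # 5.0 (zero is a real measure)
```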

get_relative_freq(group_on=False, uniform_duplicates=False, **kwargs)[source]

Returns the relative frequencies of a set of rows according to a set of grouping columns. If any of the values is empty, we generate a Uniform distribution for this group.

Note

To avoid long computations, we use C implementations. Unfortunately, skipna is not yet implemented in GroupBy.sum. For this reason, we use a more complex pipeline of C functions.

Examples

>>> p = fl.Package('tests/data/TestPackage/test.package.json')
>>> f = Frequencies(p)
>>> f.get_relative_freq(filters={'lexeme': 'p', 'cell':'first'}, group_on=["lexeme"])['result'].values
array([0.05882353, 0.94117647])
>>> f.get_relative_freq(filters={'lexeme': 's', 'cell':'second'}, group_on=["lexeme"])['result'].values
array([0., 1.])
>>> f.get_relative_freq(filters={'cell':"third"}, group_on=["cell"])['result'].values
array([0.25, 0.25, 0.25, 0.25])
>>> f.get_relative_freq(filters={'lexeme':'p'}, group_on=["lexeme", "cell"])['result'].values
array([0.05882353, 0.94117647, 1.        , 1.        , 1.        ])
>>> f.get_relative_freq(filters={'lexeme':'s', 'cell': 'first'}, group_on=["lexeme", "cell"]).result.values
array([0.33333333, 0.33333333, 0.33333333])
Parameters:
  • group_on (List[str]) – column on which relative frequencies should be computed

  • uniform_duplicates (bool) – Whether to give a uniform weight to duplicate items or a relative weight based on tokens.

Returns:

a DataFrame which contains a result column with the output value.

The index is the original one. The grouping columns are also provided.

Return type:

pandas.DataFrame
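The uniform fallback described for this method can be sketched for a single group as follows. The helper `relative_freq` is hypothetical, `None` stands for an empty value, and treating an all-zero group as uniform is an assumption made here only to avoid a division by zero.

```python
def relative_freq(values):
    """Relative frequencies within one group; None marks an empty value."""
    n = len(values)
    if n == 0:
        return []
    # Any empty value (or, by assumption, a zero total) gives a uniform distribution.
    if any(v is None for v in values) or sum(values) == 0:
        return [1 / n] * n
    total = sum(values)
    return [v / total for v in values]


print(relative_freq([0, 5]))           # [0.0, 1.0]
print(relative_freq([None, 3, 4, 1]))  # [0.25, 0.25, 0.25, 0.25]
```

A zero keeps its weight of zero, while one empty value is enough to discard the measured values of the whole group.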

has_frequencies(table='forms')[source]

Returns True if the requested table contains real frequencies.

Parameters:

table (str) – name of the table to test.

info()[source]

Returns a convenient DataFrame with summary statistics.

Returns:

A summary of statistics about this Frequencies handler.

Return type:

pandas.DataFrame

p = None
source = {'cells': None, 'forms': None, 'lexemes': None}

qumin.representations.generalize module

author: Sacha Beniamine.

This module is used to generalize patterns’ contexts.

qumin.representations.generalize.generalize_alt(patterns, inv)[source]

Use the generalized alternation, using features when possible rather than segments.

qumin.representations.generalize.generalize_patterns(pats, inv)[source]

Generalize these patterns’ context.

Parameters:
  • pats (Iterable[Pattern]) – the patterns to generalize

  • inv – an Inventory instance

Returns:

a new pattern

Return type:

Pattern

qumin.representations.generalize.incremental_generalize_patterns(pats, inv)[source]

Merge patterns incrementally as long as the pattern has the same coverage.

Attempt to merge patterns two by two, and refrain from doing so if the merged pattern doesn’t match all the lexemes that led to its inference. Also attempt to merge together patterns that have not been merged with others.

Parameters:
  • pats – the patterns

  • inv – Inventory instance

Returns:

a list of patterns, at best of length 1, at worst of the same length as the input.

Return type:

List[Pattern]

qumin.representations.paradigms module

author: Sacha Beniamine and Jules Bouton.

Paradigms class to represent paralex paradigms.

class qumin.representations.paradigms.Paradigms(dataset, **kwargs)[source]

Bases: object

Paradigms with methods to normalize them, merge and restore columns, etc.

__init__(dataset, **kwargs)[source]

Read paradigms data, and prepare it according to a Segment class pool.

Parameters:
  • dataset (frictionless.Package) – Paralex frictionless Package. All characters occurring in the paradigms except the first column should be inventoried in this class.

  • kwargs – additional arguments passed to Package.preprocess()

Returns:

paradigms table (rows contain forms, lemmas, cells).

Return type:

paradigms (pandas.DataFrame)

cells = None
cells_dedup = None
data = None
default_cols = ('lexeme', 'cell', 'phon_form')
find_cell_duplicates()[source]

Identify duplicate cells (same forms everywhere).

get_empty_pattern_df(a, b)[source]

Returns an oriented dataframe to store patterns for two cells.

Parameters:
  • a (str) – cell A name

  • b (str) – cell B name

preprocess(fillna=True, segcheck=True, defective=False, overabundant=False, cells=None, sample_lexemes=None, sample_cells=None, sample_kws=None, pos=None, resegment=False, lexemes_list=None, **kwargs)[source]
Preprocess a Paralex paradigms table to meet the requirements of Qumin:
  • Filter by POS and by cells

  • Filter by frequency, sample

  • Filter overabundance and defectivity

  • Merge identical columns

  • Check segments and create Form() objects

Parameters:
  • fillna (bool) – Defaults to True. Should #DEF# be replaced by np.NaN ? Otherwise they are filled with empty strings (“”).

  • segcheck (bool) – Defaults to True. Should I check that all the phonological segments in the table are defined in the segments table?

  • defective (bool) – Defaults to False. Should I keep rows with defective forms?

  • overabundant (bool) – Defaults to False. Should I keep rows with overabundant forms?

  • cells (List[str]) – List of cell names to consider. Defaults to all.

  • pos (List[str]) – List of parts of speech to consider. Defaults to all.

  • lexemes_list (path) – Path to a file containing one lexeme per row.

  • sample_lexemes (int) – Defaults to None. Should I sample n lexemes (for debug purposes)?

  • sample_cells (int) – Defaults to None. Should I sample n cells (for debug purposes)?

  • sample_kws (dict) – Dict of keywords passed to _sample_paradigms().

  • resegment (bool) – Defaults to False. Should I resegment the paradigms?

qumin.representations.patterns module

author: Sacha Beniamine.

This module addresses the modeling of inflectional alternation patterns.

exception qumin.representations.patterns.NotApplicable[source]

Bases: Exception

Raised when a Pattern can’t be applied to a form.

class qumin.representations.patterns.Pattern(alternation, context, inv)[source]

Bases: object

Represent the alternation pattern between two forms.

Applying the pattern to one of the original forms yields the second one.

As an example, we will use the following alternation in a French present-tense verb:

| Cells               | Forms                 | Transcription |
|:--------------------|:----------------------|:--------------|
| prs.1.sg ⇌ prs.2.pl | j’amène ⇌ vous amenez | amEn ⇌ amənE  |

Example

>>> inv = Inventory.from_file("tests/data/frenchipa.csv")
>>> cells = ("prs.1.sg", "prs.2.pl")
>>> forms = (Form("a m E n"), Form("a m Ø n E"))
>>> p = Pattern.from_forms(cells, forms, inv)
>>> type(p)
<class 'qumin.representations.patterns.Pattern'>
>>> p
E_ ⇌ Ø_E / am_n_ <0>
>>> p.apply(Form("a m E n"), cells, inv)
Form(a m Ø n E)
__init__(alternation, context, inv)[source]

Constructor for Patterns.

Parameters:
  • cells (Iterable) – Cells labels (str), in the same order.

  • alternation (dict) – Dictionary of cells to alternating material (list of tuples)

  • context (Context) – a Context instance

  • inv – sounds Inventory

applicable(form, cell)[source]

Test if this pattern matches a form, i.e. if the pattern is applicable to the form.

Parameters:
  • form (str) – a form.

  • cell (str) – A cell contained in self.cells.

Returns:

whether the pattern is applicable to the form from that cell.

Return type:

bool

apply(form, names, inv, raiseOnFail=True)[source]

Apply the pattern to a form.

Parameters:
  • form – a form, assumed to belong to the cell names[0].

  • names – pair of cell names: the pattern is applied to a form of cell names[0] to produce a form of cell names[1] (default: self.cells). Since patterns are non-oriented, it is better to pass the names argument explicitly.

  • inv (segments.Inventory) – sound inventory

  • raiseOnFail (bool) – defaults to True. If true, raise an error when the pattern is not applicable to the form. If False, return None instead.

Returns:

a form belonging to the opposite cell.

classmethod from_aligned(cells, alignment, inv)[source]

Create a pattern from aligned forms (aligns them left)

Parameters:
  • cells (Iterable) – Cells labels (str), in the same order.

  • alignment (Iterable) – Aligned forms (str) to be segmented.

classmethod from_forms(cells, forms, inv)[source]

Create a pattern from unaligned forms (aligns them left)

Example

>>> inv = Inventory.from_file("tests/data/frenchipa.csv")
>>> cells = ("prs.1.sg", "prs.2.pl")
>>> forms = (Form("a m E n"), Form("a m Ø n E"))
>>> p = Pattern.from_forms(cells, forms, inv)
>>> p
E_ ⇌ Ø_E / am_n_ <0>
>>> p.score # is zero at initialization
0
>>> p.lexemes # is empty at initialization
set()
>>> p.alternation
{'prs.1.sg': [('E',), ('',)], 'prs.2.pl': [('Ø',), ('E',)]}
>>> p.context # this is a Context
((?:a )(?:m )){}((?:n )){}
>>> p.cells
('prs.1.sg', 'prs.2.pl')
>>> p._repr
'E_ ⇌ Ø_E / am_n_'
>>> p._feat_str
'E_ ⇌ Ø_E / am_n_'
>>> p._gen_alt == {'prs.1.sg': ((frozenset({'ɑ̃', 'ɛ̃', 'i', 'j', 'E'}),), ('',)),
...                'prs.2.pl': ((frozenset({'ɥ', 'ɔ̃', 'y', 'Ø', 'œ̃'}),), ('E',))}
True
classmethod from_str(cells, string, inv)[source]

Parse an exported pattern.

To be parsed back, patterns need to be exported by repr(), not str().

Note: Phonemes in context classes are now separated by “,”

Parameters:
  • cells (tuple of str) – Cells labels (str).

  • string (str) – pattern given as a string.

  • inv (Inventory) – Sound inventory.

Returns:

a parsed Pattern object.

Return type:

Pattern

Example

>>> inv = Inventory.from_file("tests/data/frenchipa.csv")
>>> p = Pattern.from_str(('A', 'B'), "ɥ ⇌ yj / {E,O,a,b,d,f,g,i,j,k,l,m,n,p,s,t,u,v,w,y,z,Ø,ŋ,œ̃,ɑ̃,ɔ̃,ɛ̃,ɥ,ɲ,ʁ,ʃ,ʒ}*{b,d,f,g,k,l,m,n,p,s,t,v,z,ŋ,ɲ,ʁ,ʃ,ʒ}_E <58>", inv)
>>> type(p) is Pattern
True
>>> str(p)
'ɥ ⇌ yj / X*C_E'
>>> p
ɥ ⇌ yj / {E,O,a,b,d,f,g,i,j,k,l,m,n,p,s,t,u,v,w,y,z,Ø,ŋ,œ̃,ɑ̃,ɔ̃,ɛ̃,ɥ,ɲ,ʁ,ʃ,ʒ}*{b,d,f,g,k,l,m,n,p,s,t,v,z,ŋ,ɲ,ʁ,ʃ,ʒ}_E <58.0>
>>> p = Pattern.from_str(('A','B'), "E_ ⇌ Ø_E / am_n_ <0>", inv)
>>> type(p) is Pattern
True
>>> p
E_ ⇌ Ø_E / am_n_ <0.0>
is_identity()[source]

Checks whether this pattern is an identity pattern.

Example

>>> inv = Inventory.from_file("tests/data/frenchipa.csv")
>>> p = Pattern.new_identity(("A", "B"), inv)
>>> p.is_identity()
True
classmethod new_identity(cells, inv)[source]

Identity pattern factory.

The alternation is empty, and the context is a sequence of any number of allowed segments.

Parameters:
  • cells – Pair of cell for this pattern.

  • inv (Inventory) – Sound Inventory.

Returns:

a new identity pattern.

Return type:

Pattern

Example

>>> inv = Inventory.from_file("tests/data/frenchipa.csv")
>>> print(Pattern.new_identity(('A','B'), inv))
 ⇌  / X*
to_alt(inv, exhaustive_blanks=True, use_gen=False, **kwargs)[source]

Build a string representing the alternation

Example

>>> inv = Inventory.from_file("tests/data/frenchipa.csv")
>>> cells = ("prs.1.sg", "prs.2.pl")
>>> forms = (Form("a m E n"), Form("a m Ø n E"))
>>> p = Pattern.from_forms(cells, forms, inv)
>>> p.alternation
{'prs.1.sg': [('E',), ('',)], 'prs.2.pl': [('Ø',), ('E',)]}
>>> p.to_alt(inv)
'_E_ ⇌ _Ø_E'
>>> p.to_alt(inv, exhaustive_blanks=False)
'E_ ⇌ Ø_E'
>>> p.to_alt(inv, use_gen=True)
'_[-arro]_ ⇌ _[+arro]_E'
Parameters:
  • exhaustive_blanks (bool) – Whether initial and final contexts should be marked by a filler.

  • use_gen (bool) – Whether the alternation should use phonological generalizations (when available).

Returns:

A string representing the alternation, with contexts positions replaced by the filler “_”.

qumin.representations.patterns.are_all_identical(iterable)[source]

Test whether all elements in the iterable are identical.

qumin.representations.quantity module

author: Sacha Beniamine.

This module provides Quantity objects to represent quantifiers.

class qumin.representations.quantity.Quantity(mini, maxi)[source]

Bases: object

Represents a quantifier as an interval.

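This is a flyweight class and the presets are:

| Description | mini | maxi | Regex symbol | Variable name       |
|:------------|-----:|-----:|:-------------|:--------------------|
| Match one   |    1 |    1 |              | quantity.one        |
| Optional    |    0 |    1 | ?            | quantity.optional   |
| Some        |    1 |  inf | +            | quantity.some       |
| Any         |    0 |  inf | *            | quantity.kleenestar |
| None        |    0 |    0 |              |                     |
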

__init__(mini, maxi)[source]
Parameters:
  • mini (int) – the minimum number of elements matched.

  • maxi (int) – the maximum number of elements matched.

qumin.representations.quantity.quantity_largest(args)[source]

Reduce on the “&” operator of quantities.

Returns a quantity with the minimum left value and maximum right value.

Example

>>> quantity_largest([Quantity(0,1),Quantity(1,1),Quantity(1,np.inf)])
Quantity(0,inf)
Argument:

args: an iterable of quantities.

qumin.representations.quantity.quantity_sum(args)[source]

Reduce on the “+” operator of quantities.

Returns a quantity with the minimum left value and the sum of the right values.

Example

>>> quantity_largest([Quantity(0,1),Quantity(1,1),Quantity(0,0)])
Quantity(0,1)
Argument:

args: an iterable of quantities.
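The two reductions can be sketched with a small stand-in class. This follows the descriptions above and is not the module's code: `Interval`, `largest` and `total` are hypothetical names, and the sketch does not reproduce the flyweight behaviour of Quantity.

```python
from dataclasses import dataclass
from functools import reduce
from math import inf


@dataclass(frozen=True)
class Interval:
    """Hypothetical stand-in for Quantity: a (mini, maxi) interval."""
    mini: float
    maxi: float

    def __and__(self, other):
        # Widest interval covering both quantifiers.
        return Interval(min(self.mini, other.mini), max(self.maxi, other.maxi))

    def __add__(self, other):
        # Minimum of the lower bounds, sum of the upper bounds.
        return Interval(min(self.mini, other.mini), self.maxi + other.maxi)


def largest(intervals):
    """Reduce on the "&" operator."""
    return reduce(lambda a, b: a & b, intervals)


def total(intervals):
    """Reduce on the "+" operator."""
    return reduce(lambda a, b: a + b, intervals)


print(largest([Interval(0, 1), Interval(1, 1), Interval(1, inf)]))
# Interval(mini=0, maxi=inf)
print(total([Interval(0, 1), Interval(1, 1), Interval(0, 0)]))
# Interval(mini=0, maxi=2)
```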

qumin.representations.segments module

author: Sacha Beniamine.

This module addresses the modeling of phonological segments.

class qumin.representations.segments.Form(contents, form_id=None)[source]

Bases: str

A form is a string of sounds, separated by spaces. If a form is provided as defective, this information is still stored as a Form object with empty content. Defectiveness can be tested with:

>>> inv = Inventory.from_file("tests/data/frenchipa.csv")
>>> Form('').is_defective()
True

By default, we segment by cutting on spaces. If resegment=True, we remove spaces, and segment using the sound inventory’s list of valid phonemes.

Sounds might be more than one character long. Forms are strings; they are segmented when the object is created.

Variables:
  • tokens (Tuple) – Tuple of phonemes contained in this form. For defective entries, tokens are an empty tuple.

  • id (str) – form_id of the corresponding form according to the Paralex package. If unknown, None will be assigned.

__init__(string, form_id=None)[source]

The constructor assumes everything is already clean and normalized

classmethod from_raw(string, inventory, form_id=None, resegment=False)[source]

Use inventory to build a cleaned and normalized Form.

Parameters:
  • string – raw string for this form

  • inventory (segments.Inventory) – sound inventory

  • form_id – form identifier

  • resegment (bool) – defaults to False. Whether to re-segment phoneme tokens.

Returns: a formatted Form

is_defective()[source]
class qumin.representations.segments.Inventory(table, shorthands_table, normalization)[source]

Bases: object

The static segments.Inventory class describes a sound inventory.

>>> inv = Inventory.from_file("tests/data/frenchipa.csv")

Each sound class in the inventory is a concept in a FCA lattice. Sound class identifiers are either strings (for phonemes) or frozensets (for sound classes). Phonemes are the leaves of the hierarchy.

Sound classes can be seen as under-determined phonemes, and both phonemes and sound classes are handled in the same way. For this reason, we call both “sound”.

Variables:
  • context – the FCA context underlying the feature space

  • _score_matrix (dict) – a dictionary of sound tuples to alignment score

  • _gap_score (float) – a score for insertions

  • _normalization (dict) – a dictionary of sounds to their normalized counterparts

  • _segmenter (re.Pattern) – a compiled regex to segment words into phonemes

  • _legal_str (re.Pattern) – a compiled regex to recognize words made of known phonemes

  • _max (frozenset) – the identifier of the supremum in the lattice

  • _regexes (dict) – a dictionary of sound IDs to regex strings

  • _pretty_str (dict) – a dictionary of sound IDs to pretty formatted strings

  • _features (dict) – a dictionary of sound IDs to set of features

  • _features_str (dict) – a dictionary of sound IDs to a string representing features

  • _classes (dict) – a dictionary of sound IDs to a list of classes (ancestors)

classmethod calc_shorthand(lattice, shorthands)[source]

Calculate shorthand names for some lattice nodes.

Ex: ##C## in the sounds table might be a shorthand “C” for all consonants.

Parameters:
  • lattice – concept lattice

  • shorthands – table of shorthand definitions

Returns:

a dictionary of intents to their shorthand names

check_validity(lattice, table)[source]

Check validity of this sound inventory for Qumin.

Identifies when some segments are ancestors of others (some segments only differ from others through underspecification)

Parameters:
  • lattice – concept lattice

  • table – table of sounds

Raises: Exception

features(sound, **kwargs)[source]

Returns a set of features representing a sound.

Parameters:

sound – identifier of a sound

Returns:

features

Return type:

(set)

features_str(sound, **kwargs)[source]

Returns a string which describes the features of a sound.

Parameters:

sound – identifier of a sound

Returns:

features string

Return type:

(str)

classmethod from_file(filename)[source]

Initializes the inventory

Parameters:

filename – path to a csv or tsv file with distinctive features

get(descriptor)[source]

Get a sound using the lattice.

Parameters:

descriptor – iterable of phonemes OR iterable of features

Returns: (str or frozenset) sound identifier

get_from_transform(a, transform)[source]

Get a segment from another according to a transformation tuple.

Parameters:
  • a (str) – Segment alias

  • transform (tuple) – Couple of two segment IDs

Example

>>> inv = Inventory.from_file("tests/data/frenchipa.csv")
>>> inv.get_from_transform("d",
...                                     (frozenset({"d","t"}),
...                                     frozenset({"s","z"})))
'z'
get_transform_features(left, right)[source]

Get the features corresponding to a transformation.

Parameters:

Example

>>> inv = Inventory.from_file("tests/data/frenchipa.csv")
>>> inv.get_transform_features({"b","d"}, {"p","t"})
(frozenset({'+voi'}), frozenset({'-voi'}))
id_to_frozenset(sound_id)[source]
inf(a, b)[source]

Checks if a is a descendant of b.

a < b iff b has children and either a is a string which is part of b, or a is a subset of b.

infos(sound)[source]

String giving all useful information on a sound.

Parameters:

sound – identifier of a sound

Returns:

pretty string and features of a sound.

init_dissimilarity_matrix(gap_prop=0.5, **kwargs)[source]

Computes score matrix with dissimilarity scores.

insert_cost(*_)[source]

Returns the constant insertion/deletion cost

static is_leaf(sound)[source]

Returns whether this sound is a leaf (a phoneme, rather than a sound class)

Parameters:

sound – identifier of a sound

Returns:

meet(*args)[source]

Finds the lowest common ancestors of segments from their identifiers.

Args: several sound identifiers

Returns:

lowest common ancestor identifier

pretty_str(sound, **kwargs)[source]

Returns a pretty string representing a sound.

Parameters:

sound – identifier of a sound

Returns:

pretty string

Return type:

(str)

classmethod read_sounds_file(filename)[source]

Read a sounds table from a file.

Parameters:

filename – path to the file

Returns: (table, shorthands, normalization)

  • table (pd.DataFrame) – normalized sound table

  • shorthands (pd.DataFrame) – table holding shorthands for some concepts

  • normalization (dict) – dictionary of normalizations for identical rows

regex(sound)[source]

Returns a regex representing a sound.

Parameters:

sound – identifier of a sound

Returns:

regex string

Return type:

(str)

segment_form(wordform, resegment=False)[source]

Segment a form into phonemes (either following spaces, or using sound inventory)

Parameters:
  • wordform (str) – phonemic form

  • resegment (bool) – Whether to ignore spaces in phon forms and re-compute phonemic segmentation

Returns:

list of phonemes
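The two segmentation modes can be sketched as follows. This is an illustration, not the method's code: the helper `segment` and its explicit `phonemes` argument are hypothetical, and the longest-match-first regex is an assumption about how an inventory-based segmenter could handle multi-character sounds.

```python
import re


def segment(wordform, phonemes, resegment=False):
    """Split a form into phonemes (hypothetical helper)."""
    if not resegment:
        # Default mode: spaces already delimit the phonemes.
        return wordform.split()
    # Re-segmentation: drop spaces and match known phonemes,
    # trying longer ones first so that multi-character sounds win.
    ordered = sorted(phonemes, key=len, reverse=True)
    segmenter = re.compile("|".join(re.escape(p) for p in ordered))
    return segmenter.findall(wordform.replace(" ", ""))


print(segment("a m E n", set()))
# ['a', 'm', 'E', 'n']
print(segment("t ʃ a t", {"t", "ʃ", "tʃ", "a"}, resegment=True))
# ['tʃ', 'a', 't']
```

In this sketch, characters that match no phoneme are silently skipped by `findall`; a real segmenter would more likely validate the form first.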

shortest(sound, **kwargs)[source]

Returns a string which describes the sound in as few characters as possible.

Parameters:

sound – identifier of a sound

Returns:

short string

Return type:

(str)

show_pool()[source]

Return a string description of the whole segment pool.

similarity(a, b)[source]

Computes phonological similarity (Frisch et al., 2004)

Measure from “Similarity avoidance and the OCP” , Frisch, S. A.; Pierrehumbert, J. B. & Broe, M. B. Natural Language & Linguistic Theory, Springer, 2004, 22, 179-228, p. 198.

We compute similarity by comparing the number of shared and unshared natural classes of two consonants, using the equation in (7). This equation is a direct extension of the Pierrehumbert (1993) feature similarity metric to the case of natural classes.

  1. \(Similarity = \frac{\text{Shared natural classes}}{\text{Shared natural classes } + \text{Non-shared natural classes}}\)
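The equation can be worked through on a toy example. The sketch below assumes natural classes are given explicitly as sets of sounds; the function name `natural_class_similarity` and the five-class list are hypothetical and far smaller than a real inventory's lattice.

```python
def natural_class_similarity(a, b, natural_classes):
    """Shared / (shared + non-shared) natural classes for two sounds."""
    # A class is shared when it contains both sounds.
    shared = sum(1 for c in natural_classes if a in c and b in c)
    # A class is non-shared when it contains exactly one of them.
    non_shared = sum(1 for c in natural_classes if (a in c) != (b in c))
    if shared + non_shared == 0:
        return 0.0
    return shared / (shared + non_shared)


classes = [
    frozenset("ptbd"),  # stops
    frozenset("pt"),    # voiceless stops
    frozenset("bd"),    # voiced stops
    frozenset("pb"),    # labial stops
    frozenset("td"),    # coronal stops
]

print(natural_class_similarity("p", "t", classes))  # 0.5
print(natural_class_similarity("p", "d", classes))  # 0.2
```

Here p and t share two classes and differ on two, while p and d share only the class of all stops and differ on the four others.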

sub_cost(a, b)[source]

Returns the cost of aligning sounds a and b

Parameters:
  • a – sound identifier

  • b – sound identifier

Returns: (float): substitution cost

transformation(a, b)[source]

Find a transformation between a and b.

The transformation is a pair of two maximal sets of segments related by a bijective phonological function.

This function takes a pair of sound identifiers and calculates the function which relates these two segments. It then finds and returns the two maximal sets of segments related by this function.

Example

In French, t -> s can be expressed by a phonological function which changes [-cont] and [-rel. ret] to [+cont] and [+rel. ret]

These other segments are related by the same change: d -> z, b -> v, p -> f

>>> inv = Inventory.from_file("tests/data/frenchipa.csv")
>>> a,b = inv.transformation("t","s")
>>> a == frozenset({'d', 't', 'b', 'p'})
True
>>> b == frozenset({'s', 'z', 'f', 'v'})
True
Parameters:
  • a (str) – Segment identifiers.

  • b (str) – Segment identifiers.

Returns:

two sets of sounds.

Return type:

tuple of frozenset
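The idea can be sketched on a toy feature table, without the concept lattice. Everything here is an assumption for illustration: the `FEATURES` table is not the French inventory used in the doctest, and `toy_transformation` is a hypothetical helper that compares feature dictionaries directly.

```python
FEATURES = {  # Toy feature table (hypothetical values).
    "p": {"voi": "-", "cont": "-", "place": "lab", "nas": "-"},
    "b": {"voi": "+", "cont": "-", "place": "lab", "nas": "-"},
    "t": {"voi": "-", "cont": "-", "place": "cor", "nas": "-"},
    "d": {"voi": "+", "cont": "-", "place": "cor", "nas": "-"},
    "f": {"voi": "-", "cont": "+", "place": "lab", "nas": "-"},
    "v": {"voi": "+", "cont": "+", "place": "lab", "nas": "-"},
    "s": {"voi": "-", "cont": "+", "place": "cor", "nas": "-"},
    "z": {"voi": "+", "cont": "+", "place": "cor", "nas": "-"},
    "m": {"voi": "+", "cont": "-", "place": "lab", "nas": "+"},
}


def toy_transformation(a, b, features=FEATURES):
    """Return the two sets of segments related by the same change as a -> b."""
    fa, fb = features[a], features[b]
    # The function relating a to b: features whose value changes.
    changed = {f: (fa[f], fb[f]) for f in fa if fa[f] != fb[f]}
    by_features = {tuple(sorted(v.items())): k for k, v in features.items()}
    left, right = set(), set()
    for seg, feats in features.items():
        if all(feats[f] == old for f, (old, _) in changed.items()):
            target = dict(feats)
            target.update({f: new for f, (_, new) in changed.items()})
            image = by_features.get(tuple(sorted(target.items())))
            if image is not None:  # Keep only pairs whose image exists.
                left.add(seg)
                right.add(image)
    return frozenset(left), frozenset(right)


a, b = toy_transformation("t", "s")
print(sorted(a), sorted(b))  # ['b', 'd', 'p', 't'] ['f', 's', 'v', 'z']
```

In this toy table m is left out because no segment has the features it would receive after the change, which keeps the mapping bijective.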

qumin.representations.segments.normalize(ipa, features)[source]

Assign a normalized segment to groups of segments with identical rows.

This function takes a segments table and adds in place a “Normalized” column. This column contains a common value for each segment with identical boolean values. The function also returns a translation table mapping indexes to normalized segments.

Note: index values are expected to be one character long.

| Index | ..features.. | Normalized |
|:------|:-------------|:-----------|
| ɛ     | […]          | E          |
| e     | […]          | E          |

Parameters:
  • ipa (pandas.DataFrame) – Dataframe of segments. Columns are features, UNICODE code point representation and segment names, indexes are segments.

  • features (list) – Feature columns’ names.

Returns:

translation table from the segment’s name to its normalized name.

Return type:

dict
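Grouping segments with identical feature rows can be sketched with plain dictionaries. This is not the function's implementation: `normalize_rows` is a hypothetical helper working on a dict instead of a DataFrame, and choosing the first segment of each group as its normalized name is an assumption (the table above shows a dedicated label, E).

```python
def normalize_rows(table):
    """Map each segment to one representative of its identical-feature group."""
    representative = {}
    translation = {}
    for segment, feats in table.items():
        key = tuple(sorted(feats.items()))
        # The first segment seen with these feature values names the group.
        translation[segment] = representative.setdefault(key, segment)
    return translation


toy_table = {  # Hypothetical feature values.
    "ɛ": {"high": 0, "low": 0, "round": 0},
    "e": {"high": 0, "low": 0, "round": 0},
    "a": {"high": 0, "low": 1, "round": 0},
}
print(normalize_rows(toy_table))
# {'ɛ': 'ɛ', 'e': 'ɛ', 'a': 'a'}
```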

qumin.representations.segments.shorten_feature_names(table)[source]
qumin.representations.segments.sound_lattice_context(dataframe)[source]

Create a Context from a dataframe of properties.

Parameters:

dataframe (pandas.DataFrame) – A dataframe of sound / features incidence: -1 means not applicable, 1 means +feature, 0 means -feature.

Returns:

the created Context

Return type:

concepts.Context
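The three-valued encoding can be read as a set of signed properties per sound. The sketch below shows one way to derive them, as an assumption about the conversion rather than the function's actual code; `row_to_properties` is a hypothetical helper and the feature names are made up.

```python
def row_to_properties(row):
    """Turn {feature: -1 | 0 | 1} into a set of '+feature' / '-feature' labels."""
    signs = {1: "+", 0: "-"}
    # Features valued -1 are not applicable and yield no property.
    return {signs[value] + name for name, value in row.items() if value in signs}


print(sorted(row_to_properties({"voi": 1, "nas": 0, "lat": -1})))
# ['+voi', '-nas']
```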

Module contents

author: Sacha Beniamine.

Utility functions for representations.

qumin.representations.create_features(md, feature_cols)[source]

Read features and preprocess them to be coindexed with paradigms.