qumin.representations package¶
Submodules¶
qumin.representations.alignment module¶
author: Sacha Beniamine.
This module is used to align sequences.
- qumin.representations.alignment.align_auto(s1, s2, insert_cost, sub_cost, distance_only=False, fillvalue='', **kwargs)[source]¶
Return all the best alignments of two words according to some edit distance matrix.
- Parameters:
s1 (str) – first word to align
s2 (str) – second word to align
insert_cost (Callable) – A function which takes one value and returns an insertion cost
sub_cost (Callable) – A function which takes two values and returns a substitution cost
distance_only (bool) – defaults to False. If True, returns only the best distance. If False, returns an alignment.
fillvalue – (optional) the value with which to pad when iterables have varying lengths. Default: “”.
- Returns:
Either an alignment (a list of lists of zipped tuples), or a distance (if distance_only is True).
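As a rough illustration of how the two cost callables drive such an alignment, the sketch below computes only the distance, using a standard dynamic-programming table. The function name `edit_distance` and the unit costs are assumptions made for this example; this is not Qumin's implementation.

```python
def edit_distance(s1, s2, insert_cost, sub_cost):
    """Weighted edit distance between two sequences (illustrative sketch)."""
    rows, cols = len(s1) + 1, len(s2) + 1
    # table[i][j] holds the cheapest cost of aligning s1[:i] with s2[:j].
    table = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        table[i][0] = table[i - 1][0] + insert_cost(s1[i - 1])
    for j in range(1, cols):
        table[0][j] = table[0][j - 1] + insert_cost(s2[j - 1])
    for i in range(1, rows):
        for j in range(1, cols):
            table[i][j] = min(
                table[i - 1][j] + insert_cost(s1[i - 1]),  # gap in s2
                table[i][j - 1] + insert_cost(s2[j - 1]),  # gap in s1
                table[i - 1][j - 1] + sub_cost(s1[i - 1], s2[j - 1]),
            )
    return table[-1][-1]


# Hypothetical unit costs: an insertion costs 1, a mismatch costs 1.
def unit_insert(segment):
    return 1.0


def unit_sub(a, b):
    return 0.0 if a == b else 1.0


distance = edit_distance("mɪs", "mɪst", unit_insert, unit_sub)
```

Passing the costs as callables keeps the table-filling logic independent of how segments are compared, which is presumably why `align_auto` takes `insert_cost` and `sub_cost` as arguments.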
- qumin.representations.alignment.align_baseline(*args, **kwargs)[source]¶
Simple alignment intended as an inflectional baseline. (Albright & Hayes 2002)
A single change, either prefixal, suffixal, or infixal. This doesn’t work well when there is both a prefix and a suffix. Used as a baseline for evaluation of the auto-aligned patterns.
See “Modeling English Past Tense Intuitions with Minimal Generalization”, Albright, A. & Hayes, B. Proceedings of the ACL-02 Workshop on Morphological and Phonological Learning - Volume 6, Association for Computational Linguistics, 2002, 58-69, page 2:
“The exact procedure for finding a word-specific rule is as follows: given an input pair (X, Y), the model first finds the maximal left-side substring shared by the two forms (e.g., #mɪs), to create the C term (left side context). The model then examines the remaining material and finds the maximal substring shared on the right side, to create the D term (right side context). The remaining material is the change; the non-shared string from the first form is the A term, and from the second form is the B term.”
Examples
>>> align_baseline("mɪs","mas")
[('m', 'm'), ('ɪ', 'a'), ('s', 's')]
>>> align_baseline("mɪs","mɪst")
[('m', 'm'), ('ɪ', 'ɪ'), ('s', 's'), ('', 't')]
>>> align_baseline("mɪs","amɪs")
[('', 'a'), ('m', 'm'), ('ɪ', 'ɪ'), ('s', 's')]
>>> align_baseline("mɪst","amɪs")
[('m', 'a'), ('ɪ', 'm'), ('s', 'ɪ'), ('t', 's')]
- Parameters:
*args – any number of iterables >= 2
fillvalue – the value with which to pad when iterables have varying lengths. Default: “”.
- Returns:
a list of zipped tuples.
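The quoted procedure (shared left context C, shared right context D, change A → B) can be sketched on plain strings as follows. The helper name `split_change` is hypothetical and the code only illustrates the decomposition; it does not reproduce how `align_baseline` builds its zipped tuples.

```python
from os.path import commonprefix


def split_change(x, y):
    """Split a pair of forms into (C, A, B, D) following the quoted procedure."""
    # C: maximal substring shared on the left side.
    c = commonprefix([x, y])
    rest_x, rest_y = x[len(c):], y[len(c):]
    # D: maximal substring shared on the right side of the remaining material.
    d = commonprefix([rest_x[::-1], rest_y[::-1]])[::-1]
    # A and B: what is left of each form once C and D are removed.
    a = rest_x[:len(rest_x) - len(d)]
    b = rest_y[:len(rest_y) - len(d)]
    return c, a, b, d


infix_change = split_change("mɪs", "mas")
suffix_change = split_change("mɪs", "mɪst")
prefix_change = split_change("mɪs", "amɪs")
```

Because the left context is extracted first, a pair that differs at both edges (such as "mɪst" and "amɪs") ends up with an empty C and D, which is the weakness noted in the description.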
- qumin.representations.alignment.align_left(*args, **kwargs)[source]¶
Align left all arguments (wrapper around zip_longest).
Examples
>>> align_left("mɪs","mas")
[('m', 'm'), ('ɪ', 'a'), ('s', 's')]
>>> align_left("mɪs","mɪst")
[('m', 'm'), ('ɪ', 'ɪ'), ('s', 's'), ('', 't')]
>>> align_left("mɪs","amɪs")
[('m', 'a'), ('ɪ', 'm'), ('s', 'ɪ'), ('', 's')]
>>> align_left("mɪst","amɪs")
[('m', 'a'), ('ɪ', 'm'), ('s', 'ɪ'), ('t', 's')]
- Parameters:
*args – any number of iterables >= 2
fillvalue – the value with which to pad when iterables have varying lengths. Default: “”.
- Returns:
a list of zipped tuples, left aligned.
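Since the function is described as a wrapper around `zip_longest`, its behaviour can be approximated with the standard library alone. The name `left_align` below is a stand-in used for this sketch, not the library function itself.

```python
from itertools import zip_longest


def left_align(*iterables, fillvalue=""):
    """Zip the iterables from the left, padding the shorter ones at the end."""
    return list(zip_longest(*iterables, fillvalue=fillvalue))


pairs = left_align("mɪs", "amɪs")
```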
- qumin.representations.alignment.align_multi(*strings, **kwargs)[source]¶
Levenshtein-style alignment over arguments, two by two.
- qumin.representations.alignment.align_right(*iterables, **kwargs)[source]¶
Align right all arguments. Zip longest with right alignment.
Examples
>>> align_right("mɪs","mas")
[('m', 'm'), ('ɪ', 'a'), ('s', 's')]
>>> align_right("mɪs","mɪst")
[('', 'm'), ('m', 'ɪ'), ('ɪ', 's'), ('s', 't')]
>>> align_right("mɪs","amɪs")
[('', 'a'), ('m', 'm'), ('ɪ', 'ɪ'), ('s', 's')]
>>> align_right("mɪst","amɪs")
[('m', 'a'), ('ɪ', 'm'), ('s', 'ɪ'), ('t', 's')]
- Parameters:
*iterables – any number of iterables >= 2
fillvalue – the value with which to pad when iterables have varying lengths. Default: “”.
- Returns:
a list of zipped tuples, right aligned.
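One way to obtain a right alignment with the standard library is to reverse each input, zip from the left, and reverse the result. The sketch below uses the hypothetical name `right_align` and is only an assumption about how such a function can be written.

```python
from itertools import zip_longest


def right_align(*iterables, fillvalue=""):
    """Zip the iterables from the right, padding the shorter ones at the start."""
    reversed_inputs = [list(iterable)[::-1] for iterable in iterables]
    zipped = list(zip_longest(*reversed_inputs, fillvalue=fillvalue))
    return zipped[::-1]


pairs = right_align("mɪs", "mɪst")
```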
- qumin.representations.alignment.commonprefix(*args)[source]¶
Given a list of strings, returns the longest common prefix
qumin.representations.contexts module¶
author: Sacha Beniamine.
This module implements patterns’ contexts, which are series of phonological restrictions.
- class qumin.representations.contexts.Context(segments, inv)[source]¶
Bases: object
Context for an alternation pattern
qumin.representations.frequencies module¶
author: Jules Bouton.
Class for frequency management.
- class qumin.representations.frequencies.Frequencies(package, *args, source=False, **kwargs)[source]¶
Bases: object
Frequency management for a Paralex dataset. Frequencies are built for forms, lexemes and cells.
- The parsed frequency columns or tables should conform to the Paralex principles:
An empty value means that there is no measure available
A zero value means that there is a measure, which is zero
When aggregating across rows, any empty cell yields a uniform distribution for the whole set of rows, whereas zeros are taken into account. This behaviour can be disabled for some functions by passing skipna=True.
Examples
>>> p = fl.Package('tests/data/TestPackage/test.package.json')
>>> f = Frequencies(p)
>>> print(f.info().to_markdown())
| Table   | Source      |   Records |   Sum(f) |   Mean(f) |
|:--------|:------------|----------:|---------:|----------:|
| forms   | forms_table |        22 |      519 |   27.3158 |
| lexemes | forms_table |         4 |      519 |  129.75   |
| cells   | forms_table |         4 |      519 |  129.75   |
- Variables:
p (frictionless.Package) – package to analyze
source (Dict[str, str]) – source used by default for each table. Contains either a value for the source field of a Paralex frequency table, or the name of the table used to extract the frequency.
forms (pandas.DataFrame) – Table of frequency values associated to a form_id.
lexemes (pandas.DataFrame) – Table of frequency values associated to a lexeme_id.
cells (pandas.DataFrame) – Table of frequency values associated to a cell_id.
- __init__(package, *args, source=False, **kwargs)[source]¶
Constructor for Frequencies. We gather and store frequencies for forms, lexemes and cells. Behaviour is the following:
If force_uniform is True, we use the paradigms table to generate a Uniform distribution.
If not, we try to get a frequency column from the tables: form, lexemes, cell
If any of those is missing, we use the frequencies table.
If we can’t use the frequency table, we fall back to a uniform.
- col_names = ['lexeme', 'cell', 'form']¶
- drop_unused(paradigms)[source]¶
If the paradigms table implied some sampling / filtering, make sure that the frequencies are also sampled.
- get_absolute_freq(mean=False, group_on=False, skipna=False, **kwargs)[source]¶
Return the frequency of an item for a given source
The frequency of an item is defined as the sum of the frequencies of this item across all rows.
Examples
>>> p = fl.Package('tests/data/TestPackage/test.package.json')
>>> f = Frequencies(p)
>>> f.get_absolute_freq(filters={'lexeme':'q'}, group_on="index", skipna=True)
form
11    12.0
12     6.0
14    20.0
18     NaN
23    20.0
Name: value, dtype: float64
>>> float(f.get_absolute_freq(filters={'lexeme':'q'}))
nan
>>> float(f.get_absolute_freq(filters={'cell':'third'}, mean=True, skipna=True))
20.0
>>> f.get_absolute_freq(group_on=['lexeme'])
lexeme
k    203.0
p      NaN
q      NaN
s     63.0
Name: value, dtype: float64
- Parameters:
- Returns:
- a Series which contains the output values.
The index is either the original one, or the grouping columns.
- Return type:
pandas.Series
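The distinction between empty and zero values can be illustrated without pandas. In this minimal sketch, `None` stands for an empty (unmeasured) value, and the function name `absolute_freq` and the toy data are assumptions for the example rather than the class's actual code.

```python
import math


def absolute_freq(values, skipna=False):
    """Sum frequencies; None marks an empty value, 0 is a real measure."""
    known = [value for value in values if value is not None]
    if not known:
        return math.nan  # nothing was measured at all
    if not skipna and len(known) < len(values):
        return math.nan  # one empty value makes the total unknown
    return float(sum(known))


total_strict = absolute_freq([12, 6, 20, None, 20])
total_skipna = absolute_freq([12, 6, 20, None, 20], skipna=True)
```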
- get_relative_freq(group_on=False, uniform_duplicates=False, **kwargs)[source]¶
Returns the relative frequencies of a set of rows according to a set of grouping columns. If any of the values is empty, we generate a Uniform distribution for this group.
Note
To avoid long computations, we use C implementations. Unfortunately, skipna is not yet implemented in GroupBy.sum. For this reason, we use a more complex pipeline of C functions.
Examples
>>> p = fl.Package('tests/data/TestPackage/test.package.json')
>>> f = Frequencies(p)
>>> f.get_relative_freq(filters={'lexeme': 'p', 'cell':'first'}, group_on=["lexeme"])['result'].values
array([0.05882353, 0.94117647])
>>> f.get_relative_freq(filters={'lexeme': 's', 'cell':'second'}, group_on=["lexeme"])['result'].values
array([0., 1.])
>>> f.get_relative_freq(filters={'cell':"third"}, group_on=["cell"])['result'].values
array([0.25, 0.25, 0.25, 0.25])
>>> f.get_relative_freq(filters={'lexeme':'p'}, group_on=["lexeme", "cell"])['result'].values
array([0.05882353, 0.94117647, 1.        , 1.        , 1.        ])
>>> f.get_relative_freq(filters={'lexeme':'s', 'cell': 'first'}, group_on=["lexeme", "cell"]).result.values
array([0.33333333, 0.33333333, 0.33333333])
- Parameters:
- Returns:
- a DataFrame which contains a result column with the output value.
The index is the original one. The grouping columns are also provided.
- Return type:
pandas.DataFrame
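The uniform fallback for a single group can be sketched in plain Python. Here `None` marks an empty value; the name `relative_freq` and the extra fallback for an all-zero group are assumptions made for this example, not documented behaviour.

```python
def relative_freq(values):
    """Relative frequencies within one group; None marks an empty value."""
    if not values:
        return []
    total = sum(value for value in values if value is not None)
    if any(value is None for value in values) or total == 0:
        # An empty value yields a uniform distribution for the whole group.
        # (Falling back to uniform when the total is zero is an assumption.)
        return [1 / len(values)] * len(values)
    return [value / total for value in values]


with_zero = relative_freq([0, 5])
with_empty = relative_freq([20, None, 20, 5])
```

A zero is kept as a genuine measure (the first item above gets a relative frequency of 0), whereas a single empty value discards the measured frequencies of the whole group.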
- has_frequencies(table='forms')[source]¶
Returns True if the requested table contains real frequencies.
- Parameters:
table (str) – name of the table to test.
- info()[source]¶
Returns a convenient DataFrame with summary statistics.
- Returns:
A summary of statistics about this Frequencies handler.
- Return type:
pandas.DataFrame
- p = None¶
- source = {'cells': None, 'forms': None, 'lexemes': None}¶
qumin.representations.generalize module¶
author: Sacha Beniamine.
This module is used to generalize patterns’ contexts.
- qumin.representations.generalize.generalize_alt(patterns, inv)[source]¶
Use the generalized alternation, using features when possible rather than segments.
- qumin.representations.generalize.generalize_patterns(pats, inv)[source]¶
Generalize these patterns’ context.
- Parameters:
pats (Iterable[Pattern]) – the patterns to generalize
inv – an Inventory instance
- Returns:
a new pattern
- Return type:
Pattern
- qumin.representations.generalize.incremental_generalize_patterns(pats, inv)[source]¶
Merge patterns incrementally as long as the pattern has the same coverage.
Attempt to merge patterns two by two, and refrain from doing so if the merged pattern doesn’t match all the lexemes that led to its inference. Also attempt to merge together patterns that have not been merged with others.
- Parameters:
pats – the patterns
inv – Inventory instance
- Returns:
a list of patterns, at best of length 1, at worst of the same length as the input.
- Return type:
List[Pattern]
qumin.representations.paradigms module¶
author: Sacha Beniamine and Jules Bouton.
Paradigms class to represent paralex paradigms.
- class qumin.representations.paradigms.Paradigms(dataset, **kwargs)[source]¶
Bases: object
Paradigms with methods to normalize them, merge and restore columns, etc.
- __init__(dataset, **kwargs)[source]¶
Read paradigms data, and prepare it according to a Segment class pool.
- Parameters:
dataset (frictionless.Package) – paralex frictionless Package. All characters occurring in the paradigms except the first column should be inventoried in this class.
kwargs – additional arguments passed to Package.preprocess()
- Returns:
- paradigms table (rows contain forms, lemmas, cells).
- Return type:
paradigms (pandas.DataFrame)
- cells = None¶
- cells_dedup = None¶
- data = None¶
- default_cols = ('lexeme', 'cell', 'phon_form')¶
- preprocess(fillna=True, segcheck=True, defective=False, overabundant=False, cells=None, sample_lexemes=None, sample_cells=None, sample_kws=None, pos=None, resegment=False, lexemes_list=None, **kwargs)[source]¶
- Preprocess a Paralex paradigms table to meet the requirements of Qumin:
Filter by POS and by cells
Filter by frequency, sample
Filter overabundance and defectivity
Merge identical columns
Check segments and create Form() objects
- Parameters:
fillna (bool) – Defaults to True. Should #DEF# be replaced by np.NaN ? Otherwise they are filled with empty strings (“”).
segcheck (bool) – Defaults to True. Should I check that all the phonological segments in the table are defined in the segments table?
defective (bool) – Defaults to False. Should I keep rows with defective forms?
overabundant (bool) – Defaults to False. Should I keep rows with overabundant forms?
cells (List[str]) – List of cell names to consider. Defaults to all.
pos (List[str]) – List of parts of speech to consider. Defaults to all.
lexemes_list (path) – Path to a file containing one lexeme per row.
sample_lexemes (int) – Defaults to None. Should I sample n lexemes (for debug purposes)?
sample_cells (int) – Defaults to None. Should I sample n cells (for debug purposes)?
sample_kws (dict) – Dict of keywords passed to _sample_paradigms().
resegment (bool) – Defaults to False. Should I resegment the paradigms?
qumin.representations.patterns module¶
author: Sacha Beniamine.
This module addresses the modeling of inflectional alternation patterns.
- exception qumin.representations.patterns.NotApplicable[source]¶
Bases: Exception
Raised when a Pattern can’t be applied to a form.
- class qumin.representations.patterns.Pattern(alternation, context, inv)[source]¶
Bases: object
Represent the alternation pattern between two forms.
Applying the pattern to one of the original forms yields the second one.
As an example, we will use the following alternation in a present verb of French:

| cells | Forms | Transcription |
|---|---|---|
| prs.1.sg ⇌ prs.2.pl | j’amène ⇌ vous amenez | amEn ⇌ amənE |
Example
>>> inv = Inventory.from_file("tests/data/frenchipa.csv")
>>> cells = ("prs.1.sg", "prs.2.pl")
>>> forms = (Form("a m E n"), Form("a m Ø n E"))
>>> p = Pattern.from_forms(cells, forms, inv)
>>> type(p)
<class 'qumin.representations.patterns.Pattern'>
>>> p
E_ ⇌ Ø_E / am_n_ <0>
>>> p.apply(Form("a m E n"), cells, inv)
Form(a m Ø n E)
- applicable(form, cell)[source]¶
Test if this pattern matches a form, i.e. if the pattern is applicable to the form.
- apply(form, names, inv, raiseOnFail=True)[source]¶
Apply the pattern to a form.
- Parameters:
form – a form, assumed to belong to the cell names[0].
names – apply to a form of cell names[0] to produce a form of cell names[1] (default: self.cells). Patterns being non-oriented, it is better to use the names argument.
inv (segments.Inventory) – sound inventory
raiseOnFail (bool) – defaults to True. If true, raise an error when the pattern is not applicable to the form. If False, return None instead.
- Returns:
a form belonging to the opposite cell.
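To give an intuition of what applying a pattern involves, the sketch below rewrites a space-separated form with a regular expression whose groups capture the context. The function `apply_alternation`, the hand-written regex and the use of `ValueError` (where the library raises `NotApplicable`) are all assumptions for illustration; the real `Pattern` builds its context from the sound inventory.

```python
import re


def apply_alternation(form, context_regex, replacement, raise_on_fail=True):
    """Rewrite `form` if the context matches it entirely (illustrative only)."""
    match = re.fullmatch(context_regex, form)
    if match is None:
        if raise_on_fail:
            raise ValueError(f"pattern not applicable to {form!r}")
        return None
    return match.expand(replacement)


# Hand-written stand-in for the pattern  E_ ⇌ Ø_E / am_n_
context = r"(a m )E (n)"
produced = apply_alternation("a m E n", context, r"\g<1>Ø \g<2> E")
not_applicable = apply_alternation("p a r l", context, r"\g<1>Ø \g<2> E",
                                   raise_on_fail=False)
```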
- classmethod from_aligned(cells, alignment, inv)[source]¶
Create a pattern from aligned forms (aligns them left)
- Parameters:
cells (Iterable) – Cells labels (str), in the same order.
alignment (Iterable) – Aligned forms (str) to be segmented.
- classmethod from_forms(cells, forms, inv)[source]¶
Create a pattern from unaligned forms (aligns them left)
Example
>>> inv = Inventory.from_file("tests/data/frenchipa.csv")
>>> cells = ("prs.1.sg", "prs.2.pl")
>>> forms = (Form("a m E n"), Form("a m Ø n E"))
>>> p = Pattern.from_forms(cells, forms, inv)
>>> p
E_ ⇌ Ø_E / am_n_ <0>
>>> p.score # is zero at initialization
0
>>> p.lexemes # is empty at initialization
set()
>>> p.alternation
{'prs.1.sg': [('E',), ('',)], 'prs.2.pl': [('Ø',), ('E',)]}
>>> p.context # this is a Context
((?:a )(?:m )){}((?:n )){}
>>> p.cells
('prs.1.sg', 'prs.2.pl')
>>> p._repr
'E_ ⇌ Ø_E / am_n_'
>>> p._feat_str
'E_ ⇌ Ø_E / am_n_'
>>> p._gen_alt == {'prs.1.sg': ((frozenset({'ɑ̃', 'ɛ̃', 'i', 'j', 'E'}),), ('',)),
...                'prs.2.pl': ((frozenset({'ɥ', 'ɔ̃', 'y', 'Ø', 'œ̃'}),), ('E',))}
True
- classmethod from_str(cells, string, inv)[source]¶
Parse an exported pattern.
To be parsed back, patterns need to be exported by repr(), not str().
Note: Phonemes in context classes are now separated by “,”
- Parameters:
- Returns:
a parsed Pattern object.
- Return type:
Example
>>> inv = Inventory.from_file("tests/data/frenchipa.csv")
>>> p = Pattern.from_str(('A', 'B'), "ɥ ⇌ yj / {E,O,a,b,d,f,g,i,j,k,l,m,n,p,s,t,u,v,w,y,z,Ø,ŋ,œ̃,ɑ̃,ɔ̃,ɛ̃,ɥ,ɲ,ʁ,ʃ,ʒ}*{b,d,f,g,k,l,m,n,p,s,t,v,z,ŋ,ɲ,ʁ,ʃ,ʒ}_E <58>", inv)
>>> type(p) is Pattern
True
>>> str(p)
'ɥ ⇌ yj / X*C_E'
>>> p
ɥ ⇌ yj / {E,O,a,b,d,f,g,i,j,k,l,m,n,p,s,t,u,v,w,y,z,Ø,ŋ,œ̃,ɑ̃,ɔ̃,ɛ̃,ɥ,ɲ,ʁ,ʃ,ʒ}*{b,d,f,g,k,l,m,n,p,s,t,v,z,ŋ,ɲ,ʁ,ʃ,ʒ}_E <58.0>
>>> p = Pattern.from_str(('A','B'), "E_ ⇌ Ø_E / am_n_ <0>", inv)
>>> type(p) is Pattern
True
>>> p
E_ ⇌ Ø_E / am_n_ <0.0>
- is_identity()[source]¶
Checks whether this pattern is an identity pattern.
Example
>>> inv = Inventory.from_file("tests/data/frenchipa.csv")
>>> p = Pattern.new_identity(("A", "B"), inv)
>>> p.is_identity()
True
- classmethod new_identity(cells, inv)[source]¶
Identity pattern factory.
The alternation is empty, and the context is a sequence of any number of allowed segments.
- Parameters:
cells – Pair of cells for this pattern.
inv (Inventory) – Sound Inventory.
- Returns:
a new identity pattern.
- Return type:
Example
>>> inv = Inventory.from_file("tests/data/frenchipa.csv")
>>> print(Pattern.new_identity(('A','B'), inv))
⇌ / X*
- to_alt(inv, exhaustive_blanks=True, use_gen=False, **kwargs)[source]¶
Build a string representing the alternation
Example
>>> inv = Inventory.from_file("tests/data/frenchipa.csv")
>>> cells = ("prs.1.sg", "prs.2.pl")
>>> forms = (Form("a m E n"), Form("a m Ø n E"))
>>> p = Pattern.from_forms(cells, forms, inv)
>>> p.alternation
{'prs.1.sg': [('E',), ('',)], 'prs.2.pl': [('Ø',), ('E',)]}
>>> p.to_alt(inv)
'_E_ ⇌ _Ø_E'
>>> p.to_alt(inv, exhaustive_blanks=False)
'E_ ⇌ Ø_E'
>>> p.to_alt(inv, use_gen=True)
'_[-arro]_ ⇌ _[+arro]_E'
- Parameters:
- Returns:
A string representing the alternation, with context positions replaced by the filler “_”.
qumin.representations.quantity module¶
author: Sacha Beniamine.
This module provides Quantity objects to represent quantifiers.
- class qumin.representations.quantity.Quantity(mini, maxi)[source]¶
Bases: object
Represents a quantifier as an interval.
This is a flyweight class and the presets are:

| description | mini | maxi | regex symbol | variable name |
|---|---|---|---|---|
| Match one | 1 | 1 | | quantity.one |
| Optional | 0 | 1 | ? | quantity.optional |
| Some | 1 | inf | + | quantity.some |
| Any | 0 | inf | * | quantity.kleenestar |
| None | 0 | 0 | | |
- qumin.representations.quantity.quantity_largest(args)[source]¶
Reduce on the “&” operator of quantities.
Returns a quantity with the minimum left value and maximum right value.
Example
>>> quantity_largest([Quantity(0,1),Quantity(1,1),Quantity(1,np.inf)])
Quantity(0,inf)
- Argument:
args: an iterable of quantities.
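The reduction described above can be sketched with plain `(mini, maxi)` tuples standing in for `Quantity` objects. The helper names `largest` and `reduce_largest` are hypothetical, and tuples are used only to keep the example self-contained.

```python
import math
from functools import reduce


def largest(q1, q2):
    """Widest interval covering both quantities: min of minis, max of maxis."""
    return (min(q1[0], q2[0]), max(q1[1], q2[1]))


def reduce_largest(quantities):
    """Fold `largest` over an iterable of (mini, maxi) pairs."""
    return reduce(largest, quantities)


widest = reduce_largest([(0, 1), (1, 1), (1, math.inf)])
```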
- qumin.representations.quantity.quantity_sum(args)[source]¶
Reduce on the “+” operator of quantities.
Returns a quantity with the minimum left value and the sum of the right values.
Example
>>> quantity_largest([Quantity(0,1),Quantity(1,1),Quantity(0,0)])
Quantity(0,1)
- Argument:
args: an iterable of quantities.
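Read literally, the description (minimum of the left values, sum of the right values) corresponds to the fold below. As before, `(mini, maxi)` tuples stand in for `Quantity` objects and the helper names are hypothetical; the sketch follows the stated description and has not been checked against the library's `+` operator.

```python
from functools import reduce


def add_quantities(q1, q2):
    """Combine two intervals: smallest minimum, summed maximum."""
    return (min(q1[0], q2[0]), q1[1] + q2[1])


def reduce_sum(quantities):
    """Fold `add_quantities` over an iterable of (mini, maxi) pairs."""
    return reduce(add_quantities, quantities)


summed = reduce_sum([(0, 1), (1, 1), (0, 0)])
```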
qumin.representations.segments module¶
author: Sacha Beniamine.
This module addresses the modeling of phonological segments.
- class qumin.representations.segments.Form(contents, form_id=None)[source]¶
Bases: str
A form is a string of sounds, separated by spaces. If a form is provided as defective, this information is still stored as a Form object with empty content. Defectiveness can be tested with:
>>> inv = Inventory.from_file("tests/data/frenchipa.csv")
>>> Form('').is_defective()
True
By default, we segment by cutting on spaces. If resegment=True, we remove spaces, and segment using the sound inventory’s list of valid phonemes.
Sounds might be more than one character long. Forms are strings, they are segmented at the object creation.
- Variables:
tokens (Tuple) – Tuple of phonemes contained in this form. For defective entries, tokens are an empty tuple.
id (str) – form_id of the corresponding form according to the Paralex package. If unknown, None will be assigned.
- __init__(string, form_id=None)[source]¶
The constructor assumes everything is already clean and normalized
- classmethod from_raw(string, inventory, form_id=None, resegment=False)[source]¶
Use inventory to build a cleaned and normalized Form.
- Parameters:
string – raw string for this form
inventory (segments.Inventory) – sound inventory
form_id – form identifier
resegment (bool) – defaults to False. Whether to re-segment phoneme tokens.
Returns: a formatted Form
- class qumin.representations.segments.Inventory(table, shorthands_table, normalization)[source]¶
Bases: object
The static segments.Inventory class describes a sound inventory.
>>> inv = Inventory.from_file("tests/data/frenchipa.csv")
Each sound class in the inventory is a concept in a FCA lattice. Sound class identifiers are either strings (for phonemes) or frozensets (for sound classes). Phonemes are the leaves of the hierarchy.
Sound classes can be seen as under-determined phonemes, and both phonemes and sound classes are handled in the same way. For this reason, we call both “sound”.
- Variables:
context – the FCA context underlying the feature space
_score_matrix (dict) – a dictionary of sound tuples to alignment score
_gap_score (float) – a score for insertions
_normalization (dict) – a dictionary of sounds to their normalized counterparts
_segmenter (re.Pattern) – a compiled regex to segment words into phonemes
_legal_str (re.Pattern) – a compiled regex to recognize words made of known phonemes
_max (frozenset) – the identifier of the supremum in the lattice
_regexes (dict) – a dictionary of sound IDs to regex strings
_pretty_str (dict) – a dictionary of sound IDs to pretty formatted strings
_features (dict) – a dictionary of sound IDs to set of features
_features_str (dict) – a dictionary of sound IDs to a string representing features
_classes (dict) – a dictionary of sound IDs to a list of classes (ancestors)
- classmethod calc_shorthand(lattice, shorthands)[source]¶
Calculate shorthand names for some lattice nodes.
Ex: ##C## in the sounds table might be a shorthand “C” for all consonants.
- Parameters:
lattice – concept lattice
shorthands – table of shorthand definitions
- Returns:
a dictionary of intents to their shorthand names
- check_validity(lattice, table)[source]¶
Check validity of this sound inventory for Qumin.
Identifies when some segments are ancestors of others (some segments only differ from others through underspecification)
- Parameters:
lattice – concept lattice
table – table of sounds
Raises: Exception
- features(sound, **kwargs)[source]¶
Returns a set of features representing a sound.
- Parameters:
sound – identifier of a sound
- Returns:
features
- Return type:
(set)
- features_str(sound, **kwargs)[source]¶
Returns a string which describes the features of a sound.
- Parameters:
sound – identifier of a sound
- Returns:
features string
- Return type:
(str)
- classmethod from_file(filename)[source]¶
Initializes the inventory
- Parameters:
filename – path to a csv or tsv file with distinctive features
- get(descriptor)[source]¶
Get a sound using the lattice.
- Parameters:
descriptor – iterable of phonemes OR iterable of features
Returns: (str or frozenset) sound identifier
- get_from_transform(a, transform)[source]¶
Get a segment from another according to a transformation tuple.
Example
>>> inv = Inventory.from_file("tests/data/frenchipa.csv")
>>> inv.get_from_transform("d",
...                        (frozenset({"d","t"}),
...                         frozenset({"s","z"})))
'z'
- get_transform_features(left, right)[source]¶
Get the features corresponding to a transformation.
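Conceptually, the features of a transformation are those that hold for every segment on one side but not on the other. The sketch below shows this on a toy feature table; the function `transform_features`, the table `toy_features` and its feature values are invented for the example and do not come from the French inventory file.

```python
def transform_features(left, right, features):
    """Features shared by all of `left` but not `right`, and vice versa."""
    left_common = frozenset.intersection(*(features[s] for s in left))
    right_common = frozenset.intersection(*(features[s] for s in right))
    return left_common - right_common, right_common - left_common


# Toy feature table (hypothetical values, for illustration only).
toy_features = {
    "b": frozenset({"+voi", "+lab", "-cont"}),
    "d": frozenset({"+voi", "+cor", "-cont"}),
    "p": frozenset({"-voi", "+lab", "-cont"}),
    "t": frozenset({"-voi", "+cor", "-cont"}),
}

changed = transform_features({"b", "d"}, {"p", "t"}, toy_features)
```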
Example
>>> inv = Inventory.from_file("tests/data/frenchipa.csv")
>>> inv.get_transform_features({"b","d"}, {"p","t"})
(frozenset({'+voi'}), frozenset({'-voi'}))
- inf(a, b)[source]¶
Checks if a is a descendant of b.
a < b iff b has children and either a is a string which is part of b, or a is a subset of b.
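With phonemes identified by strings and sound classes by frozensets, the condition above can be written as a short predicate. This is a sketch under that assumption; the standalone functions `is_leaf` and `inf` below mirror the method names but are not the library code, and reading "subset" as a proper subset is a guess.

```python
def is_leaf(sound):
    """A phoneme is identified by a string, a sound class by a frozenset."""
    return isinstance(sound, str)


def inf(a, b):
    """True if sound `a` is a descendant of sound `b`."""
    if is_leaf(b):
        return False  # a phoneme has no children
    if is_leaf(a):
        return a in b  # a phoneme belonging to the class b
    return a < b  # a class strictly included in b (assumed proper subset)


stops = frozenset({"p", "t", "b", "d"})
voiceless_stops = frozenset({"p", "t"})
```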
- infos(sound)[source]¶
String giving all useful information on a sound.
- Parameters:
sound – identifier of a sound
- Returns:
pretty string and features of a sound.
- init_dissimilarity_matrix(gap_prop=0.5, **kwargs)[source]¶
Computes score matrix with dissimilarity scores.
- static is_leaf(sound)[source]¶
Returns whether this sound is a leaf (a phoneme, rather than a sound class)
- Parameters:
sound – identifier of a sound
Returns:
- meet(*args)[source]¶
Finds the lowest common ancestors of segments from their identifiers.
Args: several sound identifiers
- Returns:
lowest common ancestor identifier
- pretty_str(sound, **kwargs)[source]¶
Returns a pretty string representing a sound.
- Parameters:
sound – identifier of a sound
- Returns:
pretty string
- Return type:
(str)
- classmethod read_sounds_file(filename)[source]¶
Read a sounds file.
- Parameters:
filename – path to the file
- Returns: (table, shorthands, normalization)
table (pd.DataFrame): normalized sound table
shorthands (pd.DataFrame): table holding shorthands for some concepts
normalization (dict): dictionary of normalizations for identical rows.
- regex(sound)[source]¶
Returns a regex representing a sound.
- Parameters:
sound – identifier of a sound
- Returns:
regex string
- Return type:
(str)
- segment_form(wordform, resegment=False)[source]¶
Segment a form into phonemes (either following spaces, or using sound inventory)
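The two segmentation modes can be sketched as follows: splitting on spaces by default, or removing spaces and matching known phonemes when resegmenting. The function `segment` and the small phoneme set are assumptions for the example; trying longer phonemes first is one plausible way to handle sounds written with several characters.

```python
import re


def segment(wordform, phonemes, resegment=False):
    """Split a form into phonemes, on spaces or with an inventory-based regex."""
    if not resegment:
        return tuple(wordform.split())
    # Try longer phonemes first so that "tʃ" wins over "t" followed by "ʃ".
    ordered = sorted(phonemes, key=len, reverse=True)
    segmenter = re.compile("|".join(re.escape(p) for p in ordered))
    return tuple(segmenter.findall(wordform.replace(" ", "")))


toy_phonemes = {"t", "ʃ", "tʃ", "a"}  # hypothetical inventory
by_spaces = segment("a m E n", toy_phonemes)
by_inventory = segment("tʃa", toy_phonemes, resegment=True)
```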
- shortest(sound, **kwargs)[source]¶
Returns a string which describes the sound in as few characters as possible.
- Parameters:
sound – identifier of a sound
- Returns:
short string
- Return type:
(str)
- similarity(a, b)[source]¶
Computes phonological similarity (Frisch, 2004)
Measure from “Similarity avoidance and the OCP” , Frisch, S. A.; Pierrehumbert, J. B. & Broe, M. B. Natural Language & Linguistic Theory, Springer, 2004, 22, 179-228, p. 198.
We compute similarity by comparing the number of shared and unshared natural classes of two consonants, using the equation in (7). This equation is a direct extension of the Pierrehumbert (1993) feature similarity metric to the case of natural classes.
\(Similarity = \frac{\text{Shared natural classes}}{\text{Shared natural classes } + \text{Non-shared natural classes}}\)
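The ratio can be computed directly once natural classes are available as sets of segments. In this sketch the classes are a small hand-made list and the function is a standalone stand-in; in the library, the classes come from the inventory's lattice.

```python
def similarity(a, b, natural_classes):
    """Shared natural classes over shared plus non-shared natural classes."""
    classes_a = {c for c in natural_classes if a in c}
    classes_b = {c for c in natural_classes if b in c}
    shared = len(classes_a & classes_b)
    non_shared = len(classes_a ^ classes_b)  # classes containing only one of them
    return shared / (shared + non_shared)


# Hypothetical natural classes over four stops.
toy_classes = [
    frozenset({"p", "t", "b", "d"}),
    frozenset({"p", "t"}),
    frozenset({"b", "d"}),
    frozenset({"p", "b"}),
    frozenset({"t", "d"}),
]

sim_p_t = similarity("p", "t", toy_classes)
sim_p_d = similarity("p", "d", toy_classes)
```

The sketch assumes that each segment belongs to at least one class; otherwise the denominator would be zero.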
- sub_cost(a, b)[source]¶
Returns the cost of aligning sounds a and b
- Parameters:
a – sound identifier
b – sound identifier
Returns: (float): substitution cost
- transformation(a, b)[source]¶
Find a transformation between a and b.
The transformation is a pair of two maximal sets of segments related by a bijective phonological function.
This function takes a pair of sound identifiers and calculates the function which relates these two segments. It then finds and returns the two maximal sets of segments related by this function.
Example
In French, t -> s can be expressed by a phonological function which changes [-cont] and [-rel. ret] to [+cont] and [+rel. ret]
These other segments are related by the same change: d -> z, b -> v, p -> f
>>> inv = Inventory.from_file("tests/data/frenchipa.csv")
>>> a,b = inv.transformation("t","s")
>>> a == frozenset({'d', 't', 'b', 'p'})
True
>>> b == frozenset({'s', 'z', 'f', 'v'})
True
- qumin.representations.segments.normalize(ipa, features)[source]¶
Assign a normalized segment to groups of segments with identical rows.
This function takes a segments table and adds in place a “Normalized” column. This column contains a common value for each segment with identical boolean values. The function also returns a translation table mapping indexes to normalized segments.
Note: indexes are expected to be one character long.

| Index | ..features.. | Normalized |
|---|---|---|
| ɛ | […] | E |
| e | […] | E |
- Parameters:
ipa (pandas.DataFrame) – Dataframe of segments. Columns are features, UNICODE code point representation and segment names, indexes are segments.
features (list) – Feature columns’ names.
- Returns:
- translation table from the segment’s name to its normalized name.
- Return type:
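The grouping step can be illustrated without pandas: segments whose feature rows are identical are mapped to one shared name. The function `normalize_segments`, the toy rows and the choice of the smallest member as representative are assumptions for this sketch; the table above shows the library assigning a different label ("E").

```python
from collections import defaultdict


def normalize_segments(feature_rows):
    """Map each segment to a representative of its group of identical rows."""
    groups = defaultdict(list)
    for segment, row in feature_rows.items():
        groups[tuple(row)].append(segment)
    translation = {}
    for members in groups.values():
        representative = min(members)  # hypothetical choice of common name
        for segment in members:
            translation[segment] = representative
    return translation


# Toy feature rows: "ɛ" and "e" are not distinguished by these features.
toy_rows = {"ɛ": (1, 0), "e": (1, 0), "a": (0, 1)}
translation = normalize_segments(toy_rows)
```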
- qumin.representations.segments.sound_lattice_context(dataframe)[source]¶
Create a Context from a dataframe of properties.
- Parameters:
dataframe (pandas.DataFrame) – A dataframe of sound / features incidence: -1 means not applicable, 1 means +feature, 0 means -feature.
- Returns:
the created Context
- Return type:
Module contents¶
author: Sacha Beniamine.
Utility functions for representations.