qumin.predictability package
Submodules
qumin.predictability.distribution module
author: Sacha Beniamine.
Encloses distribution of patterns on paradigms.
- class qumin.predictability.distribution.PatternDistribution(patterns, dataset, frequencies, features=None)
Bases: object
Statistical distribution of patterns.
- Variables:
patterns (ParadigmPatterns) – A dict of pandas.DataFrame, where each row describes an alternation between forms belonging to two different cells of the same lexeme. The row also contains the correct pattern and the set of applicable patterns.
data (dict[int, pandas.DataFrame]) – dict mapping n to a dataframe containing the entropies for the distribution \(P(c_{1}, ..., c_{n} \to c_{n+1})\).
name (str) – Name of the dataset.
- __init__(patterns, dataset, frequencies, features=None)
Constructor for PatternDistribution.
- Parameters:
patterns (ParadigmPatterns) – A dict of pandas.DataFrame, where each row describes an alternation between forms belonging to two different cells of the same lexeme. The row also contains the correct pattern and the set of applicable patterns.
dataset (frictionless.Package) – Paralex dataset metadata.
frequencies (Frequencies) – The frequencies for the paradigms.
features – optional table of features
- add_features(group)
Adds lexeme features if available to a DataFrame containing a column named “applicable” and lexemes as indexes.
- Parameters:
group (pandas.DataFrame) – a dataframe of lexemes and applicable patterns.
- add_measures(*args, **kwargs)
Adds data to the existing measures.
- Parameters:
args (pandas.DataFrame) – DataFrames to add.
kwargs – optional keyword arguments to pass to pandas.concat().
- check_zeros(n)
Check whether:
- We computed entropies for n-1 predictors
- Some of these are 0s and don’t need to be computed for n predictors.
- cond_metrics_log(group, classes, cells, subset=None, legacy=False)
Print a log of the probability distribution for one predictor.
Writes down the distributions \(P( patterns_{c1, c2} | classes_{c1, c2} )\) for all unordered pairs of columns in patterns. Also writes the entropy of the distributions.
- export_file(filename)
Export the data DataFrame to a file.
- Parameters:
filename – the file’s path.
- get_mean(**kwargs)
Returns the average measures from the current run. If cell frequencies are available, they will be used.
- Parameters:
**kwargs – Keyword arguments are passed to get_results()
Returns: mean (pandas.Series)
- get_results(measure=None, n=1)
Returns computation results from a distribution of patterns.
- Parameters:
- Returns:
a DataFrame of results.
- Return type:
pandas.DataFrame
- get_weights(data, tokens=False)
Returns weights computed from cell frequencies for pairs of cells.
The probability of a pair of cells is the product of the probability of the predictors and the probability of the target. The target is always distinct from the predictors.
Let \(\{A_1, \dots A_n\}\) be the random variables describing the drawing of \(n\) predictors and \(B\) the random variable describing the drawing of a target cell.
- One can write the following generic formula and rewrite it with Bayes’ theorem:
- \[\begin{split}\begin{align} P(\textrm{pred} \to \textrm{target}) &= P(\{A_1, \dots A_n\} = \textrm{pred}) P(B = \textrm{target}\mid B \notin \textrm{pred})\\ &= P(\{A_1, \dots A_n\} = \textrm{pred}) \frac{P(B = \textrm{target}\cap B \notin \textrm{pred})}{P(B \notin \textrm{pred})} \end{align}\end{split}\]
- Since \(B = \textrm{target} \subset B \notin \textrm{pred}\), one can write:
- \[\begin{split}\begin{align} P(\textrm{pred} \to \textrm{target}) &= P(\{A_1, \dots A_n\} = \textrm{pred}) \frac{P(B = \textrm{target})}{P(B \notin \textrm{pred})}\\ &= P(\{A_1, \dots A_n\} = \textrm{pred}) \frac{P(B = \textrm{target})}{1 - \sum_{a\in \textrm{pred}}P(a)} \end{align}\end{split}\]
We now need to estimate \(P(\textrm{pred})\) and \(P(\textrm{target})\). Let \(f_i\) be the frequency of cell \(i\) and \(f\) the cumulated frequency of all cells.
- In the simplest case with one predictor, the formula can be simplified to:
- \[\begin{split}\begin{align} P(\textrm{pred} \to \textrm{target}) &= P(A = \textrm{pred}) \frac{P(B = \textrm{target})}{P(B \neq \textrm{pred})}\\ &= \frac{f_\textrm{pred} f_\textrm{target}}{f f_\overline{\textrm{pred}}} \end{align}\end{split}\]
In the more complex case with \(n\) predictors, we need to estimate \(P(\{A_1, \dots A_n\} = \textrm{pred})\).
Let us consider:
\(S\) the set of all cells,
\(C^n_S\) the set of all unordered combinations of \(n\) cells.
- For instance, if \(S=\{A, B, C\}\) and \(n=2\), then:
- \[C^n_S = \{\{A, B\}, \{A, C\}, \{B, C\}\}\]
If we draw random combinations of \(n\) cells, how often are we going to draw each item of \(C^n_S\)?
- This value is:
- \[\begin{split}\begin{align} P(\{A_1, \dots A_n\} = \textrm{pred}) &= \frac{n!\prod_{a\in \textrm{pred}}P(a)} {\sum_{c\in C_S^n} n! \prod_{a\in c}P(a)} \\ &= \frac{\prod_{a\in \textrm{pred}} f_a}{\sum_{c\in C_S^n} \prod_{a\in c}f_a} \end{align}\end{split}\]
- Finally:
- \[\begin{split}\begin{align} P(\textrm{pred} \to \textrm{target}) &= P(\{A_1, \dots A_n\} = \textrm{pred}) \frac{P(B = \textrm{target})}{1 - \sum_{a\in \textrm{pred}}P(a)}\\ &= \frac{f_\textrm{target}}{f}\Big(\frac{\prod_{a\in \textrm{pred}} f_a} {\sum_{c\in C_S^n} \prod_{a\in c}f_a \times (1-\sum_{a\in \textrm{pred}}\frac{f_a}{f})}\Big) \end{align}\end{split}\]
Notice that the second part of the final formula does not depend on \(B\) and can be computed for the predictors beforehand. We then compute the product of this with the relative frequency of the target cell.
- Parameters:
data (pandas.DataFrame) – the full computation results.
tokens (boolean) – Whether the cell token frequencies should be used for weighting. Defaults to False.
- Returns:
two arrays containing the probability of the pairs and the probability of the predictors (numpy.ndarray)
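The derivation above can be illustrated with a small standalone sketch. It is not Qumin's implementation: the function name `pair_weights` and the plain `dict` of cell frequencies are assumptions made for this example, and it assumes \(n\) is smaller than the number of cells so that at least one target remains.

```python
from itertools import combinations
from math import prod


def pair_weights(freqs, n=1):
    """Return {(predictors, target): probability} following the final formula.

    `freqs` maps each cell name to its frequency (hypothetical input format).
    Assumes n < len(freqs), so that every predictor set leaves a target.
    """
    total = sum(freqs.values())
    combos = list(combinations(sorted(freqs), n))
    # Shared denominator: sum over C^n_S of the product of cell frequencies.
    norm = sum(prod(freqs[a] for a in c) for c in combos)
    weights = {}
    for pred in combos:
        # Predictor part: does not depend on the target, computed once per set.
        p_pred = prod(freqs[a] for a in pred) / norm
        p_not_pred = 1 - sum(freqs[a] for a in pred) / total
        pred_part = p_pred / p_not_pred
        for target in freqs:
            if target not in pred:
                # Multiply by the relative frequency of the target cell.
                weights[(pred, target)] = pred_part * freqs[target] / total
    return weights


freqs = {"A": 5, "B": 3, "C": 2}
w1 = pair_weights(freqs, n=1)
w2 = pair_weights(freqs, n=2)
```

With one predictor, the weight of A → B reduces to \(f_A f_B / (f f_{\overline{A}})\), and for any \(n\) the weights of all (predictors, target) pairs sum to 1.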
- import_file(filename)
Read already computed entropies from a file.
- Parameters:
filename (str) – the file’s path.
- n_preds_condent(df, paradigms, zeros, n)
Computes the probability distribution for n predictors.
Writes down the distributions:
\[P( patterns_{c1, c3}, \; \; patterns_{c2, c3} \; \; | classes_{c1, c3}, \; \; \; \; classes_{c2, c3}, \; \; patterns_{c1, c2} )\]
The result contains entropy \(H(c_{1}, ..., c_{n} \to c_{n+1} )\).
Values are computed for all unordered combinations of \((c_{1}, ..., c_{n+1})\) in the columns of paradigms. Indexes are tuples \((c_{1}, ..., c_{n})\) and columns are the predicted cells \(c_{n+1}\).
Example
For three cells c1, c2, c3, (n=2) entropy of c1, c2 → c3, noted \(H(c_{1}, c_{2} \to c_{3})\) is:
\[H( patterns_{c1, c3}, \; \; patterns_{c2, c3}\; \; | classes_{c1, c3}, \; \; \; \; classes_{c2, c3}, \; \; patterns_{c1, c2} )\]
- Parameters:
n (int) – number of predictors.
df (pandas.DataFrame) – a DataFrame containing patterns and applicable patterns for pairs of forms.
paradigms (pandas.DataFrame) – a DataFrame of paradigms.
zeros (dict) – a dictionary of pairs that lead to an entropy of zero.
- n_preds_condent_log(df, paradigms, n)
Computes the probability distribution for n predictors and logs the details of the computations.
Writes down the distributions:
\[P( patterns_{c1, c3}, \; \; patterns_{c2, c3} \; \; | classes_{c1, c3}, \; \; \; \; classes_{c2, c3}, \; \; patterns_{c1, c2} )\]
The result contains entropy \(H(c_{1}, ..., c_{n} \to c_{n+1} )\).
Values are computed for all unordered combinations of \((c_{1}, ..., c_{n+1})\) in the columns of paradigms. Indexes are tuples \((c_{1}, ..., c_{n})\) and columns are the predicted cells \(c_{n+1}\).
Example
For three cells c1, c2, c3, (n=2) entropy of c1, c2 → c3, noted \(H(c_{1}, c_{2} \to c_{3})\) is:
\[H( patterns_{c1, c3}, \; \; patterns_{c2, c3}\; \; | classes_{c1, c3}, \; \; \; \; classes_{c2, c3}, \; \; patterns_{c1, c2} )\]
- Parameters:
n (int) – number of predictors.
df (pandas.DataFrame) – a DataFrame containing patterns and applicable patterns for pairs of forms.
paradigms (pandas.DataFrame) – a DataFrame of paradigms.
- n_preds_entropy(n, paradigms, debug=False)
Wrapper to prepare the computation of n-ary entropies.
Loops through the cells and runs the computations for every set of predictors.
- Parameters:
n (int) – number of predictors.
paradigms (pandas.DataFrame) – a DataFrame of paradigms
debug (bool) – Whether to run a debug computation with full log. Defaults to False.
- one_pred_metrics(legacy=False, debug=False, **kwargs)
Return a pandas.DataFrame with unary entropies and counts of lexemes.
The result contains entropy \(H(c_{1} \to c_{2})\).
Values are computed for all unordered pairs of columns \((c_{1}, c_{2})\) where \(c_{1} \neq c_{2}\) in the keys of PatternDistribution.patterns.
Example
For two cells c1, c2, entropy of c1 → c2, noted \(H(c_{1} \to c_{2})\) is:
\[H( \textrm{patterns}_{c1, c2} | \textrm{classes}_{c1, c2} )\]
The probability distribution of the patterns, on which this entropy is computed, is established on the probability distribution of the pairs of forms that instantiate the pattern. For the mathematical formalism, refer to the appendix of Bouton & Bonami 2026.
Module contents
- qumin.predictability.P(x, weights=None, subset=None)
Return the probability distribution of unique elements in a pandas.core.series.Series. The default is a uniform probability distribution, where each token in `x` has the same probability. If weights are provided, they will be used as the probability of the tokens.
Example
>>> P(pd.Series(["A", "B", "B"]))
A    0.333333
B    0.666667
Name: proportion, dtype: float64
>>> P(pd.Series(["A", "B", "B"]), weights=pd.Series([2, 1, 1]))
A    0.5
B    0.5
dtype: float64
- Parameters:
x (pandas.core.series.Series) – A series of data.
weights (pandas.core.series.Series) – A series of weights.
subset (Iterable) – Only give the distribution for a subset of values.
- Returns:
A Series whose index contains x’s unique elements and whose values are their probabilities in x.
- Return type:
pandas.core.series.Series
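The behaviour described above can be approximated with plain pandas calls. This is a minimal sketch, not the library's code: the name `distribution` is hypothetical, the `subset` argument is left out, and normalising the summed weights is an assumption based on the documented example output.

```python
import pandas as pd


def distribution(x, weights=None):
    """Probability of each unique value in x (uniform over tokens unless weighted)."""
    if weights is None:
        # Each token counts equally: relative frequency of each unique value.
        return x.value_counts(normalize=True).sort_index()
    # Sum the weights of all tokens sharing a value, then normalise to 1.
    summed = weights.groupby(x).sum()
    return summed / summed.sum()


x = pd.Series(["A", "B", "B"])
uniform = distribution(x)
weighted = distribution(x, weights=pd.Series([2, 1, 1]))
```

Here `uniform` gives A one third and B two thirds of the probability mass, while `weighted` gives 0.5 to each, matching the documented example.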
- qumin.predictability.cond_P(A, B, subset=None)
Return the conditional probability distribution P(A|B) for elements in two pandas.core.series.Series.
- Parameters:
A (pandas.core.series.Series) – A series of data.
B (pandas.core.series.Series) – A series of data.
subset (Iterable) – Only give the distribution for a subset of values.
- Returns:
A Series with two index levels. The first level comes from the elements of B, the second from the elements of A. The values are the probabilities P(A|B).
- Return type:
pandas.core.series.Series
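A conditional distribution with this two-level shape can be obtained by grouping A by B and normalising within each group. The sketch below only illustrates the idea: `conditional_distribution` is a hypothetical name, `subset` is not handled, and the actual function may be implemented differently.

```python
import pandas as pd


def conditional_distribution(A, B):
    """P(A|B) as a Series indexed by (value of B, value of A)."""
    # Within each group of B, compute the relative frequency of each value of A.
    return A.groupby(B).value_counts(normalize=True)


A = pd.Series(["x", "x", "y", "z"])
B = pd.Series(["b1", "b1", "b1", "b2"])
cond = conditional_distribution(A, B)
```

In this toy data, `cond.loc[("b1", "x")]` is two thirds, and the probabilities sum to 1 within each value of B.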
- qumin.predictability.cond_entropy(A, B, **kwargs)
Calculate the conditional entropy of A knowing B, two series of data points.
Presupposes that values in the series are of the same type, typically tuples.
- Parameters:
A (pandas.core.series.Series) – A series of data.
B (pandas.core.series.Series) – A series of data.
- Returns:
H(A|B)
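One standard way to obtain \(H(A|B)\) is the chain rule \(H(A|B) = H(A, B) - H(B)\). The sketch below applies it to two small series. It is not the library's implementation: the names `entropy` and `conditional_entropy` are hypothetical, probabilities are unweighted relative frequencies, and the base-2 logarithm (entropy in bits) is an assumption.

```python
import numpy as np
import pandas as pd


def entropy(p):
    """Shannon entropy, in bits, of a Series of probabilities."""
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())


def conditional_entropy(A, B):
    """H(A|B) = H(A, B) - H(B), with probabilities estimated from counts."""
    # Joint distribution of the (A, B) pairs and marginal distribution of B.
    joint = pd.DataFrame({"A": A, "B": B}).value_counts(normalize=True)
    marginal = B.value_counts(normalize=True)
    return entropy(joint) - entropy(marginal)


# Toy data: class c1 allows two patterns, class c2 only one.
patterns = pd.Series(["p1", "p2", "p3", "p3"])
classes = pd.Series(["c1", "c1", "c2", "c2"])
h = conditional_entropy(patterns, classes)
```

Half of the lexemes fall in a class with one bit of uncertainty and the other half in a class with none, so `h` is 0.5 bits.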
- qumin.predictability.cond_entropy_slow(df, classes, subset=None)
Calculate the conditional entropy through a slower method (with iterations across all groups).
- Parameters:
df (pandas.DataFrame) – the patterns distribution.
classes (pandas.Series) – the known features that are used to group the patterns.
Uses token frequencies to weight the patterns.
- qumin.predictability.cond_psuccess(df, classes)
Calculate the conditional probability of success.
- Parameters:
df (pandas.DataFrame) – the patterns distribution.
classes (pandas.Series) – the known features that are used to group the patterns.
Uses token frequencies to weight the patterns.