qumin.predictability package¶

Submodules¶

qumin.predictability.distribution module¶

author: Sacha Beniamine.

Encloses the distribution of patterns over paradigms.

class qumin.predictability.distribution.PatternDistribution(patterns, dataset, frequencies, features=None)[source]¶

Bases: object

Statistical distribution of patterns.

Variables:

patterns (ParadigmPatterns) – A dict of pandas.DataFrame, where each row describes an alternation between forms belonging to two different cells of the same lexeme. The row also contains the correct pattern and the set of applicable patterns.
data (dict[int, pandas.DataFrame]) – dict mapping n to a dataframe containing the entropies for the distribution \(P(c_{1}, ..., c_{n} → c_{n+1})\).
name (str) – Name of the dataset.

__init__(patterns, dataset, frequencies, features=None)[source]¶

Constructor for PatternDistribution.

Parameters:

patterns (ParadigmPatterns) – A dict of pandas.DataFrame, where each row describes an alternation between forms belonging to two different cells of the same lexeme. The row also contains the correct pattern and the set of applicable patterns.
dataset (frictionless.Package) – Paralex dataset metadata.
frequencies (Frequencies) – The frequencies for the paradigms.
features – optional known features for conditional probabilities.

add_features(group)[source]¶

Adds features, if available, to a DataFrame containing a column named “applicable” and forms in the “form_x” column.

Parameters: group (pandas.DataFrame) – a dataframe of lexemes and applicable patterns.

add_measures(*args, **kwargs)[source]¶

Adds data to the existing measures.

Parameters:

args (pandas.DataFrame) – DataFrames to add.
kwargs – optional keyword arguments to pass to pandas.concat().

check_zeros(n)[source]¶

Check whether:

We computed entropies for n-1 predictors.
Some of these are 0s and don’t need to be computed for n predictors.

Parameters: n (int) – number of predictors currently computed.
Returns: a dictionary of pairs that lead to an entropy of zero.
Return type: dict
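
The idea behind this check can be sketched as follows. This is only an illustration under stated assumptions, not Qumin's implementation: the function name `zero_entropy_pairs`, the shape of the `previous` dictionary and the shape of the returned dictionary are all hypothetical. The sketch assumes that a target already predicted with zero entropy by n-1 predictors stays at zero when one more predictor is added.

```python
from itertools import combinations


def zero_entropy_pairs(previous, cells, n):
    """Find (predictors, target) combinations of size n that can be skipped.

    previous: hypothetical dict {(predictors_tuple, target): entropy}
              holding the results for n-1 predictors.
    Returns a dict {predictors_tuple: set of targets with known zero entropy}.
    """
    zero_before = {key for key, value in previous.items() if value == 0}
    skip = {}
    for preds in combinations(cells, n):
        for target in cells:
            if target in preds:
                continue
            # A subset of n-1 of these predictors already reached zero entropy.
            subsets = combinations(preds, n - 1)
            if any((sub, target) in zero_before for sub in subsets):
                skip.setdefault(preds, set()).add(target)
    return skip


# Made-up unary entropies for three hypothetical cells.
previous = {
    (("c1",), "c2"): 0.2,
    (("c1",), "c3"): 0.0,
    (("c2",), "c3"): 0.7,
}
skip = zero_entropy_pairs(previous, ["c1", "c2", "c3"], n=2)
print(skip)
```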

cond_metrics_log(md, group, classes, cells, subset=None, legacy=False)[source]¶

Calculate the entropy, while also keeping a log of the probability distributions (for one predictor).

Writes down the distributions \(P( patterns_{c1, c2} | classes_{c1, c2} )\) for all unordered pairs of columns in patterns. Also writes the entropy of the distributions.

Parameters:

md (qumin.utils.Metadata) – Metadata handler for this run.
group
classes
cells
subset
legacy

Returns: average entropy for this pair of cells

export_file(filename)[source]¶

Export the data DataFrame to a file.

Parameters: filename – the file’s path.

get_mean(**kwargs)[source]¶

Returns the average measures from the current run. If cell frequencies are available, they will be used.

Parameters: **kwargs – Keyword arguments passed to get_results().

Returns: mean (pandas.Series)

get_results(measure=None, n=1)[source]¶

Returns computation results from a distribution of patterns.

Parameters:

measure (List or str) – measure name. Defaults to all.
n (int) – Number of predictors to include in the mean.

Returns:

a DataFrame of results.

Return type:

pandas.DataFrame

get_weights(data, tokens=False)[source]¶

Returns weights computed from cell frequencies for pairs of cells.

The probability of a pair of cells is the product of the probability of the predictors with the probability of the target. The target is always drawn from the cells that are not predictors.

Let \(\{A_1, \dots A_n\}\) be the random variables describing the drawing of \(n\) predictors and \(B\) the random variable describing the drawing of a target cell.

One can write the following generic formula and rewrite it using the definition of conditional probability: \[\begin{split}\begin{align} P(\textrm{pred} \to \textrm{target}) &= P(\{A_1, \dots A_n\} = \textrm{pred}) P(B = \textrm{target}\mid B \notin \textrm{pred})\\ &= P(\{A_1, \dots A_n\} = \textrm{pred}) \frac{P(B = \textrm{target}\cap B \notin \textrm{pred})}{P(B \notin \textrm{pred})} \end{align}\end{split}\]
Since \(B = \textrm{target} \subset B \notin \textrm{pred}\), one can write: \[\begin{split}\begin{align} P(\textrm{pred} \to \textrm{target}) &= P(\{A_1, \dots A_n\} = \textrm{pred}) \frac{P(B = \textrm{target})}{P(B \notin \textrm{pred})}\\ &= P(\{A_1, \dots A_n\} = \textrm{pred}) \frac{P(B = \textrm{target})}{1 - \sum_{a\in \textrm{pred}}P(a)} \end{align}\end{split}\]

We now need to estimate \(P(\textrm{pred})\) and \(P(\textrm{target})\). Let \(f_i\) be the frequency of cell \(i\) and \(f\) the cumulated frequency of all cells.

In the simplest case with one predictor, the formula can be simplified to: \[\begin{split}\begin{align} P(\textrm{pred} \to \textrm{target}) &= P(A = \textrm{pred}) \frac{P(B = \textrm{target})}{P(B \neq \textrm{pred})}\\ &= \frac{f_\textrm{pred} f_\textrm{target}}{f f_{\overline{\textrm{pred}}}} \end{align}\end{split}\]
where \(f_{\overline{\textrm{pred}}} = f - f_\textrm{pred}\) is the cumulated frequency of all cells other than the predictor.
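
As a worked illustration with made-up frequencies (not taken from any dataset), take three cells with \(f_A = 5\), \(f_B = 3\) and \(f_C = 2\), so that \(f = 10\). Reading the frequency of the non-predictor cells as \(f - f_\textrm{pred} = 5\), the weights for predictor \(A\) would be: \[\begin{split}\begin{align} P(A \to B) &= \frac{5 \times 3}{10 \times 5} = 0.3\\ P(A \to C) &= \frac{5 \times 2}{10 \times 5} = 0.2 \end{align}\end{split}\]
The two weights sum to \(0.5 = f_A / f\), the probability of drawing \(A\) as the predictor.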

In the more complex case with \(n\) predictors, we need to estimate \(P(\{A_1, \dots A_n\} = \textrm{pred})\).

Let us consider:

\(S\) the set of all cells,
\(C^n_S\) the set of all unordered combinations of \(n\) cells.

For instance, if \(S=\{A, B, C\}\) and \(n=2\), then: \[C^n_S = \{\{A, B\}, \{A, C\}, \{B, C\}\}\]

If we draw random combinations of \(n\) cells, how often are we going to draw each item of \(C^n_S\)?

This value is: \[\begin{split}\begin{align} P(\{A_1, \dots A_n\} = \textrm{pred}) &= \frac{n!\prod_{a\in \textrm{pred}}P(a)} {\sum_{c\in C_S^n} n! \prod_{a\in c}P(a)} \\ &= \frac{\prod_{a\in \textrm{pred}} f_a}{\sum_{c\in C_S^n} \prod_{a\in c}f_a} \end{align}\end{split}\]
Finally: \[\begin{split}\begin{align} P(\textrm{pred} \to \textrm{target}) &= P(\{A_1, \dots A_n\} = \textrm{pred}) \frac{P(B = \textrm{target})}{1 - \sum_{a\in \textrm{pred}}P(a)}\\ &= \frac{f_\textrm{target}}{f}\Big(\frac{\prod_{a\in \textrm{pred}} f_a} {\sum_{c\in C_S^n} \prod_{a\in c}f_a \times (1-\sum_{a\in \textrm{pred}}\frac{f_a}{f})}\Big) \end{align}\end{split}\]

Notice that the second part of the final formula does not depend on \(B\) and can be computed for the predictors beforehand. We then compute the product of this with the relative frequency of the target cell.

Parameters:

data (pandas.DataFrame) – the full computation results.
tokens (boolean) – Whether the cell token frequencies should be used for weighting. Defaults to False.

Returns:

two arrays containing the probability of the pairs and the probability of the predictors (numpy.ndarray)
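
The final formula can be checked numerically with a small standalone sketch. It is not the method's actual code: the function name `pair_weights`, the plain dictionary of cell frequencies and the returned dictionary are assumptions made for illustration, and the sketch requires \(n\) to be smaller than the number of cells.

```python
from itertools import combinations
from math import prod


def pair_weights(freqs, n=1):
    """Compute P(pred -> target) for every set of n predictors and every target.

    freqs: hypothetical dict mapping each cell name to its frequency.
    Returns a dict {(predictors_tuple, target): weight}.
    """
    total = sum(freqs.values())
    combos = list(combinations(sorted(freqs), n))
    # Shared denominator: sum over all combinations of the frequency products.
    norm = sum(prod(freqs[a] for a in combo) for combo in combos)
    weights = {}
    for pred in combos:
        p_pred = prod(freqs[a] for a in pred) / norm
        # Probability mass left for the target once the predictors are excluded.
        remaining = 1 - sum(freqs[a] for a in pred) / total
        # This part depends only on the predictors and is computed once.
        pred_part = p_pred / remaining
        for target in freqs:
            if target not in pred:
                weights[pred, target] = pred_part * freqs[target] / total
    return weights


freqs = {"A": 5, "B": 3, "C": 2}  # made-up cell frequencies
one = pair_weights(freqs, n=1)
two = pair_weights(freqs, n=2)
print(one[("A",), "B"], two[("A", "B"), "C"])
```

Since the weights form a probability distribution over all predictor and target combinations, they sum to 1 for any valid \(n\).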

import_file(filename)[source]¶

Read already computed entropies from a file.

Parameters: filename (str) – the file’s path.

n_preds_condent(df, paradigms, zeros, n)[source]¶

Computes the probability distribution for n predictors.

Writes down the distributions:

\[P( patterns_{c1, c3}, \; \; patterns_{c2, c3} \; \; | classes_{c1, c3}, \; \; \; \; classes_{c2, c3}, \; \; patterns_{c1, c2} )\]

The result contains entropy \(H(c_{1}, ..., c_{n} \to c_{n+1} )\).

Values are computed for all unordered combinations of \((c_{1}, ..., c_{n+1})\) in the paradigms’ columns. Indexes are tuples \((c_{1}, ..., c_{n})\) and columns are the predicted cells \(c_{n+1}\).
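
The layout of the result (tuples of predictors as the index, predicted cells as the columns) can be pictured with a short sketch. The cell names and the `layout` dictionary are hypothetical and only illustrate the enumeration. They are not the data structure used by the method.

```python
from itertools import combinations

cells = ["c1", "c2", "c3"]  # hypothetical cell names
n = 2  # number of predictors

# One entry per unordered set of n predictors, listing the remaining target cells.
layout = {
    preds: [target for target in cells if target not in preds]
    for preds in combinations(cells, n)
}
for preds, targets in layout.items():
    print(preds, "->", targets)
```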

Example

For three cells c1, c2, c3 (n=2), the entropy of c1, c2 → c3, denoted \(H(c_{1}, c_{2} \to c_{3})\), is:

\[H( patterns_{c1, c3}, \; \; patterns_{c2, c3}\; \; | classes_{c1, c3}, \; \; \; \; classes_{c2, c3}, \; \; patterns_{c1, c2} )\]

Parameters:

n (int) – number of predictors.
df (pandas.DataFrame) – a DataFrame containing patterns and applicable patterns for pairs of forms.
paradigms (pandas.DataFrame) – a DataFrame of paradigms.
zeros (dict) – a dictionary of pairs that lead to an entropy of zero.

n_preds_condent_log(df, md, paradigms, n)[source]¶

Computes the probability distribution for n predictors and logs the details of the computations.

Writes down the distributions:

\[P( patterns_{c1, c3}, \; \; patterns_{c2, c3} \; \; | classes_{c1, c3}, \; \; \; \; classes_{c2, c3}, \; \; patterns_{c1, c2} )\]

The result contains entropy \(H(c_{1}, ..., c_{n} \to c_{n+1} )\).

Example

For three cells c1, c2, c3 (n=2), the entropy of c1, c2 → c3, denoted \(H(c_{1}, c_{2} \to c_{3})\), is:

\[H( patterns_{c1, c3}, \; \; patterns_{c2, c3}\; \; | classes_{c1, c3}, \; \; \; \; classes_{c2, c3}, \; \; patterns_{c1, c2} )\]

Parameters:

n (int) – number of predictors.
df (pandas.DataFrame) – a DataFrame containing patterns and applicable patterns for pairs of forms.
paradigms (pandas.DataFrame) – a DataFrame of paradigms.
md (qumin.utils.Metadata) – Metadata handler for this run.

n_preds_entropy(md, n, paradigms, export_log=False)[source]¶

Wrapper to prepare the computation of n-ary entropies.

Loops through the cells and runs the computations for every set of predictors.

Parameters:

md (qumin.utils.Metadata) – Metadata handler for this run.
n (int) – number of predictors.
paradigms (pandas.DataFrame) – a DataFrame of paradigms
export_log (bool) – Whether to export a full log. Defaults to False. Note that this uses a much less optimized computation.

one_pred_metrics(md, legacy=False, export_log=False, **kwargs)[source]¶

Return a pandas.DataFrame with unary entropies and counts of lexemes.

The result contains entropy \(H(c_{1} \to c_{2})\).

Values are computed for all unordered pairs of columns \((c_{1}, c_{2})\) where \(c_{1} \neq c_{2}\) in the keys of PatternDistribution.patterns.

Example

For two cells c1, c2, the entropy of c1 → c2, denoted \(H(c_{1} \to c_{2})\), is:

\[H( \textrm{patterns}_{c1, c2} | \textrm{classes}_{c1, c2} )\]

The probability distribution of the patterns, on which this entropy is computed, is derived from the probability distribution of the pairs of forms that instantiate the pattern. For the mathematical formalism, refer to the appendix of Bouton & Bonami 2026.

Parameters:

md (qumin.utils.Metadata) – Metadata handler for this run.
export_log (bool) – Whether to export a debug log. Defaults to False.
legacy (bool) – Whether to use legacy computations. This necessarily disables token frequencies.
kwargs (dict) – settings to retrieve frequencies.

prepare_data(n=1, legacy=False)[source]¶

Prepares the dataframe to store the results for an entropy computation.

Parameters:

n (int) – number of predictors to consider

Returns:

a dataframe with the predictors and the predicted cells, as well as some metadata.

Return type:

pandas.DataFrame

Module contents¶

qumin.predictability.P(x, weights=None, subset=None)[source]¶

Return the probability distribution of unique elements in a pandas.core.series.Series. The default is a uniform distribution over tokens: each token in `x` has the same probability, so the probability of a value is its relative frequency. If weights are provided, they will be used as the probability of the tokens.

Example

>>> P(pd.Series(["A", "B", "B"]))
A    0.333333
B    0.666667
Name: proportion, dtype: float64
>>> P(pd.Series(["A", "B", "B"]), weights=pd.Series([2, 1, 1]))
A    0.5
B    0.5
dtype: float64

Parameters:

x (pandas.core.series.Series) – A series of data.
weights (pandas.core.series.Series) – A series of weights.
subset (Iterable) – Only give the distribution for a subset of values.

Returns:

A Series whose index contains x’s unique elements and whose values are their probabilities in x.

Return type:

pandas.core.series.Series
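
The role of the weights can be spelled out with a plain-Python sketch. It is not the library's implementation: the function name `distribution` and the use of lists and dictionaries instead of pandas Series are assumptions made to keep the example self-contained.

```python
from collections import defaultdict


def distribution(values, weights=None):
    """Probability of each unique value; every token weighs 1 by default."""
    if weights is None:
        weights = [1] * len(values)
    mass = defaultdict(float)
    # Accumulate the weight of every token on its value.
    for value, weight in zip(values, weights):
        mass[value] += weight
    total = sum(mass.values())
    return {value: m / total for value, m in mass.items()}


uniform = distribution(["A", "B", "B"])
weighted = distribution(["A", "B", "B"], weights=[2, 1, 1])
print(uniform, weighted)
```

With the weights 2, 1, 1, the single token of "A" carries as much mass as the two tokens of "B" together, which gives the same 0.5 and 0.5 split as in the example above.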

qumin.predictability.cond_P(A, B, subset=None)[source]¶

Return the conditional probability distribution P(A|B) for elements in two pandas.core.series.Series.

Parameters:

A (pandas.core.series.Series) – A series of data.
B (pandas.core.series.Series) – A series of data.
subset (Iterable) – Only give the distribution for a subset of values.

Returns:

A Series with a two-level index. The first level holds the elements of B, the second the elements of A. The values are the probabilities P(A|B).

Return type:

pandas.core.series.Series
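
The shape of the result, with the conditioning value first and the conditioned value second, can be illustrated in plain Python. This is a sketch under assumptions: `conditional_distribution` is a hypothetical name, it works on lists rather than Series, and it ignores the `subset` argument.

```python
from collections import Counter


def conditional_distribution(a, b):
    """Return {(b_value, a_value): P(a_value | b_value)} from paired observations."""
    joint = Counter(zip(b, a))  # counts of each (b, a) pair
    marginal = Counter(b)  # counts of each b value
    return {(bv, av): count / marginal[bv] for (bv, av), count in joint.items()}


a = ["x", "y", "y", "y"]  # made-up observations
b = ["u", "u", "v", "v"]
cond = conditional_distribution(a, b)
print(cond)
```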

qumin.predictability.cond_entropy(A, B, **kwargs)[source]¶

Calculate the conditional entropy of A knowing B, two series of data points. Presupposes that values in the series are of the same type, typically tuples.

Parameters:

A (pandas.core.series.Series) – A series of data.
B (pandas.core.series.Series) – A series of data.

Returns:

H(A|B)
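
One standard way to obtain this quantity is the identity \(H(A|B) = H(A, B) - H(B)\). The sketch below applies it to plain lists. It assumes base-2 logarithms (entropy in bits) and uniform token weights, and the function names are hypothetical. Whether the library computes the value this way is not stated here.

```python
from collections import Counter
from math import log2


def entropy_of_counts(counts):
    """Shannon entropy in bits of the distribution given by raw counts."""
    total = sum(counts.values())
    return -sum(c / total * log2(c / total) for c in counts.values())


def conditional_entropy(a, b):
    """H(A|B) computed as H(A, B) - H(B)."""
    return entropy_of_counts(Counter(zip(a, b))) - entropy_of_counts(Counter(b))


a = ["x", "y", "y", "y"]  # made-up observations
b = ["u", "u", "v", "v"]
h = conditional_entropy(a, b)
print(h)
```

In this toy data, A is fully determined when B is "v" and evenly split when B is "u", so the conditional entropy is half a bit.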

qumin.predictability.cond_entropy_slow(df, classes, subset=None)[source]¶

Calculate the conditional entropy through a slower method (with iterations across all groups).

Parameters:

df (pandas.DataFrame) – the patterns distribution.
classes (pandas.Series) – the known features that are used to group the patterns.

Uses token frequencies to weight the patterns.

qumin.predictability.cond_psuccess(df, classes)[source]¶

Calculate the conditional probability of success.

Parameters:

df (pandas.DataFrame) – the patterns distribution.
classes (pandas.Series) – the known features that are used to group the patterns.

Uses token frequencies to weight the patterns.

qumin.predictability.entropy(A)[source]¶

Calculate the entropy for a series of probabilities.

Since some probabilities may be zero, we keep only positive values. This does not affect the result of the computation, because \(0 \log 0\) is taken to be 0 by convention.

Parameters: A (pandas.core.series.Series) – A series of numeric values.
Returns: H(A)
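
A minimal sketch of the computation, assuming base-2 logarithms and a plain list of probabilities in place of a Series (the name `entropy_bits` is hypothetical):

```python
from math import log2


def entropy_bits(probabilities):
    """Shannon entropy in bits; zero probabilities are skipped."""
    return -sum(p * log2(p) for p in probabilities if p > 0)


with_zero = entropy_bits([0.5, 0.5, 0.0])
without_zero = entropy_bits([0.5, 0.5])
print(with_zero, without_zero)
```

Dropping the zero entry leaves the result unchanged, which is why filtering to positive values is safe.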