Predictability

Reference

An early version of this software was used in Bonami and Beniamine (2016), and a more recent one in Beniamine, Bonami and Luís (2021). The latest implementation (including token frequencies and probability of success) is described in Bouton and Bonami (2026).

By default, the predictability action starts by computing patterns. To work with pre-computed patterns, pass the path to the pattern computation metadata with patterns=<path/to/metadata.json>.

Warning

Patterns and entropies computed with Qumin 2.0 cannot be imported into Qumin 3.0 because of a breaking change in the output format. When importing computation results, Qumin 3.0+ expects a path to the metadata.json file, which contains relative paths to the output files.

Predictability measures from one cell

/$ qumin action=pred data=<dataset.package.json>

Entropies for other numbers of predictors:

/$ qumin action=pred pred.n=2 data=<dataset.package.json>
/$ qumin action=pred pred.n="[2,3]" data=<dataset.package.json>

Warning

With pred.n greater than 2, the computation can take a long time on large datasets, and it may be better to run Qumin on a server. In addition, the probability of success is not supported when pred.n is greater than 1.
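To give a rough sense of why larger predictor sets are costly, the sketch below counts how many predictions there are to evaluate. It assumes that every unordered set of n predictor cells is paired with every remaining target cell; this is an assumption made for illustration, not a description of Qumin's internals, and the function name and the paradigm size of 50 cells are hypothetical.

```python
from math import comb


def count_predictions(n_cells: int, n_predictors: int) -> int:
    """Count (predictor set, target cell) combinations.

    Assumption: each unordered set of `n_predictors` cells can be used
    to predict any one of the remaining cells.
    """
    if not 0 < n_predictors < n_cells:
        return 0
    return comb(n_cells, n_predictors) * (n_cells - n_predictors)


# Hypothetical paradigm with 50 cells.
for n in (1, 2, 3):
    print(n, count_predictions(50, n))
```

Under this assumption, the count grows from 2,450 combinations with one predictor to 58,800 with two and 921,200 with three.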

Predictability measures with known features

Predicting with known lexeme-wise features (such as gender or inflection class) is also possible. This feature was used in Pellegrini (2023). To use features, pass the name of any column(s) from the lexemes table:

/$ qumin action=pred pred.features=inflection_class patterns=<metadata.json> data=<dataset.package.json>
/$ qumin action=pred pred.features="[inflection_class,gender]" patterns=<metadata.json> data=<dataset.package.json>
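The effect of treating a feature as known can be illustrated with a small conditional entropy computation. The following is a minimal sketch on toy data, not Qumin's implementation: the function, the row layout, and the column names predictor_class, gender and pattern are all hypothetical, and probabilities are estimated from simple type counts.

```python
from collections import Counter
from math import log2


def conditional_entropy(rows, given):
    """Estimate H(pattern | given) in bits from type counts.

    rows:  list of dicts with a "pattern" key plus conditioning keys.
    given: names of the keys treated as known.
    """
    joint = Counter((tuple(r[k] for k in given), r["pattern"]) for r in rows)
    context = Counter(tuple(r[k] for k in given) for r in rows)
    total = len(rows)
    h = 0.0
    for (ctx, _pattern), count in joint.items():
        # P(ctx, pattern) * log2 P(pattern | ctx)
        h -= (count / total) * log2(count / context[ctx])
    return h


# Toy lexemes (hypothetical): gender fully determines the pattern.
lexemes = [
    {"predictor_class": "A", "gender": "f", "pattern": "p1"},
    {"predictor_class": "A", "gender": "f", "pattern": "p1"},
    {"predictor_class": "A", "gender": "m", "pattern": "p2"},
    {"predictor_class": "A", "gender": "m", "pattern": "p2"},
]

print(conditional_entropy(lexemes, ["predictor_class"]))
print(conditional_entropy(lexemes, ["predictor_class", "gender"]))
```

In this toy example, the predictor class alone leaves one bit of uncertainty, while adding gender as a known feature removes it entirely.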

Using token frequencies to weight results

Qumin can weight results at several levels using type or token frequencies:

/$ qumin action=pred data=<dataset.package.json> pred.token_freq.patterns=True pred.token_freq.predictors=True

For more details on predictability measures and weighting, turn to the predictability How-To.
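As a rough illustration of the difference between type and token weighting, the sketch below computes the entropy of a pattern distribution twice: once counting each form pair once, and once weighting each pair by its token frequency. The data and names are hypothetical, and this is not the weighting scheme Qumin implements, only the general idea.

```python
from collections import Counter
from math import log2


def entropy(weights):
    """Shannon entropy in bits of a distribution given as raw weights."""
    total = sum(weights.values())
    return -sum((w / total) * log2(w / total) for w in weights.values() if w > 0)


# Hypothetical toy data: (pattern, token frequency of the form pair).
observations = [("p1", 90), ("p1", 5), ("p2", 3), ("p2", 2)]

type_weights = Counter()
token_weights = Counter()
for pattern, freq in observations:
    type_weights[pattern] += 1      # every pair counts once
    token_weights[pattern] += freq  # every pair counts by its frequency

print(round(entropy(type_weights), 4))
print(round(entropy(token_weights), 4))
```

With type counts the two patterns are equally likely (1 bit), whereas with token counts the frequent pattern dominates and the entropy drops to about 0.29 bits.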

Full reference

Predictability measure options are under the pred namespace. Available options (see also the common options):

class qumin.utils.config.PredictabilityConfig(*, vis=True, n=<factory>, features=None, importResults=None, token_freq=<factory>)

Configuration for entropy calculations.

Parameters:
  • vis (bool) – Whether to create a heatmap of the metrics and of interpredictability zones.

  • n (List[int]) – Compute entropy for prediction from n predictors.

  • features (Any | None) – Feature column(s) in the lexemes table. Features are treated as known in conditional probabilities: P(X~Y|X,f1,f2…)

  • importResults (str | None) – Import previous entropy computation results. Any results file can be used to compute the entropy heatmap; results computed with n-1 predictors also speed up the entropy computation with n predictors.

  • token_freq (TokenFreqConfig) – Whether to use token frequencies for…

Values for the token_freq keyword:

class qumin.utils.config.TokenFreqConfig(*, patterns=False, predictors=False, overabundant=False, cells=False)

Whether to use token frequencies for…

Parameters:
  • patterns (bool) – The probability of the patterns.

  • predictors (bool) – The probability of the predictor classes and of the predictor forms.

  • overabundant (bool) – The weighting of overabundant cellmates.

  • cells (bool) – The weighting of the measures across different cells.

The default configuration for these keys looks like this:

pred:
  vis: true
  'n':
  - 1
  features: null
  importResults: null
  token_freq:
    patterns: false
    predictors: false
    overabundant: false
    cells: false
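To make the link between these nested keys and the dotted names used on the command line (such as pred.token_freq.patterns) more concrete, here is a minimal sketch. It deliberately does not import Qumin: the two dataclasses are hypothetical stand-ins that only copy the default values listed above, and flatten is an illustrative helper, not part of Qumin.

```python
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class TokenFreqSketch:
    """Stand-in for TokenFreqConfig (not the qumin class itself)."""
    patterns: bool = False
    predictors: bool = False
    overabundant: bool = False
    cells: bool = False


@dataclass
class PredictabilitySketch:
    """Stand-in for PredictabilityConfig (not the qumin class itself)."""
    vis: bool = True
    n: List[int] = field(default_factory=lambda: [1])
    features: Optional[Any] = None
    importResults: Optional[str] = None
    token_freq: TokenFreqSketch = field(default_factory=TokenFreqSketch)


def flatten(mapping: Dict[str, Any], prefix: str) -> Dict[str, Any]:
    """Turn nested keys into dotted names such as pred.token_freq.cells."""
    flat: Dict[str, Any] = {}
    for key, value in mapping.items():
        name = f"{prefix}.{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat


defaults = flatten(asdict(PredictabilitySketch()), "pred")
for name, value in defaults.items():
    print(name, "=", value)
```

Each printed name corresponds to one key of the default configuration, written the way the examples on this page pass options on the command line.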

See the full Default configuration