How to refine predictability measures?
Original implementations of the Paradigm Cell Filling Problem often build on two assumptions:
Paradigms are canonical (i.e. each cell contains exactly one form),
Analogies are based on the type frequency of inflectional behaviours.
Starting from version 3.2, Qumin can compute predictability measures using more advanced options, based on an implementation described in Bouton & Bonami (2026).
Warning
These advanced options are incompatible with N-predictor computations.
Token frequencies
Qumin can take token frequencies into account at four different levels, each set with a key under pred.token_freq:
Frequency of the patterns (pred.token_freq.patterns): This makes a crucial change. When predicting, the most important pattern is no longer the one instantiated by the most lexemes, but the one instantiated by the most frequent pairs of forms.
Frequency of the predictor classes (pred.token_freq.predictors): This only plays a role when computing the average predictability for a pair of cells. It gives more weight to the classes of words that are the most frequent, instead of those that contain the most lexemes.
Frequency of overabundant cellmates (pred.token_freq.overabundant, defaults to True): see the Overabundance section below.
Frequency of the cells (pred.token_freq.cells): This only plays a role when computing the average predictability of the paradigm. Weighting by the frequency of the cells gives more weight to pairs of cells that are frequently encountered.
For instance:
qumin action=pred pred.token_freq.predictors=True pred.token_freq.cells=True
Qumin will always try to find frequencies in the forms table, in the lexemes table, in the cells table, and in the frequencies table.
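To make the difference between type and token weighting of patterns concrete, here is a minimal Python sketch. It is not Qumin's code: the toy data, the pattern labels and the helper names (`PAIRS`, `pattern_weights`, `entropy`) are all hypothetical, and the entropy is computed over a single predictor class for simplicity.

```python
from collections import Counter
from math import log2

# Hypothetical toy data: (lexeme, pattern, token frequency of the pair of forms).
PAIRS = [
    ("lex1", "A", 2),
    ("lex2", "A", 3),
    ("lex3", "A", 5),
    ("lex4", "B", 90),
]


def pattern_weights(pairs, use_token_freq):
    """Sum, per pattern, either one unit per lexeme or the token frequency."""
    weights = Counter()
    for _lexeme, pattern, freq in pairs:
        weights[pattern] += freq if use_token_freq else 1
    return weights


def entropy(weights):
    """Shannon entropy (in bits) of the distribution implied by the weights."""
    total = sum(weights.values())
    return sum(-(w / total) * log2(w / total) for w in weights.values() if w > 0)


type_weights = pattern_weights(PAIRS, use_token_freq=False)
token_weights = pattern_weights(PAIRS, use_token_freq=True)

# With type frequencies, pattern A dominates (3 lexemes against 1).
# With token frequencies, pattern B dominates (90 tokens against 10).
print(type_weights.most_common(1), entropy(type_weights))
print(token_weights.most_common(1), entropy(token_weights))
```

In this toy example, the pattern that counts as the most important one flips from A to B once token frequencies are used, which is the kind of change that pred.token_freq.patterns introduces.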
Probability of success
Guzmán Naranjo (2020) introduced the idea that predictability relations can be assessed by measuring the probability of success. Moreover, in most situations, probability of success and entropy give a similar picture of the system.
Starting from version 3.2, Qumin’s default behaviour is to compute both entropy and probability of success for every computation.
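The relation between the two measures can be illustrated with a small Python sketch. It rests on an assumption that is not taken from Qumin or from the cited papers: the probability of success is modelled here as that of a speaker who picks an outcome at random according to its probability, which gives the sum of squared probabilities. The function names are hypothetical, and the definition actually used by Qumin may differ.

```python
from math import log2


def entropy(probs):
    """Shannon entropy (in bits) of a probability distribution."""
    return sum(-p * log2(p) for p in probs if p > 0)


def success_by_sampling(probs):
    """Assumed model: the speaker samples an outcome with probability p
    and is right with probability p, hence the sum of squares."""
    return sum(p * p for p in probs)


# From a fully predictable distribution to a maximally uncertain binary one.
for probs in ([1.0], [0.9, 0.1], [0.5, 0.5]):
    print(probs, entropy(probs), success_by_sampling(probs))
```

Under this assumption, the two measures move in opposite directions: as the distribution becomes more uncertain, entropy rises and the probability of success falls, so both rank these three toy distributions in the same order of predictability.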
Overabundance
Overabundance refers to the situation where a lexeme has more than one form in a given paradigm cell. Conditional entropy is highly sensitive to situations of overabundance, except when weighted with form-level frequencies (pred.token_freq.overabundant=True). For this reason, take the following principles into account when running computations on overabundant paradigms:
If no form-level frequencies are available, the probability of success is more informative than the entropy. The entropy is interesting too, but it has a very different meaning.
If form-level frequencies are available, entropy and probability tend to give a similar picture. Enable form-level token frequencies as follows:
qumin action=pred data=<my.package.json> pred.token_freq.overabundant=True
When form-level frequencies are available, it is recommended to set this key to True.
Tip
For more explanations on the theoretical and mathematical background behind these implementations, please refer to the following paper: Bouton & Bonami (2026).