qumin.utils package

Submodules

qumin.utils.metadata module

class qumin.utils.metadata.Metadata(path=None, cfg=None, rundir_path=None)[source]

Bases: object

Metadata manager for Qumin scripts. Wrapper around the Frictionless Package class.

Basic usage:

  1. Register Metadata manager;

  2. Get an absolute path to the metadata folder;

  3. Write to that path;

  4. After writing a file, register it and set metadata (description, custom dict);

  5. Export the JSON descriptor.

The Metadata class can also be used in scripts that reuse Qumin results. In that case, pass a value to rundir_path if hydra is not set up (which is very likely in a simple script).

Examples

import omegaconf
from qumin.utils.metadata import Metadata

cfg = omegaconf.dictconfig.DictConfig({"data": "myparalex/package.json"})
md = Metadata(cfg=cfg, path="myprevious_run/metadata.json")
name = 'path/myfile.txt'
filename = md.get_path(name)
# Open an IO stream and write to ``filename``.
md.register_file(name, description="My nice file", custom={"property": "value"})
md.save_metadata()

Variables:
  • start (datetime) – timestamp at the beginning of the run.

  • prefix (Path) – normalized prefix for the output files

  • cfg (OmegaConf) – all arguments passed to the python script

  • paralex (frictionless.Package) – a frictionless Package representing a dataset.

__init__(path=None, cfg=None, rundir_path=None)[source]
Parameters:
  • cfg (OmegaConf.dictconfig.DictConfig) – arguments passed to the script.

  • path (str) – Path to a Frictionless descriptor of a previous run to be imported.

  • rundir_path (str) – Directory that should be used to export the results. Useful only if path is None and hydra is not used.

get_paradigm_conf(cfg)[source]

Load paradigm creation keywords from previous run.

A few sanity checks are performed to ensure the user didn’t pass arguments that contradict those of the previous run. If contradictory arguments are found, a warning is issued and the old arguments are kept.

Under some conditions, arguments can be overwritten (e.g. cells list).

Parameters:

cfg (OmegaConf.dictconfig.DictConfig) – Arguments passed to the current run. These arguments might override arguments from the previous run, under specific conditions.
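
The merging behaviour described above (keep the old value and warn on a conflict, but let some keys such as the cells list be overwritten) can be illustrated with a minimal sketch. This is not Qumin's implementation: the helper merge_run_config, its overridable argument, and the example keys and values are hypothetical, and plain dicts stand in for the OmegaConf objects.

```python
import warnings


def merge_run_config(old, new, overridable=("cells",)):
    """Merge new arguments into those of a previous run.

    Hypothetical helper: keys listed in ``overridable`` may be replaced;
    any other conflicting key triggers a warning and keeps the old value.
    """
    merged = dict(old)
    for key, value in new.items():
        if key not in old or old[key] == value:
            merged[key] = value
        elif key in overridable:
            merged[key] = value  # allowed override (e.g. the cells list)
        else:
            warnings.warn(
                f"Ignoring {key}={value!r}: previous run used {old[key]!r}."
            )
    return merged


# Made-up arguments: "pos" conflicts, "cells" is allowed to change.
previous = {"pos": "verb", "cells": ["prs.1sg", "prs.2sg", "prs.3sg"]}
current = {"pos": "noun", "cells": ["prs.1sg", "prs.3sg"]}
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    merged = merge_run_config(previous, current)
```

Here merged keeps the old pos value and takes the new cells list, and one warning is recorded for the rejected argument.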

get_paradigms(md, **kwargs)[source]

Creates paradigms with a stable config strategy.

Parameters:
  • md (qumin.utils.metadata.Metadata) – Metadata handler of the current run.

  • kwargs (dict) – Additional keyword arguments are passed to qumin.representations.paradigms.Paradigms.

get_path(rel_path)[source]

Return an absolute path to a file and create parent directories.

Parameters:

rel_path (str) – relative path to the file or folder.

Returns:

absolute path to the file or folder.

Return type:

pathlib.Path
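
As a rough illustration of what "return an absolute path and create parent directories" means in practice, here is a minimal sketch using only pathlib. The function resolve_output_path and its run_dir argument are hypothetical names, not part of Qumin; the real method resolves paths relative to the run's own output folder.

```python
import tempfile
from pathlib import Path


def resolve_output_path(run_dir, rel_path):
    """Return an absolute path under ``run_dir``, creating parent folders."""
    target = (Path(run_dir) / rel_path).resolve()
    target.parent.mkdir(parents=True, exist_ok=True)
    return target


# A temporary directory stands in for the run directory.
with tempfile.TemporaryDirectory() as tmp:
    out = resolve_output_path(tmp, "path/myfile.txt")
    out.write_text("some result\n")  # the parent folder already exists
    parent_created = out.parent.is_dir()
    is_absolute = out.is_absolute()
```

Because the parent folders are created up front, the caller can open the returned path for writing straight away.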

get_pattern_conf()[source]

Load pattern creation keywords from previous run. No sanity checks are performed: all relevant arguments have already been tested when loading the paradigms.

get_patterns(paradigms, **kwargs)[source]

Creates patterns with a stable config strategy.

get_resource_path(resource)[source]

Return the full path to a resource.

Parameters:

resource (str) – A resource name

Returns:

a path to the resource.

Return type:

pathlib.Path

get_table_path(table_name)[source]

Return the path to a dataset table.

register_file(rel_path, name=None, custom=None, **kwargs)[source]

Add a file as a frictionless resource.

Parameters:
  • rel_path (str or pathlib.Path) – the relative path to the file.

  • name (str) – name of the resource. By default, this will be the name of the file without the extension.

  • custom (dict) – Custom properties to save.

  • **kwargs (dict) – Optional keyword arguments passed to Resource, e.g. description.
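
The default for name ("the name of the file without the extension") can be pictured with pathlib. This is only a sketch of that rule, assuming it behaves like Path.stem; default_resource_name is a hypothetical helper, and any further normalization Qumin or Frictionless may apply is not shown.

```python
from pathlib import Path


def default_resource_name(rel_path):
    """Derive a resource name from a file path by dropping the extension."""
    return Path(rel_path).stem


default_name = default_resource_name("path/myfile.txt")
```

Note that Path.stem only drops the last suffix, so a file such as archive.tar.gz would yield archive.tar under this assumption.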

save_metadata()[source]

Save the metadata as a JSON file.

qumin.utils.metadata.diff(msg, old, new)[source]
qumin.utils.metadata.get_recursively(cfg, key)[source]

Module contents

qumin.utils.adjust_cpus(n)[source]
qumin.utils.memory_check(df, factor, max_gb=2, force=False)[source]

Checks the memory usage of a dataframe and warns if it exceeds a given limit.

Parameters:
  • df (pandas.DataFrame) – dataframe to test

  • factor (int) – multiplication factor for the test.

  • max_gb (float) – Threshold for the memory warning, in gigabytes.

  • force (bool) – whether to allow exceeding the limit. Defaults to False.
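
The kind of check these parameters suggest can be sketched as follows. Several points are assumptions rather than documented behaviour: the function check_memory is hypothetical, it takes a byte count instead of a dataframe (with pandas, that count could come from df.memory_usage(deep=True).sum()), a gigabyte is taken as 1024**3 bytes, and exceeding the limit raises an error unless force is set.

```python
import warnings

BYTES_PER_GB = 1024 ** 3  # assumption: binary gigabytes


def check_memory(n_bytes, factor, max_gb=2, force=False):
    """Estimate ``n_bytes * factor`` in GB and compare it to ``max_gb``."""
    estimate_gb = n_bytes * factor / BYTES_PER_GB
    if estimate_gb <= max_gb:
        return estimate_gb
    message = f"Estimated memory use {estimate_gb:.1f} GB exceeds {max_gb} GB."
    if not force:
        raise MemoryError(message)  # assumption: abort unless forced
    warnings.warn(message)
    return estimate_gb


# 100 MB of data with a multiplication factor of 4 stays under 2 GB.
small = check_memory(n_bytes=100 * 1024 ** 2, factor=4)

# 1 GB with a factor of 3 goes over the limit and is blocked.
try:
    check_memory(n_bytes=BYTES_PER_GB, factor=3)
    blocked = False
except MemoryError:
    blocked = True
```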

qumin.utils.config module

class qumin.utils.config.Actions(*values)[source]

Bases: str, Enum

Available actions. Each action triggers a different script.

Changed in version 3.2.0: Actions H and ent_heatmap are replaced by pred and pred_heatmap.

H = 'H'
ent_heatmap = 'ent_heatmap'
heatmap = 'heatmap'
lattice = 'lattice'
macroclasses = 'macroclasses'
patterns = 'patterns'
pred = 'pred'
pred_heatmap = 'pred_heatmap'
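
Because the enumeration derives from both str and Enum, its members compare equal to plain strings, which is convenient when the action comes from a config file or the command line. The sketch below shows this, along with one possible way to map the deprecated names to their replacements. ActionSketch, REPLACEMENTS and normalize_action are hypothetical names; only the two replacements themselves come from the 3.2.0 change note.

```python
from enum import Enum


class ActionSketch(str, Enum):
    """Stand-in for an action enumeration mixing ``str`` and ``Enum``."""
    patterns = "patterns"
    pred = "pred"
    pred_heatmap = "pred_heatmap"
    H = "H"                      # deprecated
    ent_heatmap = "ent_heatmap"  # deprecated


# Deprecated names and their replacements (from the 3.2.0 change note).
REPLACEMENTS = {
    ActionSketch.H: ActionSketch.pred,
    ActionSketch.ent_heatmap: ActionSketch.pred_heatmap,
}


def normalize_action(name):
    """Parse a string into an action, mapping deprecated names forward."""
    action = ActionSketch(name)
    return REPLACEMENTS.get(action, action)


current = normalize_action("H")
```

An unknown name raises ValueError when passed to ActionSketch, so invalid actions are rejected at parse time.
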
class qumin.utils.config.HeatmapConfig(*, label=None, cmap=None, exhaustive_labels=False, dense=False, annotate=False, order=None, cols=None, display=<factory>)[source]

Bases: object

Parameters:
  • label (str | None) – Lexeme column to use as label (for microclass heatmap, e.g. inflection_class)

  • cmap (str | None) – Colormap name

  • exhaustive_labels (bool) – by default, seaborn shows only some labels on the heatmap for readability. This forces seaborn to print all labels.

  • dense (bool) – Use initials instead of full labels (only for entropy heatmap)

  • annotate (bool) – Display values on the heatmap. (only for entropy heatmap)

  • order (Any | None) – Priority list for sorting features (for entropy heatmap), e.g. [number, case]. If no features-values file is available, use the key cells to provide an ordered list of cells to display. The special value “autosort” sorts by cell similarity.

  • cols (Any | None) – List of features to show in columns (for zones heatmap), e.g. [Mode, Tense]. All other features will constitute rows.

  • display (HeatmapDisplayConfig) – Options to switch on/off additional heatmaps.

annotate: bool = False
cmap: str | None = None
cols: Any | None = None
dense: bool = False
display: HeatmapDisplayConfig
exhaustive_labels: bool = False
label: str | None = None
order: Any | None = None
class qumin.utils.config.HeatmapDisplayConfig(*, n_pairs=True, freq_margins=True)[source]

Bases: object

Set to True/False to show or hide detailed information on the heatmap

Parameters:
  • n_pairs (bool) – Whether to display the number of pairs.

  • freq_margins (bool) – Whether to display frequency margins on heatmaps.

freq_margins: bool = True
n_pairs: bool = True
class qumin.utils.config.Kind(*values)[source]

Bases: str, Enum

Kind of algorithm for the patterns.

Parameters:
  • phon – phonological distance

  • edits – simple edit distance

edits = 'edits'
phon = 'phon'
class qumin.utils.config.LatticeConfig(*, shorten=False, aoc=False, html=False, ctxt=False, stat=False, pdf=True, png=False)[source]

Bases: object

Configuration for the lattice action.

Parameters:
  • shorten (bool) – Drop redundant columns altogether. Useful for big contexts, but loses information. The lattice shape and stats will be the same. Avoid using with --html.

  • aoc (bool) – Only attribute and object concepts

  • html (bool) – Export to html

  • ctxt (bool) – Export as a context

  • stat (bool) – Output stats about the lattice

  • pdf (bool) – Export as pdf

  • png (bool) – Export as png

aoc: bool = False
ctxt: bool = False
html: bool = False
pdf: bool = True
png: bool = False
shorten: bool = False
stat: bool = False
class qumin.utils.config.OverabundantPatternsConfig(*, keep=False, freq=True, tags=None)[source]

Bases: object

Configuration for the processing of overabundant forms.

Parameters:
  • keep (bool) – Whether to keep overabundant entries

  • freq (bool) – Whether to prioritize overabundant forms by frequency (fallback on file order)

  • tags (Any | None) – Tags to prefer when dropping overabundance (fallback on freq)

freq: bool = True
keep: bool = False
tags: Any | None = None
class qumin.utils.config.PatternsConfig(*, kind=Kind.phon, defective=False, gap_proportion=0.4, optim_mem=False, overabundant=<factory>)[source]

Bases: object

Configuration for the patterns action.

Parameters:
  • kind (Kind) – Options are (see docs): phon, edits

  • defective (bool) – Whether to keep defective entries

  • gap_proportion (float) – Proportion of the median score used to set the gap score

  • optim_mem (bool) – Attempt to use a little bit less memory

  • overabundant (OverabundantPatternsConfig) – Configuration for overabundance

defective: bool = False
gap_proportion: float = 0.4
kind: Kind = 'phon'
optim_mem: bool = False
overabundant: OverabundantPatternsConfig
class qumin.utils.config.PredictabilityConfig(*, vis=True, n=<factory>, features=None, importResults=None, token_freq=<factory>)[source]

Bases: object

Configuration for entropy calculations.

Parameters:
  • vis (bool) – Whether to create a heatmap of the metrics and of interpredictability zones.

  • n (List[int]) – Compute entropy for prediction from n predictors.

  • features (Any | None) – Feature column in the Lexeme table. Features will be considered known in conditional probabilities: P(X~Y|X,f1,f2…)

  • importResults (str | None) – Import previous entropy computation results. Any results file can be used to compute an entropy heatmap; results with n-1 predictors speed up the entropy computation with n predictors.

  • token_freq (TokenFreqConfig) – Whether to use token frequencies for…

features: Any | None = None
importResults: str | None = None
n: List[int]
token_freq: TokenFreqConfig
vis: bool = True
class qumin.utils.config.QuminConfig(*, action=Actions.patterns, data, patterns=None, pos=None, lexemes=None, cells=None, sample_lexemes=None, sample_cells=None, force_random=False, seed=1, force=False, cpus=1, resegment=False, checkSegments=True, pats=<factory>, lattice=<factory>, heatmap=<factory>, pred=<factory>, entropy='${oc.deprecated:pred}')[source]

Bases: object

Parameters:
  • action (Actions) – Action, one of: patterns, pred (H is deprecated), lattice, macroclasses, heatmap, pred_heatmap (ent_heatmap is deprecated)

  • data (str) – Path to the Paralex package.json descriptor (paradigms, segments)

  • cells (Any | None) – Cells to use (subset)

  • pos (Any | None) – Parts of speech to use (subset)

  • patterns (str | None) – Path to pattern computation metadata. If null, will compute patterns.

  • lexemes (str | None) – Lexemes to use (subset), path to a file with one lexeme id per row

  • sample_lexemes (int | None) – A number of lexemes to sample, for debug purposes.

  • sample_cells (int | None) – A number of cells to sample, for debug purposes. Samples by frequency if possible, otherwise randomly.

  • force_random (bool) – Whether to force random sampling.

  • seed (int) – Random seed for reproducible random effects.

  • force (bool) – Whether to bypass the RAM usage safeguard (2 GB)

  • cpus (int) – Number of CPUs to use for big computations. Defaults to 1. 0 uses the maximum number of available CPUs minus 2. WARNING: cpus > 1 is currently unavailable on Windows and Mac.

  • resegment (bool) – Whether to resegment phonological forms, i.e. ignore spaces in phon forms and re-compute the phonemic segmentation.

  • checkSegments (bool) – Whether to check that all forms contain licit segments.

  • pats (PatternsConfig) – Configuration for the patterns action.

  • lattice (LatticeConfig) – Configuration for the lattice action.

  • heatmap (HeatmapConfig) – Configuration for the pred_heatmap action.

  • pred (PredictabilityConfig) – Configuration for the pred action.

  • entropy (PredictabilityConfig) – Deprecated; use pred instead.

Changed in version 3.2.0: Namespace entropy is replaced by pred.
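
The convention described for the cpus parameter (0 means all available CPUs minus two) can be sketched with os.cpu_count. The function resolve_cpus is a hypothetical stand-in, not the actual qumin.utils.adjust_cpus; the floor of one worker is an assumption, and the Windows and Mac restriction is not modelled.

```python
import os


def resolve_cpus(n):
    """Return a worker count; 0 means 'all available CPUs minus two'."""
    if n == 0:
        available = os.cpu_count() or 1  # cpu_count() may return None
        return max(1, available - 2)     # assumption: never go below 1
    return n


workers = resolve_cpus(0)
```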

action: Actions = 'patterns'
cells: Any | None = None
checkSegments: bool = True
cpus: int = 1
data: str
entropy: PredictabilityConfig = '${oc.deprecated:pred}'
force: bool = False
force_random: bool = False
heatmap: HeatmapConfig
lattice: LatticeConfig
lexemes: str | None = None
pats: PatternsConfig
patterns: str | None = None
pos: Any | None = None
pred: PredictabilityConfig
resegment: bool = False
sample_cells: int | None = None
sample_lexemes: int | None = None
seed: int = 1
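
The interplay between sample_cells, force_random and seed can be illustrated with a small sketch. Here "samples by frequency" is read as keeping the most frequent items, which is an assumption; sample_items, its arguments, and the cell names and frequencies are all hypothetical.

```python
import random


def sample_items(items, k, freqs=None, seed=1):
    """Pick ``k`` items: most frequent first if ``freqs`` is given, else random."""
    if freqs is not None:
        ranked = sorted(items, key=lambda item: freqs.get(item, 0), reverse=True)
        return ranked[:k]
    rng = random.Random(seed)  # local generator: same seed, same sample
    return rng.sample(list(items), k)


cells = ["prs.1sg", "prs.2sg", "prs.3sg", "pst.1sg", "pst.3sg"]
by_freq = sample_items(cells, 2, freqs={"prs.3sg": 90, "pst.3sg": 40, "prs.1sg": 5})
first_draw = sample_items(cells, 2)
second_draw = sample_items(cells, 2)
```

With a fixed seed, the two random draws are identical, which is what makes sampled debug runs reproducible.
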
class qumin.utils.config.TokenFreqConfig(*, patterns=False, predictors=False, overabundant=False, cells=False)[source]

Bases: object

Whether to use token frequencies for…

Parameters:
  • patterns (bool) – The probability of the patterns.

  • predictors (bool) – The probability of the predictor classes and of the predictor forms.

  • overabundant (bool) – The weighting of overabundant cellmates

  • cells (bool) – The weighting of the measures across different cells.

cells: bool = False
overabundant: bool = False
patterns: bool = False
predictors: bool = False
qumin.utils.config.register_config()[source]

Register the Config class under the name ‘qumin’.