qumin.clustering package

Submodules

qumin.clustering.algorithms module

Algorithms for inflection classes clustering.

Author: Sacha Beniamine

qumin.clustering.algorithms.choose(iterable)[source]

Choose a random element in an iterable of iterable. The iterable can have more than 1 dimension (but the choice will be done on the first dimension).

qumin.clustering.algorithms.hierarchical_clustering(patterns, paradigms, Clusters, **kwargs)[source]

Perform hierarchical clustering on patterns according to a clustering algorithm and a measure.

This function ::

Finds microclasses. Performs the clustering, Finds the macroclasses (and exports them), Returns the inflection class tree.

The clustering algorithm is the following:

Begin with one cluster per microclasses.
While there is more than one cluster :
    Find the best possible merge of two clusters, among all possible pairs.
    Perform this merge

Scoring, finding the best merges, merging nodes depends on the Clusters class.

Parameters:
  • patterns (patterns.ParadigmPatterns) – alternation patterns

  • paradigms (paradigms.Paradigms) – paradigms of forms

  • Clusters – a cluster class to use in clustering.

  • clustering_algorithm (Callable) – a clustering algorithm.

  • kwargs – any keywords arguments to pass to Clusters. Some keywords are mandatory : “md” should be the Metadata register, “patterns” should be a function for pattern finding

qumin.clustering.algorithms.log_classes(classes, md, suffix)[source]

qumin.clustering.descriptionlength module

Classes to make clustering decisions and build inflection class trees according to description length.

Author: Sacha Beniamine

class qumin.clustering.descriptionlength.BUDLClustersBuilder(microclasses, patterns, **kwargs)[source]

Bases: object

Builder for hierarchical clusters of inflection classes with description length based decisions.

This class holds two representations of the clusters it builds. On one hand, the class Cluster represents the informations needed to compute the description length of a cluster. On the other hand, the class Node represents the inflection classes being built. A Node can have children and a parent, a Cluster can be splitted or merged.

This class inherits attributes.

Variables:
  • microclasses (dict[str, list]) – Inherited. mapping of microclasses exemplars to microclasses inventories.

  • nodes (dict[frozenset, Node]) – Inherited. Maps frozensets of microclass exemplars to Nodes representing clusters.

  • preferences (dict) – Inherited. Configuration parameters.

  • attr (str) – (class attribute) always have the value “DL”, as the nodes of the Inflection class tree have a “DL” attribute.

  • DL (float) – A description length DL, with DL(system) = DL(M) + DL(C) + DL(P) + DL(R)

  • M (float) – DL(M), the cost in bits to express the mapping between lexemes and microclasses.

  • C (float) – DL(C), the cost in bits to express the mapping between microclasses and clusters.

  • P (float) – DL(P), the cost in bits to express the relation between clusters and patterns.

  • R (float) – DL(R), the cost in bits to disambiguiate which pattern to use in each cluster for each microclasses.

  • clusters (dict[frozenset, Cluster]) – Clusters, indexed by a frozenset of microclass examplars.

  • patterns (dict[str, collections.Counter]) –

    A dict of pairs of cells to a count of patterns to the number of clusters presenting this pattern for this cell.:

    { str: Counter({Pattern: int }) }
    pairs of cells -> pattern -> number of clusters with this pattern for this cell
    

    Note that the Counter’s length is written on a .length attribute, to avoid calling len() repeatedly. Remark that the count is not the same as in the class Cluster.

  • size (int) – The size of the whole system in microclasses.

__init__(microclasses, patterns, **kwargs)[source]

Constructor.

Parameters:
  • microclasses (dict[str, list]) – mapping of microclasses exemplars to microclasses inventories.

  • patterns (patterns.ParadigmPatterns) – patterns

  • kwargs – keyword arguments to be used as configuration.

attr = 'DL'
compute_DL(M=False)[source]
find_ordered_merges()[source]

Find the list of all best merges of two clusters.

The list is a list of tuples of length 3 containing two frozensets representing the labels of the clusters to merge and the description length of the resulting system.

initialize_clusters(patterns)[source]
initialize_patterns()[source]
merge(a, b)[source]

Merge two Clusters, build a Node to represent the result, update the DL.

Parameters:
  • a (str) – the label of a cluster to merge.

  • b (str) – the label of a cluster to merge.

rootnode()[source]

Return the root of the Inflection Class tree, if it exists.

class qumin.clustering.descriptionlength.Cluster(*args)[source]

Bases: object

A single cluster in MDL clustering.

A Cluster is iterable. Itering on a cluster is itering on its patterns. Clusters can be merged or separated by adding or substracting them.

Variables:
  • patterns (collections.defaultdict) –

    For each pair of cell in the paradigms under consideration, it holds a counter of the number of microclass using each pattern in this cluster and pair of cells.:

    { str: Counter({Pattern: int }) }
    pairs of cells -> pattern -> number of microclasses using this pattern for this cell
    

    Note that the Counter’s length is written on a .length attribute, to avoid calling len() repeatedly.

  • labels (set) – the set of all exemplars representing the microclasses in this cluster.

  • size (int) – The size of this cluster. Depending on external parameters, this can be the number of microclasses or the number of lexemes belonging to the cluster.

  • totalsize (int) – The size of the whole system of clusters, either number of microclasses in the system, or number of lexemes in the system.

  • R – The cost in bits to disambiguate for each pair of cells which pattern is to be used with which microclass.

  • C – The contribution of this cluster to the cost of mapping from microclasses to clusters.

__init__(*args)[source]

Initialize single cluster.

Parameters:

args (str) – Names (exemplar) of each microclass belonging to the cluster.

init_from_paradigm(class_size, patterns, size)[source]

Populate fields according to a paradigm column.

This assumes an initialization with only one microclass.

Parameters:
  • class_size (int) – the size of the microclass

  • patterns (patterns.ParadigmPatterns) – patterns

  • size (int) – total size

qumin.clustering.descriptionlength.weighted_log(symbol_count, message_length)[source]

Compute \(-log_{2}(symbol_count/message_length) * message_length\).

This corresponds to the product inside the sum of the description length formula when probabilities are estimated on frequencies.

Parameters:
  • symbol_count (int) – a count of symbols.

  • message_length (int) – the size of the message.

Returns:

the weighted log

Return type:

(float)

qumin.clustering.node module

class qumin.clustering.node.Node(labels, children=None, **kwargs)[source]

Bases: object

Represent an inflection class tree or lattice.

Variables:
  • labels (list) – labels of all the leaves under this node.

  • children (list) – direct children of this node.

  • attributes (dict) –

    attributes for this node. Currently, three attributes are expected: size (int): size of the group represented by this node. DL (float): Description length for this node. color (str): color of the splines from this node to its children, in a format usable by pyplot. Currently, red (“r”) is used when the node didn’t decrease Description length, blue (“b”) otherwise. macroclass (bool): Is the node in a macroclass ? macroclass_root (bool): Is the node the root of a macroclass ?

    The attributes “_x” and “_rank” are reserved, and will be overwritten by the draw function.

__init__(labels, children=None, **kwargs)[source]

Node constructor.

Parameters:
  • labels (Iterable) – labels of all the leaves under this node.

  • children (list) – direct children of this node.

  • **kwargs – any other keyword argument will be added as node attributes. Note that certain algorithm expect the Node to have (int) “size”, (str) “color”, (bool) “macroclass”, or (float) “DL” attributes.

Note

The attributes “_x” and “_rank” are reserved, and will be overwritten by the draw function.

draw(horizontal=False, square=False, leavesfunc=<function Node.<lambda>>, nodefunc=None, label_rotation=None, annotateOnlyMacroclasses=False, point=None, edge_attributes=None, interactive=False, layout=False, pos=None, **kwargs)[source]

Draw the tree as a dendrogram-style pyplot graph.

Example:

                          square=True        square=False

                         │  ┌──┴──┐         │    ╱╲
horizontal=False         │  │   ┌─┴─┐       │   ╱  ╲
                         │  │   │   │       │  ╱   ╱╲
                         │  │   │   │       │ ╱   ╱  ╲
                         │__│___│___│       │╱___╱____╲

                        │─────┐             │⟍
                        │───┐ ├             │  ⟍
horizontal=True         │   ├─┘             │⟍ ⟋
                        │───┘               │⟋
                        │____________       │____________
Parameters:
  • horizontal (bool) – Should the tree be drawn with leaves on the y axis ? (Defaults to False: leaves on x axis).

  • square (bool) – Should the tree splines be squared with 90° angles ? (Defauls to False)

  • leavesfunc (Callable) – A function that will be applied to leaves before writing them down. Takes a Node, returns a str.

  • nodefunc (Callable) – A function that will be applied to nodes to annotate them. Takes a Node, returns a str.

  • keep_above_macroclass (bool) – For macroclass history trees: Should the edges above macroclasses be drawn ? (Defaults to True).

  • annotateOnlyMacroclasses – For macroclass history trees: If True and nodelabel isn’t None, only the macroclasses nodes are annotated.

  • point (Callable) – A function that maps a node to point attributes.

  • edge_attributes (Callable) –

    A function that maps a pair of nodes to edge attributes.

    By default, use the parent’s color and “-” linestyle for nodes, “–” for leaves.

  • interactive (bool) – Whether this is destined to create an interactive plot.

  • layout (bool) – layout keyword, either of “qumin” or “dot”. Ignored if pos is given.

  • pos (dict) – A dictionnary of node label to x,y positions. Compatible with networkx layout functions. If absent, use networkx’s graphviz layout.

leaves()[source]
macroclasses(parent_is_macroclass=False)[source]

Find all the macroclasses nodes in this tree

to_networkx()[source]
to_tikz(leavesfunc=None, nodefunc=None, layout='qumin', pos=None, ratio=(1, 1), width=20, color_attrs=None, y_factor=0.8)[source]
tree_string()[source]

Return the inflection class tree as a string with parenthesis.

Assumes size, DL and color attributes, with color = “r” if this is above a macroclass.

Example

In the label, fields are separated by “#” as such:

(<labels>#<size>#<DL>#<color> )
qumin.clustering.node.string_to_node(string, legacy_annotation_name=None)[source]

Parse an inflection tree written as a string.

Example

In the label, fields are separated by “#” as such:

(<labels>#<size>#<DL>#<color> (... ) (... ) )
Returns:

The root of the tree

Return type:

Node

Module contents

qumin.clustering.find_microclasses(paradigms, patterns, freqs=None)[source]

Find microclasses in a paradigm (lines with identical rows).

This is useful to identify an exemplar of each inflection microclass, and limit further computation to the collection of these exemplars.

Parameters:
  • paradigms (pandas.DataFrame) – a dataframe containing inflectional paradigms. rows describe a pattern between forms from a given lexeme for a given cell.

  • freqs (pandas.Series) – a series of frequencies for each lemma

Returns:

microclasses (dict).

classes is a dict. Its keys are exemplars, its values are lists of the name of rows identical to the exemplar. Each exemplar represents a macroclass.

{"a":["a","A","aa"], "b":["b","B","BBB"]}

qumin.clustering.find_min_attribute(tree, attr)[source]

Find the minimum value for an attribute in a tree.

Parameters:
  • tree (node.Node) – The tree in which to find the minimum attribute.

  • attr (str) – the attribute’s key.