qumin.clustering package¶

Submodules¶

qumin.clustering.algorithms module¶

Algorithms for inflection classes clustering.

Author: Sacha Beniamine

qumin.clustering.algorithms.choose(iterable)[source]¶: Choose a random element in an iterable of iterable. The iterable can have more than 1 dimension (but the choice will be done on the first dimension).

qumin.clustering.algorithms.hierarchical_clustering(patterns, paradigms, Clusters, **kwargs)[source]¶

Perform hierarchical clustering on patterns according to a clustering algorithm and a measure.

This function ::: Finds microclasses. Performs the clustering, Finds the macroclasses (and exports them), Returns the inflection class tree.

The clustering algorithm is the following:

Begin with one cluster per microclasses.
While there is more than one cluster :
    Find the best possible merge of two clusters, among all possible pairs.
    Perform this merge

Scoring, finding the best merges, merging nodes depends on the Clusters class.

Parameters:

patterns (patterns.ParadigmPatterns) – alternation patterns
paradigms (paradigms.Paradigms) – paradigms of forms
Clusters – a cluster class to use in clustering.
clustering_algorithm (Callable) – a clustering algorithm.
kwargs – any keywords arguments to pass to Clusters. Some keywords are mandatory : “md” should be the Metadata register, “patterns” should be a function for pattern finding

qumin.clustering.algorithms.log_classes(classes, md, suffix)[source]¶

qumin.clustering.descriptionlength module¶

Classes to make clustering decisions and build inflection class trees according to description length.

Author: Sacha Beniamine

class qumin.clustering.descriptionlength.BUDLClustersBuilder(microclasses, patterns, **kwargs)[source]¶

Bases: object

Builder for hierarchical clusters of inflection classes with description length based decisions.

This class holds two representations of the clusters it builds. On one hand, the class Cluster represents the informations needed to compute the description length of a cluster. On the other hand, the class Node represents the inflection classes being built. A Node can have children and a parent, a Cluster can be splitted or merged.

This class inherits attributes.

Variables:

microclasses (dict[str, list]) – Inherited. mapping of microclasses exemplars to microclasses inventories.
nodes (dict[frozenset, Node]) – Inherited. Maps frozensets of microclass exemplars to Nodes representing clusters.
preferences (dict) – Inherited. Configuration parameters.
attr (str) – (class attribute) always have the value “DL”, as the nodes of the Inflection class tree have a “DL” attribute.
DL (float) – A description length DL, with DL(system) = DL(M) + DL(C) + DL(P) + DL(R)
M (float) – DL(M), the cost in bits to express the mapping between lexemes and microclasses.
C (float) – DL(C), the cost in bits to express the mapping between microclasses and clusters.
P (float) – DL(P), the cost in bits to express the relation between clusters and patterns.
R (float) – DL(R), the cost in bits to disambiguiate which pattern to use in each cluster for each microclasses.
clusters (dict[frozenset, Cluster]) – Clusters, indexed by a frozenset of microclass examplars.
patterns (dict[str, collections.Counter]) –
A dict of pairs of cells to a count of patterns to the number of clusters presenting this pattern for this cell.:
```
{ str: Counter({Pattern: int }) }
pairs of cells -> pattern -> number of clusters with this pattern for this cell
```
Note that the Counter’s length is written on a .length attribute, to avoid calling len() repeatedly. Remark that the count is not the same as in the class Cluster.
size (int) – The size of the whole system in microclasses.

__init__(microclasses, patterns, **kwargs)[source]¶

Constructor.

Parameters:

microclasses (dict[str, list]) – mapping of microclasses exemplars to microclasses inventories.
patterns (patterns.ParadigmPatterns) – patterns
kwargs – keyword arguments to be used as configuration.

attr = 'DL'¶

compute_DL(M=False)[source]¶

find_ordered_merges()[source]¶

Find the list of all best merges of two clusters.

The list is a list of tuples of length 3 containing two frozensets representing the labels of the clusters to merge and the description length of the resulting system.

initialize_clusters(patterns)[source]¶

initialize_patterns()[source]¶

merge(a, b)[source]¶

Merge two Clusters, build a Node to represent the result, update the DL.

Parameters:

a (str) – the label of a cluster to merge.
b (str) – the label of a cluster to merge.

rootnode()[source]¶: Return the root of the Inflection Class tree, if it exists.

class qumin.clustering.descriptionlength.Cluster(*args)[source]¶

Bases: object

A single cluster in MDL clustering.

A Cluster is iterable. Itering on a cluster is itering on its patterns. Clusters can be merged or separated by adding or substracting them.

Variables:

patterns (collections.defaultdict) –
For each pair of cell in the paradigms under consideration, it holds a counter of the number of microclass using each pattern in this cluster and pair of cells.:
```
{ str: Counter({Pattern: int }) }
pairs of cells -> pattern -> number of microclasses using this pattern for this cell
```
Note that the Counter’s length is written on a .length attribute, to avoid calling len() repeatedly.
labels (set) – the set of all exemplars representing the microclasses in this cluster.
size (int) – The size of this cluster. Depending on external parameters, this can be the number of microclasses or the number of lexemes belonging to the cluster.
totalsize (int) – The size of the whole system of clusters, either number of microclasses in the system, or number of lexemes in the system.
R – The cost in bits to disambiguate for each pair of cells which pattern is to be used with which microclass.
C – The contribution of this cluster to the cost of mapping from microclasses to clusters.

__init__(*args)[source]¶

Initialize single cluster.

Parameters:: args (str) – Names (exemplar) of each microclass belonging to the cluster.

init_from_paradigm(class_size, patterns, size)[source]¶

Populate fields according to a paradigm column.

This assumes an initialization with only one microclass.

Parameters:

class_size (int) – the size of the microclass
patterns (patterns.ParadigmPatterns) – patterns
size (int) – total size

qumin.clustering.descriptionlength.weighted_log(symbol_count, message_length)[source]¶

Compute \(-log_{2}(symbol_count/message_length) * message_length\).

This corresponds to the product inside the sum of the description length formula when probabilities are estimated on frequencies.

Parameters:

symbol_count (int) – a count of symbols.
message_length (int) – the size of the message.

Returns:

the weighted log

Return type:

(float)

qumin.clustering.node module¶

class qumin.clustering.node.Node(labels, children=None, **kwargs)[source]¶

Bases: object

Represent an inflection class tree or lattice.

Variables:

labels (list) – labels of all the leaves under this node.
children (list) – direct children of this node.
attributes (dict) –
attributes for this node. Currently, three attributes are expected: size (int): size of the group represented by this node. DL (float): Description length for this node. color (str): color of the splines from this node to its children, in a format usable by pyplot. Currently, red (“r”) is used when the node didn’t decrease Description length, blue (“b”) otherwise. macroclass (bool): Is the node in a macroclass ? macroclass_root (bool): Is the node the root of a macroclass ?

The attributes “_x” and “_rank” are reserved, and will be overwritten by the draw function.

__init__(labels, children=None, **kwargs)[source]¶

Node constructor.

Parameters:

labels (Iterable) – labels of all the leaves under this node.
children (list) – direct children of this node.
**kwargs – any other keyword argument will be added as node attributes. Note that certain algorithm expect the Node to have (int) “size”, (str) “color”, (bool) “macroclass”, or (float) “DL” attributes.

Note

The attributes “_x” and “_rank” are reserved, and will be overwritten by the draw function.

draw(styler=None, seed=0, interactive=False, pos=None)[source]¶

Draw the hierarchy as a pyplot graph.

By default, the Qumin layout and the DefaultNodeStyler are used. You can pass a styler and a layout explicitly too:

>>> tree = Node("A", children=[Node("B"), Node("C")]) # minimal tree (A (B ) (C ))
>>> lines, nodes = tree.draw(DefaultNodeStyler, pos=tree.get_layout("spring"))

Arguments:

styler (qumin.lattice.stylers.LatticeStyler): Styling object
for advanced plot customization.

interactive (bool):
Whether this is destined to create an interactive plot.

pos (dict):
A dictionnary of node label to x,y positions. Compatible with networkx layout functions. If absent, wil use qumin’s default layout.

seed (float):
A seed for reproducible random effects.

get_layout(layout, y_factor=1, x_factor=15, seed=0)[source]¶

Calculates positions to draw the hierarchy.

The qumin layout is ideal for trees (eg. macroclasses) or for lattices unless they become really big. The y axis represents hierarchical relations. All leaves (microclasses) are laid out in a line at the bottom.

>>> tree = Node("A", children=[Node("B"), Node("C")]) # minimal tree (A (B ) (C ))
>>> pos = tree.get_layout("qumin")

All other layouts rely on networkx.

>>> pos = tree.get_layout("spring")

You could also just use any networkx pos dictionary by doing:

>>> nx_node = tree.to_networkx(seed=42)
>>> pos = nx.drawing.spring_layout(nx_node)

In all case, the layout can then be passed to Node.draw().

Args:

layout: layout name. Available options are: qumin, dot,
spring, kamada_kawai and radial.

y_factor: vertical node spacing, default 1. Only for qumin layout. (TODO: move to styler) seed: random seed.

Returns:
a dictionary of node labels to positions.

introduces_attributes()[source]¶

Tests whether the node introduces a new attribute or not.

Returns:: Whether the node introduces a new attribute.
Return type:: bool

introduces_objects()[source]¶

Tests whether the node introduces new objects or not.

Returns:: Whether this node introduces a new object.
Return type:: bool

is_supremum()[source]¶

Tests whether the node is the supremum or not.

Returns:: Whether the node is the supremum.
Return type:: bool

leaves()[source]¶

macroclasses(parent_is_macroclass=False)[source]¶: Find all the macroclasses nodes in this tree

to_networkx(seed=0)[source]¶

tree_string()[source]¶

Return the inflection class tree as a string with parenthesis.

Assumes size, DL and color attributes, with color = “r” if this is above a macroclass.

Example

In the label, fields are separated by “#” as such:

(<labels>#<size>#<DL>#<color> )

qumin.clustering.node.string_to_node(string, legacy_annotation_name=None)[source]¶

Parse an inflection tree written as a string.

Note: Keep or remove this function. It was created to read Macroclasses written: to the disk. It is not called currently by any script, but could be useful to users who would like to manipulate macroclass computations.

Example

In the label, fields are separated by “#” as such:

(<labels>#<size>#<DL>#<color> (... ) (... ) )

Returns:: The root of the tree
Return type:: Node

Module contents¶

qumin.clustering.find_min_attribute(tree, attr)[source]¶

Find the minimum value for an attribute in a tree.

Parameters:

tree (node.Node) – The tree in which to find the minimum attribute.
attr (str) – the attribute’s key.