Use a data subset

It is sometimes useful to run computations on a subset of data, for example to have comparable data sizes across languages, or in order to test the software on a smaller scale before running a large computation.

Tip

Have a look at the CLI reference to see all available options for sampling and subsetting.

Tip

If you import previously computed patterns, Qumin will retrieve the sampling or subsets used during the previous runs.

Sampling

Enable sampling a subset of the data by passing the number of desired cells and/or lexemes:

sample_lexemes=N will use a sample of N lexemes.
sample_cells=N will use a sample of N cells.

N must be smaller than the total number of lexemes or cells in the dataset.

By default, sampling takes the N most frequent items if your dataset contains relevant frequency information (a frequency column in the forms, the lexemes, or the cells, or a frequency table). To disable this and sample randomly, ignoring frequency, pass force_random=true.
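To make the default behaviour concrete, here is a minimal sketch of what "taking the N most frequent items" means, using standard shell tools on a toy table. This is only an illustration, not Qumin's implementation; the file name lexeme-freq.csv, its column layout, and the frequency values are hypothetical.

```shell
# Toy frequency table (hypothetical file name, columns, and values),
# created here only so that the sketch is self-contained.
cat > lexeme-freq.csv <<'EOF'
lexeme_id,frequency
walk,120
run,340
move,95
saunter,3
EOF

N=2
# Skip the header, sort on the frequency column in descending numeric order,
# keep the first N rows, and print only the identifiers.
tail -n +2 lexeme-freq.csv | sort -t, -k2,2nr | head -n "$N" | cut -d, -f1
```

With this toy table and N=2, the two identifiers with the highest frequency values are kept, in descending order of frequency.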

If sampling randomly, you can make your results reproducible by passing an integer random seed, for example seed=42.

For example, the following will sample the 5 most frequent cells and lexemes (assuming the dataset has frequency information):

qumin action=pred data=some-dataset.package.json sample_lexemes=5 sample_cells=5

But the following will sample 5 random cells and lexemes:

qumin action=pred data=some-dataset.package.json sample_lexemes=5 sample_cells=5 force_random=true
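To make such a random sample reproducible, the seed option described above can be added to the same command. The sketch below only combines options named on this page; it prints the command for review rather than running it, and the dataset path is the same placeholder as in the other examples.

```shell
# Options taken from this page; some-dataset.package.json is a placeholder path.
opts="action=pred data=some-dataset.package.json sample_lexemes=5 sample_cells=5 force_random=true seed=42"

# Dry run: print the full command. Remove "echo" to actually run Qumin.
echo qumin $opts
```

Re-running with the same seed value is what should make the random sample repeatable; changing the seed should draw a different sample.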

Specifying subsets

The alternative is to specify subsets of the data to run on.

POS subset

Some datasets comprise multiple parts of speech, and you may want to run your computations on only some of them. To do so, pass a list of values to pos, for example pos=[verb] or pos=[noun,verb]. This only works if the dataset has a lexemes table that specifies which POS each lexeme belongs to.

qumin action=pred data=some-dataset.package.json pos="[noun]"

Limiting cells

To run Qumin on a selected set of cells, pass a list to cells:

qumin action=pred data=some-dataset.package.json cells="[abl.sg, gen.sg, part.sg]"

Limiting lexemes

To run Qumin on a selected set of lexemes, you need to create a file which lists the desired lexemes. This file should contain one lexeme identifier per line, and nothing else. The path to this file is then passed to lexemes. A file is used because it is common to select a few hundred lexemes, which would make passing a list directly too cumbersome.

qumin action=pred data=some-dataset.package.json lexemes=lex-subset.txt

With the following lex-subset.txt file:

walk
run
move
saunter
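When the subset is large, such a file is easier to generate than to type. As a minimal sketch, the following derives it from a lexemes table with awk. The table name lexemes.csv and its columns (lexeme_id, pos) are assumptions made for illustration, and the toy table is created here only so that the sketch runs on its own; adapt both to your dataset. The simple comma split also assumes that no field contains a quoted comma.

```shell
# Toy lexemes table (hypothetical file name and columns).
cat > lexemes.csv <<'EOF'
lexeme_id,pos
walk,verb
run,verb
house,noun
move,verb
saunter,verb
EOF

# Skip the header, keep rows whose second column is "verb",
# and write one lexeme identifier per line, with nothing else.
awk -F, 'NR > 1 && $2 == "verb" { print $1 }' lexemes.csv > lex-subset.txt

# Inspect the result before passing it to Qumin with lexemes=lex-subset.txt.
cat lex-subset.txt
```

With this toy table, the resulting lex-subset.txt has the same four lines as the file shown above.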