Write the paradigms file

This file relates phonological forms to their lexemes and paradigm cells.

Warning

The wide format, or Plat, where each paradigm is given as a single row, is no longer supported by Qumin.

The format is that of the forms table in the Paralex standard (see Paralex datasets). It minimally looks like this:

form_id	lexeme	cell	phon_form
form-peler1	peler	prs.1sg	p ɛ l
form-peler2	peler	prs.2sg	p ɛ l
form-peler3	peler	prs.3sg	p ɛ l
form-peler4	peler	prs.1pl	p ə l ɔ̃
form-peler5	peler	prs.2pl	p ə l e
form-peler6	peler	prs.3pl	p ɛ l
form-peler7	peler	ipfv.1sg	p ə l E
form-peler8	peler	ipfv.2sg	p ə l E
form-peler9	peler	ipfv.3sg	p ə l E
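
As a rough illustration of this long format, the sketch below reads a tab-separated forms table with Python's standard csv module and regroups it by lexeme and cell. The inline data repeats a few rows from the example above. The names `FORMS_TSV` and `read_paradigms` are made up for this example, and this is not how Qumin itself loads the file.

```python
import csv
import io
from collections import defaultdict

# A few rows from the example above, tab-separated as in a Paralex forms table.
FORMS_TSV = (
    "form_id\tlexeme\tcell\tphon_form\n"
    "form-peler1\tpeler\tprs.1sg\tp ɛ l\n"
    "form-peler4\tpeler\tprs.1pl\tp ə l ɔ̃\n"
    "form-peler5\tpeler\tprs.2pl\tp ə l e\n"
)


def read_paradigms(text):
    """Group forms by lexeme, then by cell (hypothetical helper).

    Each form is stored as a list of space-separated segments. A cell maps
    to a list of forms, since a cell may have more than one row.
    """
    paradigms = defaultdict(dict)
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    for row in reader:
        segments = row["phon_form"].split(" ")
        paradigms[row["lexeme"]].setdefault(row["cell"], []).append(segments)
    return dict(paradigms)


paradigms = read_paradigms(FORMS_TSV)
print(paradigms["peler"]["prs.1pl"])
```

Keeping one row per form, rather than one row per lexeme, is what allows a cell to have several forms or none at all.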

Transcriptions

Qumin assumes phonemic or phonetic transcriptions (rather than orthographic forms), in which each phoneme is separated by spaces. Non-segmented strings (with no spaces) are no longer supported, as they can lead to ambiguities.

For Qumin, each space-separated sequence in the forms must correspond to a distinct row in the sounds table. This is not always the case with Paralex datasets, which are more tier-aware and can accept forms like (1-3) with a sounds table that has just two separate rows for stress and length:

b ˈa b a
b a b aː
b ˈaː b a

Qumin is not (yet) tier-aware and cannot read these forms as is. You will get an error like:

ValueError: Your paradigm has unknown segments:
  [ˈa] (in 1 forms: b ˈa b a)
  [aː] (in 1 forms: b a b aː)
  [ˈaː] (in 1 forms: b ˈaː b a)

There are two solutions:

The first is to re-compute the segmentation to obtain forms like (4-6). Qumin will do this automatically if you pass resegment=True, provided that your suprasegmentals are defined in terms of distinctive features.

b ˈ a b a
b a b a ː
b ˈ a ː b a

The second solution is to add to the sounds table every combination of segment and suprasegmental that occurs in the forms (here, rows for [ˈa], [aː] and [ˈaː]).
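
To make the problem and the first solution concrete, here is a minimal sketch of both steps: finding tokens that are missing from the sounds inventory, and splitting suprasegmentals into tokens of their own. It is an assumption-laden illustration, not Qumin's implementation: the inventories are hard-coded, the function names are hypothetical, and suprasegmentals are recognised by a plain character lookup rather than by distinctive features.

```python
# Hypothetical inventories; in practice they would come from the sounds table.
SOUNDS = {"a", "b", "ˈ", "ː"}
SUPRASEGMENTALS = {"ˈ", "ː"}


def unknown_segments(form, sounds):
    """Return the space-separated tokens of `form` that are not in `sounds`."""
    return [token for token in form.split() if token not in sounds]


def resegment(form, suprasegmentals):
    """Split every suprasegmental character off into its own token.

    Other characters, including combining diacritics, stay attached to
    the segment they belong to.
    """
    out = []
    for token in form.split():
        buffer = ""
        for char in token:
            if char in suprasegmentals:
                if buffer:
                    out.append(buffer)
                    buffer = ""
                out.append(char)
            else:
                buffer += char
        if buffer:
            out.append(buffer)
    return " ".join(out)


for form in ["b ˈa b a", "b a b aː", "b ˈaː b a"]:
    fixed = resegment(form, SUPRASEGMENTALS)
    print(form, unknown_segments(form, SOUNDS), "->", fixed, unknown_segments(fixed, SOUNDS))
```

With the second solution, no resegmentation is needed: adding ˈa, aː and ˈaː to `SOUNDS` would make `unknown_segments` return an empty list for the original forms.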

Overabundance

Inflectional paradigms sometimes contain overabundant forms, where the same lexeme and paradigm cell can be realized in several ways, as in “dreamed” vs “dreamt” for the English past of “to dream”. Only some scripts can make use of this information; the other scripts will use the first value only.

In Paralex, overabundant forms give rise to two (or more) distinct rows, keyed by the same lexeme and cell. Example:

lexeme	cell	form
bind	ppart	b aˑɪ n d
bind	ppart	b aˑʊ n d
wind(air)	ppart	w aˑʊ n d
wind(air)	ppart	w aˑɪ n d ɪ d
weave	ppart	w əˑʊ v n̩
weave	ppart	w iː v d
slink	ppart	s l ʌ ŋ k
slink	ppart	s l æ ŋ k
slink	ppart	s l ɪ ŋ k t

It is possible to pass an ordered list of tags to prefer when selecting overabundant forms.

/$ qumin pats.defective=True pats.overabundant.tags="[standard_variant,ayer_aye]" data=<dataset.package.json>

If no tags are given, or if no tags are available for a specific lexeme, Qumin will fall back on frequencies to pick a form to keep. If there are no frequencies, or if pats.overabundant.freq is set to False, Qumin will pick the first of the overabundant rows.
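
The order of these fallbacks can be sketched as a small function. This is only an illustration of the logic described above, not Qumin's code: the function name, the row layout (dicts with "form", "tags" and "freq" keys) and the tag and frequency values are all invented for the example, and the sketch assumes that a row without a frequency is simply skipped at the frequency step.

```python
def pick_form(rows, preferred_tags=(), use_freq=True):
    """Pick one row among overabundant rows of the same lexeme and cell.

    Hypothetical row layout: {"form": str, "tags": set, "freq": number or None}.
    """
    # 1. Tags, in the order of preference given by the user.
    for tag in preferred_tags:
        for row in rows:
            if tag in row.get("tags", ()):
                return row
    # 2. Frequencies, if enabled and available for at least one row.
    if use_freq:
        with_freq = [row for row in rows if row.get("freq") is not None]
        if with_freq:
            return max(with_freq, key=lambda row: row["freq"])
    # 3. Otherwise, the first row.
    return rows[0]


# Forms from the table above; the tags and frequencies are made up.
rows = [
    {"form": "s l ʌ ŋ k", "tags": set(), "freq": 3},
    {"form": "s l æ ŋ k", "tags": set(), "freq": 10},
    {"form": "s l ɪ ŋ k t", "tags": {"standard_variant"}, "freq": 1},
]

print(pick_form(rows, ["standard_variant", "ayer_aye"])["form"])
print(pick_form(rows)["form"])
print(pick_form(rows, use_freq=False)["form"])
```

The three calls exercise the three steps in turn: a matching tag, the highest frequency, and the first row.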

Defectivity

Conversely, some lexemes are defective for some cells and have no form at all for these cells. Following Paralex, Qumin expects these missing values to be written “#DEF#”. Note that some scripts ignore all rows with defective values.

Example:

lexeme	form	cell
advenir	#DEF#	prs.1sg
advenir	#DEF#	prs.2sg
advenir	a d v j ɛ̃	prs.3sg
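
For scripts that ignore defective rows, the filtering amounts to something like the sketch below. It is a simplified illustration with a hypothetical helper name and rows hard-coded from the example above; Qumin's own handling may differ.

```python
DEFECTIVE = "#DEF#"

# Rows from the example above.
rows = [
    {"lexeme": "advenir", "form": "#DEF#", "cell": "prs.1sg"},
    {"lexeme": "advenir", "form": "#DEF#", "cell": "prs.2sg"},
    {"lexeme": "advenir", "form": "a d v j ɛ̃", "cell": "prs.3sg"},
]


def split_defective(rows):
    """Separate attested rows from the (lexeme, cell) pairs marked as defective."""
    attested = [row for row in rows if row["form"] != DEFECTIVE]
    defective = [(row["lexeme"], row["cell"]) for row in rows if row["form"] == DEFECTIVE]
    return attested, defective


attested, defective = split_defective(rows)
print(len(attested), defective)
```

Writing the marker explicitly, instead of leaving the rows out, distinguishes a cell that is known to be defective from a cell that is simply undocumented.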