How to write the paradigms file?
This file relates phonological forms to their lexemes and paradigm cells.
Warning
The wide format, or Plat, where paradigms are given in rows, is no longer supported by Qumin.
The format is that of the forms table in the Paralex standard (see Paralex datasets). It minimally looks like this:
| form_id | lexeme | cell | phon_form |
|---|---|---|---|
| form-peler1 | peler | prs.1sg | p ɛ l |
| form-peler2 | peler | prs.2sg | p ɛ l |
| form-peler3 | peler | prs.3sg | p ɛ l |
| form-peler4 | peler | prs.1pl | p ə l ɔ̃ |
| form-peler5 | peler | prs.2pl | p ə l e |
| form-peler6 | peler | prs.3pl | p ɛ l |
| form-peler7 | peler | ipfv.1sg | p ə l E |
| form-peler8 | peler | ipfv.2sg | p ə l E |
| form-peler9 | peler | ipfv.3sg | p ə l E |
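As a rough illustration of the minimal layout above, the following sketch reads a forms table and checks that the four columns are present. It assumes the table is stored as comma-separated CSV; the `read_forms` helper and the inline sample are hypothetical and are not part of Qumin.

```python
import csv
import io

# Columns shown in the minimal example above.
REQUIRED_COLUMNS = {"form_id", "lexeme", "cell", "phon_form"}

# Assumption: a comma-separated forms table, inlined here for illustration.
SAMPLE = """form_id,lexeme,cell,phon_form
form-peler1,peler,prs.1sg,p ɛ l
form-peler4,peler,prs.1pl,p ə l ɔ̃
form-peler7,peler,ipfv.1sg,p ə l E
"""


def read_forms(handle):
    """Hypothetical helper: return the rows, or fail if a column is missing."""
    reader = csv.DictReader(handle)
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"Missing columns: {sorted(missing)}")
    return list(reader)


rows = read_forms(io.StringIO(SAMPLE))

# Each phon_form is a space-separated sequence of segments.
segments = [row["phon_form"].split() for row in rows]
for row, segs in zip(rows, segments):
    print(row["lexeme"], row["cell"], segs)
```

In practice the same check could be run on the real file by passing an open file handle instead of the `io.StringIO` wrapper.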
Transcriptions
Qumin assumes transcribed phonemic or phonetic forms (rather than orthographic), where each phoneme is ideally separated by spaces. Using non-segmented strings (with no spaces) is no longer supported, as it can lead to ambiguities.
For Qumin, each space-separated sequence in a form must correspond to its own row in the sounds table. This is not always the case with Paralex datasets, which are more tier-aware and can accept forms like (1-3) with a sounds table that has just two separate rows for stress and length:
(1) b ˈa b a
(2) b a b aː
(3) b ˈaː b a
Qumin is not (yet) tier-aware and cannot read these forms as is. You will get an error like:
ValueError: Your paradigm has unknown segments:
[ˈa] (in 1 forms: b ˈa b a)
[aː] (in 1 forms: b a b aː)
[ˈaː] (in 1 forms: b ˈaː b a)
There are two solutions:
The first is to re-compute the segmentation to obtain forms like (4-6). Qumin will do this automatically if you pass resegment=True, provided that your suprasegmentals are defined in terms of distinctive features.
(4) b ˈ a b a
(5) b a b a ː
(6) b ˈ a ː b a
The second solution is to add to the sounds table every combination of segment and suprasegmental that occurs in the forms (here, rows for [ˈa], [aː] and [ˈaː]).
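To make the first solution concrete, here is a minimal sketch of what re-segmentation does to the example forms. It is not Qumin's implementation: Qumin relies on the distinctive features of the suprasegmentals, whereas this sketch simply splits on a hard-coded set of marks. The names `KNOWN_SOUNDS`, `SUPRASEGMENTALS`, `resegment` and `unknown_segments` are hypothetical.

```python
# Assumption: the sounds table only has rows for these segments and marks.
KNOWN_SOUNDS = {"b", "a", "ˈ", "ː"}
# Assumption: stress and length are the only suprasegmentals in the data.
SUPRASEGMENTALS = {"ˈ", "ː"}


def resegment(form, marks=SUPRASEGMENTALS):
    """Put every suprasegmental mark in its own space-separated slot."""
    out = []
    for token in form.split():
        current = ""
        for char in token:
            if char in marks:
                if current:
                    out.append(current)
                    current = ""
                out.append(char)
            else:
                # Other characters (including combining diacritics) stay together.
                current += char
        if current:
            out.append(current)
    return " ".join(out)


def unknown_segments(forms, known=KNOWN_SOUNDS):
    """Collect the space-separated sequences that have no row in the sounds table."""
    return {seg for form in forms for seg in form.split() if seg not in known}


forms = ["b ˈa b a", "b a b aː", "b ˈaː b a"]
resegmented = [resegment(form) for form in forms]

print(unknown_segments(forms))        # the three fused segments
print(resegmented)
print(unknown_segments(resegmented))  # empty once marks are split off
```

The second solution corresponds to the opposite move: leaving the forms untouched and adding the fused segments to `KNOWN_SOUNDS`, that is, to the sounds table.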
Overabundance
Inflectional paradigms sometimes have overabundant forms, where the same lexeme and paradigm cell can be realized in several ways, as in “dreamed” vs “dreamt” for the English past of “to dream”. Only some scripts can make use of this information; the other scripts use the first value only.
In Paralex, overabundant forms give rise to two (or more) distinct rows, keyed by the same lexeme and cell. Example:
| lexeme | cell | form |
|---|---|---|
| bind | ppart | b aˑɪ n d |
| bind | ppart | b aˑʊ n d |
| wind(air) | ppart | w aˑʊ n d |
| wind(air) | ppart | w aˑɪ n d ɪ d |
| weave | ppart | w əˑʊ v n̩ |
| weave | ppart | w iːv d |
| slink | ppart | s l ʌ ŋ k |
| slink | ppart | s l æ ŋ k |
| slink | ppart | s l ɪ ŋ k t |
It is possible to pass an ordered list of tags to prefer when selecting overabundant forms.
/$ qumin pats.defective=True pats.overabundant.tags="[standard_variant,ayer_aye]" data=<dataset.package.json>
If no tags are given, or if no tags are available for a specific lexeme, Qumin will fall back on frequencies to pick the form to keep. If there are no frequencies, or if pats.overabundant.freq is set to False, Qumin will pick the first of the overabundant rows.
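The fallback order described above (preferred tags, then frequencies, then the first row) can be sketched as follows. This is an illustration only, not Qumin's code: the `pick_form` function, the `tags` and `frequency` keys, and the frequency values are all hypothetical, and the sketch assumes that falling back on frequencies means keeping the most frequent form.

```python
def pick_form(rows, preferred_tags=(), use_freq=True):
    """Hypothetical selection among overabundant rows for one lexeme and cell."""
    # 1. Preferred tags, in the order given by the user.
    for tag in preferred_tags:
        for row in rows:
            if tag in row.get("tags", ()):
                return row
    # 2. Frequencies, if enabled and available (assumption: highest wins).
    if use_freq:
        with_freq = [row for row in rows if row.get("frequency") is not None]
        if with_freq:
            return max(with_freq, key=lambda row: row["frequency"])
    # 3. Otherwise, the first row.
    return rows[0]


# Made-up tags and frequencies, for illustration only.
slink = [
    {"form": "s l ʌ ŋ k", "tags": ["standard_variant"], "frequency": 10},
    {"form": "s l æ ŋ k", "tags": [], "frequency": 30},
    {"form": "s l ɪ ŋ k t", "tags": [], "frequency": None},
]

by_tag = pick_form(slink, preferred_tags=["standard_variant"])
by_freq = pick_form(slink)
first_row = pick_form(slink, use_freq=False)

print(by_tag["form"], "|", by_freq["form"], "|", first_row["form"])
```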
Defectivity
Conversely, some lexemes may be defective for some cells, and have no form at all for these cells. Following Paralex, Qumin expects these missing values to be written “#DEF#”. Note that some scripts ignore all rows with defective values.
Example:
| lexeme | form | cell |
|---|---|---|
| advenir | #DEF# | prs.1sg |
| advenir | #DEF# | prs.2sg |
| advenir | a d v j ɛ̃ | prs.3sg |
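For scripts that ignore defective rows, the filtering amounts to something like the sketch below. It reuses the rows of the example above as plain dictionaries; the variable names are hypothetical and the code is not taken from Qumin.

```python
DEFECTIVE = "#DEF#"  # marker for a missing form, following Paralex

rows = [
    {"lexeme": "advenir", "form": "#DEF#", "cell": "prs.1sg"},
    {"lexeme": "advenir", "form": "#DEF#", "cell": "prs.2sg"},
    {"lexeme": "advenir", "form": "a d v j ɛ̃", "cell": "prs.3sg"},
]

# Rows that carry an actual form.
attested = [row for row in rows if row["form"] != DEFECTIVE]

# (lexeme, cell) pairs for which the lexeme is defective.
defective_cells = {
    (row["lexeme"], row["cell"]) for row in rows if row["form"] == DEFECTIVE
}

print(attested)
print(sorted(defective_cells))
```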