Non-standard characters in IPA
Fix non-standard characters appearing in IPA section of entries (see full list in #56).
Here are the rewriting rules I use to handle most of the non-standard symbols (here and elsewhere):
alternatives = {
    "\u02FA": "\u031A",  # end high tone instead of combining left angle above
    "\u0067": "\u0261",  # normal g instead of script g
    "\uF25F": "\u2C71",  # v with right hook in certain fonts
    "\u007C\u007C": "\u2016",  # vertical line twice instead of double vertical line
    "\u003A": "\u02D0",  # colon instead of modifier triangular colon
    "\u0021": "\u01C3",  # exclamation mark instead of retroflex click
    "\u025A": "\u0259\u02DE",  # schwa with hook instead of schwa + rhotic hook
    "\u025D": "\u025C\u02DE",  # reversed open e with hook instead of reversed open e + rhotic hook
    "\u02A3": "\u0064\u0361\u007A",  # affricate
    "\u02A4": "\u0064\u0361\u0292",  # affricate
    "\u02A5": "\u0064\u0361\u0291",  # affricate
    "\u02A6": "\u0074\u0361\u0073",  # affricate
    "\u02A7": "\u0074\u0361\u0283",  # affricate
    "\u02A8": "\u0074\u0361\u0255",  # affricate
    "\u03B5": "\u025B",  # epsilon instead of open e
    "\u01DD": "\u0259",  # turned e instead of schwa
    "\u026B": "\u006C\u0334",  # l with middle tilde instead of l + combining tilde overlay
    "\u200D": "\u035C",  # used in en_UK.txt for tied phonemes, it seems
    "\u0020": "",  # blank space for word separation, might be kept
    "\u00B2": "",  # comes before most sv.txt pronunciations, seems to indicate nothing
    "\u2040": "\u203F",  # used once in de.txt for linking
    "\u0030\u0072": "\u0072\u0325",  # 0 is used in is.txt instead of voiceless diacritic
    "\u0030": "\u0072\u0325",  # þverfaglegt is missing the r in is.txt
    "\u0023": "\u002E",  # seems to be used for non-diphthong sounds in is.txt, replaced by full stop
    "\u0027": "\u02C8",  # apostrophe instead of primary stress
    "\u030D": "\u0329",  # vertical line above instead of vertical line below, normal for some letters, might be kept
    "\u2193": "\uA71C",  # downwards arrow instead of raised down arrow
    "\u2191": "\uA71B",  # upwards arrow instead of raised up arrow
    "\u1EA1": "\u0061",  # Naz in de.txt
    "\u005F": "\u0063",  # _ instead of c in is.txt
    "\u002D": "",  # word separator in multiple languages, might be kept
    "\u007E": "",  # framtíðarhorfur in is.txt
    "\u2014": "",  # in der Pipeline in de.txt
    "\u003F": "",  # before some pronunciations in sv.txt + เญียง in tts.txt, seems to indicate nothing
    "\u0311": "\u032F",  # combining inverted breve above instead of combining inverted breve below, normal for some letters, might be kept
    "\u02B1": "\u0324",  # modifier small h with hook for breathy voiced instead of combining diaeresis below
    "\u02C0": "\u0330",  # modifier glottal stop for creaky voiced instead of combining tilde below
    "\u0348": "",  # non-IPA, used in ko.txt for tensed consonants/faucalized voice, may still need to be kept?
    "\u1d50\u0253": "\u006D\u0361\u0253",  # prenasalization
    "\u1d50\u0076": "\u006D\u0361\u0076",  # prenasalization
    "\u1D51\u0261": "\u014B\u0361\u0261",  # prenasalization
}
This handles most of the errors, except for ja.txt and fa.txt, which contain Japanese/Arabic symbols that I don't know how to remove.
Apart from these rules, there are some parentheses and the ʶ (\u02B6), which occurs in nor ni and nu wor in de.txt, that I don't know how to handle.
Alongside each rule I added a comment on what I saw when inspecting the data, sometimes with reservations (e.g. as the International Phonetic Association says, "Some diacritics may be placed above a symbol with a descender."). These rewriting rules might not always apply here (some are safeguards for data from other sources).
Also, some symbols are used that cannot be translated into IPA (faucalized voice in Korean, dashes as word separators, etc.). I am not sure it is advisable to delete them, although issuing a warning somewhere when that happens could be nice.
Also, a link to the official IPA chart. Might come in handy, this chart also contains Unicode symbols.
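For illustration, here is a minimal sketch of how such a table could be applied. It uses only a small subset of the rules above, scans the string once and tries longer source sequences first (so that "0r" wins over "0"), and emits a warning whenever a rule deletes a symbol. The names `RULES` and `apply_rules` are hypothetical, not taken from any existing script.

```python
import warnings

# Illustrative subset of the rewriting rules; the full table would be used in practice.
RULES = {
    "\u007C\u007C": "\u2016",        # || -> double vertical line
    "\u003A": "\u02D0",              # colon -> triangular colon
    "\u0027": "\u02C8",              # apostrophe -> primary stress
    "\u0030\u0072": "\u0072\u0325",  # "0r" -> r + voiceless diacritic
    "\u0030": "\u0072\u0325",        # lone "0" -> r + voiceless diacritic
    "\u0348": "",                    # deletion rule (triggers a warning)
}


def apply_rules(text, rules=RULES):
    """Rewrite text in a single left-to-right pass, longest match first."""
    keys = sorted(rules, key=len, reverse=True)
    out = []
    i = 0
    while i < len(text):
        for key in keys:
            if text.startswith(key, i):
                if rules[key] == "":
                    # Deleting a symbol loses information, so report it.
                    warnings.warn(f"deleting {key!r} at index {i} of {text!r}")
                out.append(rules[key])
                i += len(key)
                break
        else:
            # No rule matched at this position: keep the character as is.
            out.append(text[i])
            i += 1
    return "".join(out)
```

Scanning once, rather than calling `str.replace` for each rule in turn, means the output of one rule is never rewritten by another, so the result does not depend on the order of the table.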
Thanks @RobinSobczyk! I have gone through the original list you provided in #56 and fixed all of the characters that appear to not be IPA. A lot of them were simply errors in the source data --- for example the 0 included in is.txt which obviously was meant to be an underring diacritic indicating voicelessness, or the ASCII apostrophe used in many places instead of the IPA primary stress marker.
Other examples included brackets or forward slashes intended to indicate alternative pronunciations which have now been reformatted accordingly, and various junk characters that found their way into the data at some point, likely during conversion, extraction, or processing.
A few items that you listed have not been changed as they seem to be valid IPA characters in their context. These are listed below:
- ᵝ, hex code : 0x1d5d
  - Used only in the Japanese dictionary. The superscript form here specifically indicates compressed vowel roundedness.
- ᵐ, hex code : 0x1d50
  - Used only in the Swahili dictionary. Another superscript form, this time indicating a bilabial nasal consonant. Included in extIPA character set. We could arguably consider replacing this with plain m since this is intended to be a phonemic rather than phonetic representation.
- ̍, hex code : 0x30d
  - Occurrences are all in German file. This is the Combining Vertical Line Above, and "marks syllabicity on a letter with a descender, such as ⟨ŋ̍⟩".
- ⁀, hex code : 0x2040
  - Only in de.txt. This is a tie bar, a ligature used in IPA notation to indicate double articulation among other things. It is only used in one entry in the dictionary, but does not appear to be incorrect or non-standard.
- ͈, hex code : 0x348
  - This is used only in the Korean data. It is a Combining double vertical line below, "used to denote the tensed consonants /p͈/, /t͈/, /k͈/, /t͈ɕ/, /s͈/" and "used in literature in the context of Korean phonology for faucalized voice". Also included in extIPA. As with Swahili, would consider replacing or removing this with a more phonemic representation, but this can wait for subject matter experts to weigh in.
- ̑, hex code : 0x311
  - Only used in German file and only together with y. This is an inverted breve, placed above the letter rather than below. It indicates that y is non-syllabic (in other words, a semivowel). Will leave it up to German linguists to decide whether this is a sufficiently important distinction to preserve.
@RobinSobczyk before closing this issue, could you run the current version of the data through your script again and see if there are any errors that were missed (aside from the exceptions listed above)?
I still find the following non-IPA symbols after applying NFD decomposition to the Unicode symbols (most of them are handled by the rewriting rules I gave):
- #, hex code : 0x23
- ², hex code : 0xb2
- ᵐ, hex code : 0x1d50 -> already discussed
- ̍, hex code : 0x30d -> already discussed
- ˀ, hex code : 0x2c0
- ɚ, hex code : 0x25a
- ɝ, hex code : 0x25d
- ǧ, hex code : 0x1e7
- g, hex code : 0x67
- ĝ, hex code : 0x11d
- ᵝ, hex code : 0x1d5d -> already discussed
- ʶ, hex code : 0x2b6
- :, hex code : 0x3a
- ⁀, hex code : 0x2040 -> already discussed
- ͈, hex code : 0x348 -> already discussed
- ğ, hex code : 0x11f
- ɫ, hex code : 0x26b
- ̑, hex code : 0x311 -> already discussed
- ʱ, hex code : 0x2b1
- -, hex code : 0x2d
- ` ` (space), hex code : 0x20
- ʤ, hex code : 0x2a4
I'd like to add that I think ⁀ (\u2040) is a linking symbol more than a tie, as it sits between two words. Hence, changing it to the regular linking symbol would be nice (the IPA provides ties above and below, but linking only below, from what I can see).
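As a quick check of what these code points are, one could look up their official Unicode names with the standard `unicodedata` module. This is only a sketch: the helper name `code_point_names` is made up for the example, and Unicode character names do not always match IPA terminology.

```python
import unicodedata


def code_point_names(chars):
    """Map each character to its official Unicode name, keyed by code point."""
    return {f"U+{ord(ch):04X}": unicodedata.name(ch) for ch in chars}


# Tie found in de.txt, the undertie, and the two combining ties used in the rules.
for code_point, name in code_point_names("\u2040\u203F\u0361\u035C").items():
    print(code_point, name)
```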
Also, letters with g might be flagged because I excluded the classic g from admitted symbols. Usually, those with accents would also be caught by NFD + rewriting rules.
I can also run the script without NFD if you want, to see which symbols might have to be replaced by their unicode decomposition.
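To make the NFD step concrete, here is a minimal sketch of the kind of check described above. The allow-list `ALLOWED` is a deliberately tiny, hypothetical stand-in for a full IPA symbol set (ASCII g is left out on purpose), and `find_non_ipa` and `describe` are illustrative names, not the actual script.

```python
import unicodedata

# Hypothetical, incomplete allow-list: ASCII letters except g, a few IPA
# letters and suprasegmentals, and a few combining diacritics.
ALLOWED = (
    set("abcdefhijklmnopqrstuvwxyz")
    | set("\u0261\u0259\u025C\u0292\u0283\u014B\u02D0\u02C8\u02CC.")
    | set("\u0329\u032F\u0361\u02DE")
)


def find_non_ipa(text, allowed=ALLOWED):
    """Return the sorted symbols of the NFD-decomposed text that are not allowed."""
    decomposed = unicodedata.normalize("NFD", text)
    return sorted({ch for ch in decomposed if ch not in allowed})


def describe(ch):
    """Format a symbol the way the lists in this thread do."""
    return f"{ch}, hex code : {hex(ord(ch))}"


# NFD splits precomposed letters, so U+01E7 is reported as g plus a combining caron.
for symbol in find_non_ipa("\u01E7a"):
    print(describe(symbol))
```

Under this sketch an accented g is flagged twice, once for the base letter and once for the accent. Letters without a canonical decomposition, such as ɚ (U+025A), pass through NFD unchanged and are reported as they are.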
Hi, can I help in any way to solve and close this issue? For example by providing a (Python) script to apply the rewriting rules, or anything else?
@RobinSobczyk I have gone through the list you provided and made some changes -- further details for each point are below:
- [x] #, hex code : 0x23
  - all in is.txt, seem to be secondary stress markers (now removed)
- [ ] ², hex code : 0xb2
  - all in sv.txt, represents tone pitch pattern 2
- [ ] ᵐ, hex code : 0x1d50 (already discussed)
  - all in sw.txt, represents prenasalized labial consonant
- [ ] ̍, hex code : 0x30d (already discussed)
  - all in de.txt, marks syllabicity on a letter with a descender (e.g., ⟨ŋ̍⟩)
- [ ] ˀ, hex code : 0x2c0
  - all in vi_C.txt and vi_N.txt, represents glottalized consonant
- [ ] ɚ, hex code : 0x25a
  - appears in zh.txt and de.txt, represents R-colored vowel
- [ ] ɝ, hex code : 0x25d
  - appears only in en_US, represents R-colored vowel
- [ ] ǧ, hex code : 0x1e7
  - not found in any file
- [ ] g, hex code : 0x67
  - loop-tail g, acceptable graphic variant of open-tail ɡ in IPA
- [ ] ĝ, hex code : 0x11d
  - only appears in fa.txt, should perhaps be replaced by voiced uvular stop ɢ, but this should be confirmed by someone with expertise in Persian phonology before changing
- [ ] ᵝ, hex code : 0x1d5d (already discussed)
  - all in ja.txt, represents compressed unrounded vowel
- [ ] ʶ, hex code : 0x2b6
  - only found in de.txt, represents uvularization
- [ ] :, hex code : 0x3a
  - present in many files; in all cases this is standing in for the IPA triangular colon ː representing a long vowel or geminated consonant
  - in principle, these could be search and replaced with ː, but as with the loop-tail g above, this may or may not be necessary and makes the files a little easier to work with (however, comment from others using the data files is welcomed)
- [ ] ⁀, hex code : 0x2040 (already discussed)
- [ ] ͈, hex code : 0x348 (already discussed)
  - all in ko.txt, represents faucalized voice in Korean
- [x] ğ, hex code : 0x11f
  - only one instance, in fa.txt -- replaced with ɢ (voiced uvular plosive)
- [ ] ɫ, hex code : 0x26b
  - used in de.txt and en_US.txt, represents voiced velarized alveolar approximant
- [ ] ̑, hex code : 0x311 (already discussed)
  - all in de.txt, this is an inverted breve used to represent non-syllabic vowels
- [ ] ʱ, hex code : 0x2b1
  - all in or.txt, represents voiced aspirated consonants in Odia
- [x] -, hex code : 0x2d
  - used in a number of files, mostly to separate syllables
  - some could be deleted easily such as ma.txt
  - others could be removed by search and replace such as vi-*.txt, but the syllable separation seems to be useful and losing it would be less useful
  - also present in de.txt where it's a bit trickier to remove -- these will have to be corrected one entry at a time
  - used for a number of purposes in de.txt:
    - to indicate consonant gemination (this has now been replaced with ː after the initial consonant)
    - to indicate word boundaries even if the original orthography did not include a hyphen (for example Jiu Jitsu) -- these have now been removed
    - to indicate initial identical part of a pronunciation when there are two or more in total (this has now been fixed)
    - replication of the usage in the headword (for example, when headword is a prefix) -- these have now been fixed
- [ ] ` ` (space), hex code : 0x20
  - not found in any file
- [x] ʤ, hex code : 0x2a4
  - deprecated digraph representing various kinds of voiced alveolar affricate
  - has been replaced by d͡ʒ but is still widely used since the tie bar is annoying in practice
  - appears mostly in nl.txt (many examples) and de.txt (just once)
  - these have now been replaced in all files
As you can see from the list above, it seems that your script may be flagging some characters that are in fact acceptable in IPA, or useful for users of the data. As noted, these have not been changed.
Hi! Thanks for the update. I updated my code so that it also reports the files in which each symbol is found. I grouped the symbols according to what I think about them:
IPA standardizations that seem straightforward to me:
- g, hex code : 0x67 ['es_ES.txt', 'fr_QC.txt', 'sv.txt', 'ma.txt', 'fr_FR.txt', 'jam.txt', 'eo.txt', 'or.txt', 'tts.txt', 'es_MX.txt', 'nb.txt', 'sw.txt', 'ja.txt', 'pt_BR.txt', 'fi.txt', 'ro.txt', 'km.txt']
  - I know the loop-tail g is an acceptable variant, but I think it would be nice to make it uniform across the files when the IPA defines a standard
- :, hex code : 0x3a ['jam.txt', 'yue.txt', 'tts.txt']
  - same as for g: making everything uniform and following the IPA as closely as possible might be nice
- ɚ, hex code : 0x25a ['zh_hans.txt', 'zh_hant.txt', 'de.txt']
  - can be rewritten with ə and ˞ to follow IPA
- ɝ, hex code : 0x25d ['en_US.txt']
  - can be rewritten with ɜ and ˞ to follow IPA
- ɫ, hex code : 0x26b ['en_US.txt', 'de.txt']
  - can be rewritten with l and ̴ (the overlay tilde, which corresponds to velarized)
- ⁀, hex code : 0x2040 ['de.txt']
  - can be replaced with undertie to match IPA linking symbol
Questions:
- ², hex code : 0xb2 ['sv.txt']
  - can it be written with the tone symbols from the IPA?
- -, hex code : 0x2d ['vi_C.txt', 'ma.txt', 'vi_S.txt', 'or.txt', 'ko.txt', 'vi_N.txt', 'sw.txt']
  - couldn't it be replaced by ., since that is what the IPA provides for syllable breaks?
- ʱ, hex code : 0x2b1 ['or.txt']
  - couldn't voiced aspirated be written by combining the IPA symbols ̬ for voiced and ʰ for aspirated?
- ̑, hex code : 0x311 ['de.txt']
  - to me it seems that it is always used on y, so it might be a false positive because y is just a letter with a descender?
Might not be solvable (at least with current IPA):
- ˀ, hex code : 0x2c0 ['vi_C.txt', 'vi_N.txt']
  - might not be expressible otherwise
- ᵐ, hex code : 0x1d50 ['sw.txt']
  - might not be expressible otherwise
- ᵝ, hex code : 0x1d5d ['ja.txt']
  - might not be expressible otherwise
- ʶ, hex code : 0x2b6 ['de.txt']
  - might not be expressible otherwise
- ͈, hex code : 0x348 ['ko.txt']
  - no possible alternative
- ĝ, hex code : 0x11d ['fa.txt']
  - requires more expertise
False positives:
- ` ` (space), hex code : 0x20 ['ko.txt', 'fi.txt', 'vi_C.txt', 'fr_QC.txt', 'jam.txt', 'zh_hans.txt', 'is.txt', 'sv.txt', 'km.txt', 'zh_hant.txt', 'de.txt', 'en_UK.txt', 'fr_FR.txt', 'eo.txt', 'nb.txt', 'yue.txt', 'vi_S.txt', 'vi_N.txt', 'ja.txt', 'fa.txt']
  - really just a space, I think it is flagged because of multi-word pronunciations, so let's ignore it
- ̍, hex code : 0x30d ['de.txt']
  - used when placed with descenders, which is IPA compliant
In the end, I think that staying as close as possible to the IPA is important so that the dataset can be used easily through code. I also think it would be worth keeping a note somewhere about the non-IPA symbols that were kept and what they mean, so that people like me without much pronunciation knowledge don't wonder why those symbols are there. The README could be the perfect place, to be honest. That way it is also easy to keep track of them if the IPA one day evolves to include them, potentially with different symbols.
Thank you again for your work and those updates!
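As a rough sketch of both ideas (reporting the files in which each symbol occurs, and keeping a documented list of the symbols that were kept), something like the following could work. Everything here is an assumption made for illustration: the data is passed in as an in-memory dict of pronunciations per file rather than parsed from the real files, and `KEPT_NOTES`, `symbols_by_file` and `format_report` are hypothetical names.

```python
import unicodedata
from collections import defaultdict

# Hypothetical notes for symbols that are deliberately kept; wording follows this thread.
KEPT_NOTES = {
    "\u0348": "tensed consonants / faucalized voice in Korean (extIPA)",
    "\u1D50": "prenasalization in Swahili",
    "\u1D5D": "compressed vowel roundedness in Japanese",
}


def symbols_by_file(pronunciations_by_file, allowed):
    """Map each symbol outside the allowed set to the sorted files it occurs in."""
    report = defaultdict(set)
    for filename, pronunciations in pronunciations_by_file.items():
        for pronunciation in pronunciations:
            for ch in unicodedata.normalize("NFD", pronunciation):
                if ch not in allowed:
                    report[ch].add(filename)
    return {ch: sorted(files) for ch, files in report.items()}


def format_report(report, notes=KEPT_NOTES):
    """Render one line per symbol, with its note if it is a documented exception."""
    lines = []
    for ch in sorted(report):
        note = notes.get(ch, "undocumented")
        lines.append(f"{ch}, hex code : {hex(ord(ch))} {report[ch]} - {note}")
    return lines


# Toy data standing in for the real dictionary files.
data = {"ko.txt": ["p\u0348a"], "sw.txt": ["\u1D50ba", "a"]}
for line in format_report(symbols_by_file(data, allowed=set("abp"))):
    print(line)
```

Keeping the notes next to the check means that a symbol reported as "undocumented" is either a new error or an exception that still needs an entry in the README.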