
Non-standard characters in IPA

Open dohliam opened this issue 10 months ago • 6 comments

Fix non-standard characters appearing in IPA section of entries (see full list in #56).

dohliam avatar Mar 06 '25 16:03 dohliam

Here are the rewriting rules I use to handle most of the non-standard symbols (here and elsewhere):

alternatives = {
    "\u02FA": "\u031A",  # end high tone instead of combining left angle above
    "\u0067": "\u0261",  # normal g instead of script g
    "\uF25F": "\u2C71",  # v with right hook in certain fonts
    "\u007C\u007C": "\u2016",  # vertical line twice instead of double vertical line
    "\u003A": "\u02D0",  # colon instead of modifier letter triangular colon
    "\u0021": "\u01C3",  # exclamation mark instead of retroflex click
    "\u025A": "\u0259\u02DE",  # schwa with hook instead of schwa + rhotic hook
    "\u025D": "\u025C\u02DE",  # reversed open e with hook instead of reversed open e + rhotic hook
    "\u02A3": "\u0064\u0361\u007A",  # affricate
    "\u02A4": "\u0064\u0361\u0292",  # affricate
    "\u02A5": "\u0064\u0361\u0291",  # affricate
    "\u02A6": "\u0074\u0361\u0073",  # affricate
    "\u02A7": "\u0074\u0361\u0283",  # affricate
    "\u02A8": "\u0074\u0361\u0255",  # affricate
    "\u03B5": "\u025B",  # epsilon instead of open e
    "\u01DD": "\u0259",  # turned e instead of schwa
    "\u026B": "\u006C\u0334",  # L with Middle Tilde instead of L + combining tilde overlay
    "\u200D": "\u035C",  # used in en_UK.txt for tied phonemes it seems
    "\u0020": "",  # blank space for word separation, might be kept
    "\u00B2": "",  # is before most sv.txt pronunciation, seems to indicate nothing
    "\u2040": "\u203F",  # used once in de.txt for linking
    "\u0030\u0072": "\u0072\u0325",  # 0 is used in is.txt instead of voiceless diacritic
    "\u0030": "\u0072\u0325",  # þverfaglegt is missing the r in is.txt
    "\u0023": "\u002E",  # seems to be used for non diphthong sounds in is.txt, replaced by full stop
    "\u0027": "\u02C8",  # using apostrophe instead of primary stress
    "\u030D": "\u0329",  # vertical line above instead of vertical line under, normal for some letters, might be kept
    "\u2193": "\uA71C",  # downwards arrow instead of raised down arrow
    "\u2191": "\uA71B",  # upwards arrow instead of raised up arrow
    "\u1EA1": "\u0061",  # Naz in de.txt
    "\u005F": "\u0063",  # _ instead of c in is.txt
    "\u002D": "",  # word separator in multiple languages, might be kept
    "\u007E": "",  # framtíðarhorfur in is.txt
    "\u2014": "",  # in der Pipeline in de.txt
    "\u003F": "",  # before some pronunciation in sv.txt + เญียง in tts.txt, seems to indicate nothing
    "\u0311": "\u032F",  # combining inverted breve above instead of combining inverted breve under, normal for some letters, might be kept
    "\u02B1": "\u0324",  # Modifier small h with hook for breathy voiced instead of combining diaresis below
    "\u02C0": "\u0330",  # modifier glottal stop for creaky voiced instead of combining tilde below
    "\u0348": "",  # non IPA, used in ko.txt for tensed consonants/faucalized voice, maybe need to be kept still ?
    "\u1d50\u0253": "\u006D\u0361\u0253",  # prenasalization
    "\u1d50\u0076": "\u006D\u0361\u0076",  # prenasalization
    "\u1D51\u0261": "\u014B\u0361\u0261",  # prenasalization
}

This handles most of the errors, except for ja.txt and fa.txt, which contain Japanese/Arabic-script symbols I don't know how to remove. Beyond these rules, there are some parentheses and the ʶ (\u02B6) that happens in nor ni and nu wor in de.txt that I don't know how to handle.

With each rule I included a comment on what I saw when inspecting the data, sometimes with reservations (e.g. as said by the International Phonetic Association, "Some diacritics may be placed above a symbol with a descender."). These rewriting rules might not always apply here (some are safeguards for data from other sources).
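As a rough illustration of how a table like the one above could be applied, here is a minimal Python sketch. The function name `apply_rules` and the excerpt `sample_rules` are made up for this example and are not taken from any existing script. It assumes that longer keys should win over their own prefixes (so that "0r" is handled before a lone "0") and that replacement output should not be rewritten a second time.

```python
import re


def apply_rules(text: str, rules: dict[str, str]) -> str:
    """Rewrite `text` in a single pass using a table of substitutions."""
    if not rules:
        return text
    # Try longer keys first so that a two-character key such as "0r"
    # is matched before its one-character prefix "0".
    keys = sorted(rules, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(key) for key in keys))
    # A single regex pass means replacement output is never rewritten again.
    return pattern.sub(lambda match: rules[match.group(0)], text)


# Small excerpt of the table above, for illustration only.
sample_rules = {
    "\u0030\u0072": "\u0072\u0325",  # "0r" -> r + voiceless ring
    "\u0030": "\u0072\u0325",        # lone "0" -> r + voiceless ring
    "\u0027": "\u02C8",              # apostrophe -> primary stress
    "\u003A": "\u02D0",              # colon -> length mark
}

print(apply_rules("'a:", sample_rules))
```

Rules that are marked "might be kept" (spaces, hyphens) could simply be left out of the dictionary passed in, so the same function serves both a strict and a lenient cleanup.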

Also, some symbols are used that have no IPA equivalent (faucalized voice in Korean, dashes as word separators, etc.). I'm not sure it is advisable to delete them, although issuing a warning somewhere could be nice in that case.

Also, a link to the official IPA chart. Might come in handy, this chart also contains Unicode symbols.

RobinSobczyk avatar Mar 10 '25 11:03 RobinSobczyk

Thanks @RobinSobczyk! I have gone through the original list you provided in #56 and fixed all of the characters that appear to not be IPA. A lot of them were simply errors in the source data --- for example the 0 included in is.txt which obviously was meant to be an underring diacritic indicating voicelessness, or the ASCII apostrophe used in many places instead of the IPA primary stress marker.

Other examples included brackets or forward slashes intended to indicate alternative pronunciations which have now been reformatted accordingly, and various junk characters that found their way into the data at some point, likely during conversion, extraction, or processing.

A few items that you listed have not been changed as they seem to be valid IPA characters in their context. These are listed below:

  • ᵝ, hex code : 0x1d5d
    • Used only in the Japanese dictionary. The superscript form here specifically indicates compressed vowel roundedness.
  • ᵐ, hex code : 0x1d50
    • Used only in the Swahili dictionary. Another superscript form, this time indicating a bilabial nasal consonant. Included in extIPA character set. We could arguably consider replacing this with plain m since this is intended to be a phonemic rather than phonetic representation.
  • ̍, hex code : 0x30d
    • Occurrences are all in German file. This is the Combining Vertical Line Above, and "marks syllabicity on a letter with a descender, such as ⟨ŋ̍⟩".
  • ⁀, hex code : 0x2040
    • Only in de.txt. This is a tie bar, a ligature used in IPA notation to indicate double articulation among other things. It is only used in one entry in the dictionary, but does not appear to be incorrect or non-standard.
  • ͈, hex code : 0x348
    • This is used only in the Korean data. It is a Combining double vertical line below, "used to denote the tensed consonants /p͈/, /t͈/, /k͈/, /t͈ɕ/, /s͈/" and "used in literature in the context of Korean phonology for faucalized voice". Also included in extIPA. As with Swahili, we could consider replacing this with a more phonemic representation or removing it, but this can wait for subject matter experts to weigh in.
  • ̑, hex code : 0x311
    • Only used in the German file and only together with y. This is the Combining Inverted Breve, placed above the letter. It indicates that y is non-syllabic (in other words, a semivowel). Will leave it up to German linguists to decide whether this is a sufficiently important distinction to preserve.

@RobinSobczyk before closing this issue, could you run the current version of the data through your script again and see if there are any errors that were missed (aside from the exceptions listed above)?

dohliam avatar Mar 11 '25 08:03 dohliam

I still find the following non-IPA symbols after applying NFD decomposition to the Unicode symbols (most of them are handled by the rewriting rules I gave):

  • #, hex code : 0x23
  • ², hex code : 0xb2
  • ᵐ, hex code : 0x1d50 -> already discussed
  • ̍, hex code : 0x30d -> already discussed
  • ˀ, hex code : 0x2c0
  • ɚ, hex code : 0x25a
  • ɝ, hex code : 0x25d
  • ǧ, hex code : 0x1e7
  • g, hex code : 0x67
  • ĝ, hex code : 0x11d
  • ᵝ, hex code : 0x1d5d -> already discussed
  • ʶ, hex code : 0x2b6
  • :, hex code : 0x3a
  • ⁀, hex code : 0x2040 -> already discussed
  • ͈, hex code : 0x348 -> already discussed
  • ğ, hex code : 0x11f
  • ɫ, hex code : 0x26b
  • ̑, hex code : 0x311 -> already discussed
  • ʱ, hex code : 0x2b1
  • -, hex code : 0x2d
  • , hex code : 0x20
  • ʤ, hex code : 0x2a4

I'd like to add that I think \u2040 is a linking mark rather than a tie (as it sits between two words). Hence, changing it to the regular linking symbol would be nice (IPA provides ties above and below, but only a below form for linking, from what I see).

Also, letters based on g might be flagged because I excluded the classic g from the allowed symbols. Usually, those with accents would also be caught by NFD + rewriting rules.

I can also run the script without NFD if you want, to see which symbols might have to be replaced by their unicode decomposition.
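For readers unfamiliar with this kind of check, the following is a minimal sketch of what NFD decomposition followed by an allow-list test could look like in Python. It is not the script used in this thread: `find_unexpected`, `describe`, and the tiny `ALLOWED_SAMPLE` set are hypothetical names and data chosen for illustration, and a real allow-list would have to cover the whole IPA chart.

```python
import unicodedata

# Hypothetical, deliberately tiny allow-list; a real one would cover the
# full IPA chart. Plain "g" (U+0067) is left out on purpose.
ALLOWED_SAMPLE = set("\u0261a\u02D0\u02C8")


def find_unexpected(text: str, allowed: set[str]) -> list[str]:
    """Return the characters of NFD-normalized `text` not in `allowed`."""
    decomposed = unicodedata.normalize("NFD", text)
    return sorted({ch for ch in decomposed if ch not in allowed})


def describe(ch: str) -> str:
    """Format a character the way the lists in this thread do."""
    return f"{ch}, hex code : {hex(ord(ch))}"


for ch in find_unexpected("\u011Fa\u02D0", ALLOWED_SAMPLE):
    print(describe(ch))
```

With NFD, a precomposed letter such as ğ (U+011F) is split into a plain g plus a combining breve, which is why accented g letters surface as a flagged g when g itself is excluded from the allow-list.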

RobinSobczyk avatar Mar 11 '25 09:03 RobinSobczyk

Hi, can I help in any way to resolve and close this issue? For example, by providing a (Python) script to apply the rewriting rules, or anything else?

RobinSobczyk avatar Mar 31 '25 12:03 RobinSobczyk

@RobinSobczyk I have gone through the list you provided and made some changes -- further details for each point are below:

  • [x] #, hex code : 0x23
    • all in is.txt, seem to be secondary stress markers (now removed)
  • [ ] ², hex code : 0xb2
    • all in sv.txt, represents tone pitch pattern 2
  • [ ] ᵐ, hex code : 0x1d50 already discussed
    • all in sw.txt, represents prenasalized labial consonant
  • [ ] ̍, hex code : 0x30d already discussed
    • all in de.txt, marks syllabicity on a letter with a descender (e.g., ŋ̍)
  • [ ] ˀ, hex code : 0x2c0
    • all in vi_C.txt and vi_N.txt, represents glottalized consonant
  • [ ] ɚ, hex code : 0x25a
    • appears in zh.txt and de.txt, represents R-colored vowel
  • [ ] ɝ, hex code : 0x25d
    • appears only in en_US, represents R-colored vowel
  • [ ] ǧ, hex code : 0x1e7
    • not found in any file
  • [ ] g, hex code : 0x67
    • loop-tail g, acceptable graphic variant of open-tail ɡ in IPA
  • [ ] ĝ, hex code : 0x11d
    • only appears in fa.txt, should perhaps be replaced by voiced uvular stop ɢ, but this should be confirmed by someone with expertise in Persian phonology before changing
  • [ ] ᵝ, hex code : 0x1d5d already discussed
    • all in ja.txt, represents compressed unrounded vowel
  • [ ] ʶ, hex code : 0x2b6
    • only found in de.txt, represents uvularization
  • [ ] :, hex code : 0x3a
    • present in many files; in all cases this is standing in for the IPA triangular colon ː representing a long vowel or geminated consonant
    • in principle, these could be replaced with ː by search and replace, but as with the loop-tail g above, this may not be necessary, and keeping the ASCII colon makes the files a little easier to work with (comments from others using the data files are welcome)
  • [ ] ⁀, hex code : 0x2040 already discussed
  • [ ] ͈, hex code : 0x348 already discussed
    • all in ko.txt, represents faucalized voice in Korean
  • [x] ğ, hex code : 0x11f
    • only one instance, in fa.txt -- replaced with ɢ (voiced uvular plosive)
  • [ ] ɫ, hex code : 0x26b
    • used in de.txt and en_US.txt, represents velarized alveolar lateral approximant
  • [ ] ̑, hex code : 0x311 already discussed
    • all in de.txt, this is an inverted breve used to represent non-syllabic vowels
  • [ ] ʱ, hex code : 0x2b1
    • all in or.txt, represents voiced aspirated consonants in Odia
  • [x] -, hex code : 0x2d
    • used in a number of files, mostly to separate syllables
    • some could be deleted easily such as ma.txt
    • others could be removed by search and replace such as vi-*.txt, but the syllable separation seems to be useful and would be lost
    • also present in de.txt where it's a bit trickier to remove -- these will have to be corrected one entry at a time
      • used for a number of purposes in de.txt:
        • to indicate consonant gemination (this has now been replaced with ː after the initial consonant)
        • to indicate word boundaries even if the original orthography did not include a hyphen (for example Jiu Jitsu) -- these have now been removed
        • to indicate initial identical part of a pronunciation when there are two or more in total (this has now been fixed)
        • replication of the usage in the headword (for example, when headword is a prefix -- these have now been fixed)
  • [ ] ``, hex code : 0x20
    • not found in any file
  • [x] ʤ, hex code : 0x2a4
    • deprecated digraph representing various kinds of voiced alveolar affricate
    • has been replaced by d͡ʒ but is still widely used since the tie bar is annoying in practice
    • appears mostly in nl.txt (many examples) and de.txt (just once)
    • these have now been replaced in all files

As you can see from the list above, it seems that your script may be flagging some characters that are in fact acceptable in IPA, or useful for users of the data. As noted, these have not been changed.

dohliam avatar May 24 '25 06:05 dohliam

Hi! Thanks for the update. I updated my code to also report the files in which each symbol is found. I grouped the symbols according to what I think about them:

IPA standardizations that seem straightforward to me:

  • g, hex code : 0x67 ['es_ES.txt', 'fr_QC.txt', 'sv.txt', 'ma.txt', 'fr_FR.txt', 'jam.txt', 'eo.txt', 'or.txt', 'tts.txt', 'es_MX.txt', 'nb.txt', 'sw.txt', 'ja.txt', 'pt_BR.txt', 'fi.txt', 'ro.txt', 'km.txt']
    • I know loop-tail g is an acceptable variant, but I think it would be nice to make them uniform across the files, since the IPA gives a standard form
  • :, hex code : 0x3a ['jam.txt', 'yue.txt', 'tts.txt']
    • same as for g, making these uniform and following IPA as closely as possible might be nice
  • ɚ, hex code : 0x25a ['zh_hans.txt', 'zh_hant.txt', 'de.txt']
    • can be rewritten with ə and ˞ to follow IPA
  • ɝ, hex code : 0x25d ['en_US.txt']
    • can be rewritten with ɜ and ˞ to follow IPA
  • ɫ, hex code : 0x26b ['en_US.txt', 'de.txt']
    • can be rewritten with l and ̴ (the overlay tilde, which corresponds to velarization)
  • ⁀, hex code : 0x2040 ['de.txt']
    • can be replaced with undertie to match IPA linking symbol

Questions :

  • ², hex code : 0xb2 ['sv.txt']
    • can it be written with the tone symbols from IPA ?
  • -, hex code : 0x2d ['vi_C.txt', 'ma.txt', 'vi_S.txt', 'or.txt', 'ko.txt', 'vi_N.txt', 'sw.txt']
    • can't it be replaced by . since that is what IPA provides for syllable breaks?
  • ʱ, hex code : 0x2b1 ['or.txt']
    • can't voiced aspirated be made by combining the IPA symbols ̬ for voiced and ʰ for aspirated ?
  • ̑, hex code : 0x311 ['de.txt']
    • to me it seems that it's always used on y, so it might fall in the case of false positive because it's just a letter with a descender ?

Might not be solvable (at least with current IPA) :

  • ˀ, hex code : 0x2c0 ['vi_C.txt', 'vi_N.txt']
    • might not be expressible otherwise
  • ᵐ, hex code : 0x1d50 ['sw.txt']
    • might not be expressible otherwise
  • ᵝ, hex code : 0x1d5d ['ja.txt']
    • might not be expressible otherwise
  • ʶ, hex code : 0x2b6 ['de.txt']
    • might not be expressible otherwise
  • ͈, hex code : 0x348 ['ko.txt']
    • no possible alternative
  • ĝ, hex code : 0x11d ['fa.txt']
    • requires more expertise

False positive :

  • , hex code : 0x20 ['ko.txt', 'fi.txt', 'vi_C.txt', 'fr_QC.txt', 'jam.txt', 'zh_hans.txt', 'is.txt', 'sv.txt', 'km.txt', 'zh_hant.txt', 'de.txt', 'en_UK.txt', 'fr_FR.txt', 'eo.txt', 'nb.txt', 'yue.txt', 'vi_S.txt', 'vi_N.txt', 'ja.txt', 'fa.txt']
    • really just a space, I think it is flagged because of multi word pronunciations so let's ignore it
  • ̍, hex code : 0x30d ['de.txt']
    • used when placed with descenders, which is IPA compliant
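A per-file report like the one above could be produced along the following lines. This is only a sketch under stated assumptions: `files_by_symbol` is a hypothetical helper, the file names and contents in `sample_texts` are made up, and the allow-list is again a tiny stand-in for a full set of IPA symbols. File contents are passed in as strings so the example runs without touching the real dictionary files.

```python
import unicodedata
from collections import defaultdict


def files_by_symbol(texts: dict[str, str], allowed: set[str]) -> dict[str, list[str]]:
    """Map each character outside `allowed` to the files containing it."""
    found = defaultdict(set)
    for filename, text in texts.items():
        # Decompose first so precomposed letters are checked piece by piece.
        for ch in set(unicodedata.normalize("NFD", text)):
            if ch not in allowed:
                found[ch].add(filename)
    return {ch: sorted(names) for ch, names in found.items()}


# Made-up file contents standing in for the real dictionary files.
sample_texts = {
    "a.txt": "ka:",
    "b.txt": "ka\u02D0",
    "c.txt": "a:k",
}
report = files_by_symbol(sample_texts, allowed=set("ka\u02D0"))
for ch, names in report.items():
    print(f"{ch}, hex code : {hex(ord(ch))} {names}")
```

Known false positives such as the space or the combining vertical line above could be added to the allow-list so that they stop appearing in later runs.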

In the end, I think that staying as close as possible to IPA is important so the dataset can be easily used through code. Also, I think it might be worth keeping a note somewhere about the non-IPA symbols that were kept and what they mean, so people like me without much knowledge of pronunciation don't wonder why those symbols are there. The README could be the perfect place, to be honest. That way it is also easy to keep track of them if one day IPA evolves in a way that includes them, potentially with different symbols.

Thank you again for your work and those updates !

RobinSobczyk avatar May 27 '25 12:05 RobinSobczyk