Standardize `PronunciationKind` to BCP 47 tags
The current definition is too limiting and the Other variant doesn't give enough structure.
I think we should instead use BCP 47 tags, with the region, script, and variant subtags and the transformation extension (t) specified when relevant. When in doubt, consult the IANA Language Subtag Registry and the Unicode CLDR data for clarification.
The following table maps the current kinds to their BCP 47 equivalents:
| Old | New |
|---|---|
| IPA | und-fonipa[^1] |
| Pinyin | zh-Latn-pinyin |
| Hiragana | ja-Hira |
| Katakana | ja-Kana |
| Romaji | ja-Latn |
| Yale | ko-Latn*[^2] |
| Jyutping | yue-jyutping |
| Bopomofo | zh-Bopo or yue-Bopo |
| Hepburn | ja-Latn-hepburn or ja-Latn-alalc97[^3] |
If the language is unknown (unlikely), or the notation is purely typographic, use the und language tag.[^1]
Example
Consider this test from the codebase:
https://github.com/TheOpenDictionary/odict/blob/094dee15cdf2e3673652a6c6b4df355191d520bd/python/tests/test_pronunciation.py#L118-L135
Currently, both entries use the ipa kind, which is ambiguous. With BCP 47 tags, this would become:
  <dictionary>
    <entry term="hello">
      <ety>
-       <pronunciation kind="ipa" value="həˈləʊ">
+       <pronunciation kind="en-GB-fonipa" value="həˈləʊ">
          <url src="./hello-british.mp3" />
        </pronunciation>
-       <pronunciation kind="ipa" value="hɛˈloʊ">
+       <pronunciation kind="en-fonipa" value="hɛˈloʊ">
          <url src="./hello-american.mp3" />
        </pronunciation>
        <sense pos="adj">
          <definition value="A greeting" />
        </sense>
      </ety>
    </entry>
  </dictionary>
This models the difference between the two pronunciations.
Furthermore, this example no longer needs a custom kind, since it is simply encoded as zh-Latn-wadegile:
https://github.com/TheOpenDictionary/odict/blob/094dee15cdf2e3673652a6c6b4df355191d520bd/examples/pronunciation_example.xml#L78-L80
This automatically makes the schema more extensible, supporting more languages/systems without having to touch the codebase.
Rationale
Beyond supporting more languages out of the gate, adopting this system would allow ODict clients to perform Unicode transforms robustly. Take, for example, https://github.com/TheOpenDictionary/odict/blob/094dee15cdf2e3673652a6c6b4df355191d520bd/examples/pronunciation_example.xml#L44-L46
Suppose that I am a Chinese dictionary user who prefers to use numbers for tones instead of accents.
If the pronunciation were annotated with zh-Latn-pinyin, my client could use the ICU Latin-NumericPinyin transform to convert it to numeric form automatically, e.g. Ni3 hao3, ren4shi ni3 hen3 gao1xing2
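In practice a client would call ICU for this. Purely to illustrate what the transform does, here is a hand-rolled sketch for a single lowercase syllable written with precomposed tone marks. It is not the ICU implementation, the function names are hypothetical, and it deliberately ignores multi-syllable words, punctuation, and uppercase marked vowels.

```rust
/// Splits a tone-marked pinyin vowel into its base vowel and tone number (1-4).
/// Returns `None` for any character that carries no tone mark.
/// Assumption: `ü` is kept as `ü`; real transforms may spell it differently.
fn split_tone(c: char) -> Option<(char, u8)> {
    const TABLE: [(&str, char); 6] = [
        ("āáǎà", 'a'),
        ("ēéěè", 'e'),
        ("īíǐì", 'i'),
        ("ōóǒò", 'o'),
        ("ūúǔù", 'u'),
        ("ǖǘǚǜ", 'ü'),
    ];
    for &(marked, base) in TABLE.iter() {
        // The position within each group is the tone number minus one.
        if let Some(pos) = marked.chars().position(|m| m == c) {
            return Some((base, pos as u8 + 1));
        }
    }
    None
}

/// Converts one tone-marked syllable (e.g. "hǎo") to numeric form ("hao3").
/// A syllable without a tone mark is returned unchanged.
pub fn syllable_to_numeric(syllable: &str) -> String {
    let mut out = String::new();
    let mut tone = None;
    for c in syllable.chars() {
        match split_tone(c) {
            Some((base, t)) => {
                out.push(base);
                tone = Some(t);
            }
            None => out.push(c),
        }
    }
    if let Some(t) = tone {
        // The tone digit goes at the end of the syllable.
        out.push_str(&t.to_string());
    }
    out
}
```

The point is that the client only knows this conversion is safe to apply because the tag says the value is pinyin; with an untyped or custom kind it would have to guess.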
Furthermore, this system helps with typesetting ruby text by allowing mixed Hiragana and Katakana pronunciations via ja-Hrkt. See japanese-furigana-normalize for more info.
Change
For backwards compatibility, we should alias the previous kinds to their new BCP 47 equivalents. Furthermore, I think we should warn when we encounter an Other that is not a valid BCP 47 tag (and optionally when we encounter und for the language[^1]). If we're willing to make a breaking change, then perhaps the following schema is best:
pub enum PronunciationKind {
    Bcp47(icu_locale_core::LanguageIdentifier), // or use String if you don't want the dependency
    #[strum(to_string = "{0}")]
    #[serde(untagged)]
    Other(String),
}
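The backwards-compatibility aliasing mentioned above could look roughly like this. It is a minimal sketch using only the standard library: the function name `legacy_kind_to_bcp47` is hypothetical, the mapping follows the table at the top of this issue, and wherever the table leaves a choice open (Yale, Bopomofo, Hepburn) the default picked here is an assumption.

```rust
/// Hypothetical helper: maps a legacy `PronunciationKind` name to the BCP 47
/// tag proposed in the table above. Matching is case-insensitive.
/// Returns `None` for unknown names so the caller can warn and fall back.
pub fn legacy_kind_to_bcp47(kind: &str) -> Option<&'static str> {
    match kind.to_ascii_lowercase().as_str() {
        "ipa" => Some("und-fonipa"),
        "pinyin" => Some("zh-Latn-pinyin"),
        "hiragana" => Some("ja-Hira"),
        "katakana" => Some("ja-Kana"),
        "romaji" => Some("ja-Latn"),
        // Assumption: Yale goes to a private-use subtag (see footnote 2).
        "yale" => Some("ko-Latn-x-yale"),
        "jyutping" => Some("yue-jyutping"),
        // Assumption: default to Mandarin; the table also allows yue-Bopo.
        "bopomofo" => Some("zh-Bopo"),
        // Assumption: keep the explicit Hepburn variant (see footnote 3).
        "hepburn" => Some("ja-Latn-hepburn"),
        _ => None,
    }
}
```

A parser could try this lookup first and only then treat the value as a raw BCP 47 tag, which keeps existing dictionaries loading unchanged.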
Update (Oct 1)
I think it's better to get rid of PronunciationKind entirely and use icu_locale_core::LanguageIdentifier (or String) instead. If you had any data that didn't fit exactly into a BCP 47 tag, you could put it in a private-use subtag, just as with Yale.[^2]
[^1]: It's best to specify the language, as in the example above, since IPA notation changes from language to language.
[^2]: Korean is in a more peculiar position: ko-Latn-alalc97 (Modified McCune–Reischauer) seems to be the only variant specified in Unicode CLDR. According to CLDR, the default transform for ko-Latn is the Revised Romanization of Korean (RR) (also indicated by the transform flag -t-m0-mcst or -t-m0-bgn), which suggests that RR should be the default interpretation of ko-Latn. Unfortunately, this means we either have to model Yale ambiguously alongside RR under ko-Latn, or use a private-use subtag (e.g. ko-Latn-x-yale). Funnily enough, ko-Latn-KP would most likely model North Korea's Romanization of Korean system.
[^3]: You should almost always use ja-Latn-alalc97 (Library of Congress), as that is what's commonly used. If you're sure you're using Traditional Hepburn, use ja-Latn-hepburn. See the Wikipedia page on Hepburn romanization for the differences.
Hey @Waelwindows! Sorry I'm a bit late to reviewing this. Very interesting proposal! My main concern is our ability to control how this type is serialized, especially if it lives in a crate we don't own. For example, I'm not a huge fan of the fact that some of these tags are mixed case, as the convention in our XML has been to keep things lowercase or case-insensitive. Holistically I do think it could be a welcome change, but I'm curious about your thoughts on that. The other consideration is how this type will be serialized to binary, as ODict's types all have rkyv type annotations that allow them to exist as Archived* types as well.
The other choice, as you've highlighted, is just using a string, but then we may still run into casing concerns if the iso crate is case-sensitive. My goal with ODXML has been to try to keep it somewhat intuitive and easy to use.
Actually, thinking on this a bit more – it looks like icu_locale_core breaks LanguageIdentifier into language, region, script, and variants. It also looks like it may be inherently case-insensitive. What are your thoughts about just having these as separate attributes on the block for readability?
Thanks for reviewing!
> What are your thoughts about just having these as separate attributes on the block for readability?
I think it's better if we stick to one value rather than decompose it. BCP 47 is an established standard used on the web and should be familiar to anybody using HTML. Keeping it intact also bolsters interoperability, since other tools speak it (I could immediately use CLDR transforms without having to reconstruct the BCP 47 tag). Furthermore, it's simply more compact!
Here's an example from the English–Chinese Wiktionary that is compact in BCP 47 form:
<entry term="大姊" rank="20673">
  <ety>
    <pronunciation kind="yue-Latn-jyutping" value="daai⁶ zi²"/>
    <pronunciation kind="nan-Latn-pehoeji" value="tōa-chí"/>
    <pronunciation kind="nan-Latn-pehoeji" value="tōa-ché"/>
    <pronunciation kind="yue-Latn-jyutping" value="daai⁶ zi²"/>
    <pronunciation kind="yue-Latn-x-yale" value="daaih jí"/>
    <pronunciation kind="yue-Latn-pinyin" value="daai⁶ dzi²"/>
    <pronunciation kind="yue-Latn-GD" value="dai⁶ ji²"/>
    <pronunciation kind="yue-Latn-fonipa" value="/taːi̯²² t͡siː³⁵/"/>
    <pronunciation kind="nan-Latn-pehoeji" value="tōa-chí"/>
    <pronunciation kind="nan" value="tuā-tsí"/>
    <pronunciation kind="nan" value="doaxcie"/>
    <pronunciation kind="nan-Latn-TW-fonipa" value="/tua³³⁻¹¹ t͡si⁵³/"/>
    <pronunciation kind="nan-Latn-fonipa" value="/tua²²⁻²¹ t͡si⁵³/"/>
    <pronunciation kind="nan-Latn-fonipa" value="/tua⁴¹⁻²² t͡si⁵⁵⁴/"/>
    <pronunciation kind="nan-Latn-fonipa" value="/tua³³⁻²¹ t͡si⁴¹/"/>
    <pronunciation kind="nan-Latn-pehoeji" value="tōa-ché"/>
    <pronunciation kind="nan" value="tuā-tsé"/>
    <pronunciation kind="nan" value="doaxzea"/>
    <pronunciation kind="nan-Latn-TW-fonipa" value="/tua³³⁻¹¹ t͡se⁵³/"/>
    <pronunciation kind="nan-Latn-fonipa" value="/tua³³⁻²¹ t͡se⁴¹/"/>
  </ety>
</entry>
Breaking down one entry into
<pronunciation lang="nan" script="Latn" region="TW" value="/tua³³⁻¹¹ t͡si⁵³/">
  <variant value="fonipa" />
</pronunciation>
is too much in my opinion, and wouldn't compose well with the other extensions. ~~Funnily enough, you can see that even standard BCP 47 tags aren't enough to encode Wiktionary.~~
As for serialization, we could just keep the BCP 47 string itself as the wire format. As you noted, lowercasing is fine (BCP 47 tags are case-insensitive), although a bit unconventional. Furthermore, you could even switch up the order of the tags within reason. You could go crazy with the encoding if you want, but string interning should give you almost all of the space savings if you're concerned (it looks like rkyv supports that too).
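Because casing carries no meaning in a BCP 47 tag, a lowercase wire format loses nothing: the conventional casing can be restored on read. A rough sketch of that, following the casing convention in RFC 5646 section 2.1.1 (two-letter subtags uppercase, four-letter subtags title case, everything else lowercase, and all lowercase at the start of the tag or after a singleton). `canonical_case` is a hypothetical name, and the function does no validation.

```rust
/// Restores the conventional BCP 47 casing of a tag, whatever casing it
/// was stored in. Assumes an ASCII, hyphen-separated tag; does not validate.
pub fn canonical_case(tag: &str) -> String {
    let mut after_singleton = false;
    let mut parts: Vec<String> = Vec::new();
    for (i, sub) in tag.split('-').enumerate() {
        let lower = sub.to_ascii_lowercase();
        let cased = if i == 0 || after_singleton {
            // The language subtag, and anything after a singleton such as
            // `x` or `t`, stays lowercase.
            lower
        } else if sub.len() == 2 {
            // Two-letter subtags (regions) are uppercase.
            lower.to_ascii_uppercase()
        } else if sub.len() == 4 {
            // Four-letter subtags (scripts) are title case.
            let mut chars = lower.chars();
            match chars.next() {
                Some(first) => first.to_ascii_uppercase().to_string() + chars.as_str(),
                None => String::new(),
            }
        } else {
            lower
        };
        if sub.len() == 1 {
            after_singleton = true;
        }
        parts.push(cased);
    }
    parts.join("-")
}
```

With something like this, ODXML could accept and store `nan-latn-tw-fonipa` while tools that expect conventional casing still receive `nan-Latn-TW-fonipa`.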
So I started a PR to investigate a solution here – something that occurred to me, however, is that because region, script and variants are optional for LanguageIdentifier, I feel as though they would no longer accurately describe the kind of pronunciation. For example, in the snippet you shared:
<pronunciation kind="nan" value="tuā-tsí"/>
<pronunciation kind="nan" value="doaxcie"/>
both pronunciations are Min Nan, though the romanization system is different (from what I can tell). Shouldn't the kind hold a stronger opinion in communicating the phonetic system?
LanguageIdentifier could definitely be helpful in defining the language of a Dictionary or Entry, but I worry that out-of-the-box the type wouldn't be doing enough to describe the kind of pronunciation system, just its language.
> For example, in the snippet you shared
Pardon me for the example. It's from a work-in-progress change for https://github.com/TheOpenDictionary/dictionaries.
You are right that both entries belong to nan (actually Hokkien); they use the Tai-lo and Phofsit-Daibuun romanizations, respectively.
Furthermore, the entries
<pronunciation kind="nan-Latn-TW-fonipa" value="/tua³³⁻¹¹ t͡si⁵³/"/>
<pronunciation kind="nan-Latn-fonipa" value="/tua²²⁻²¹ t͡si⁵³/"/>
<pronunciation kind="nan-Latn-fonipa" value="/tua⁴¹⁻²² t͡si⁵⁵⁴/"/>
<pronunciation kind="nan-Latn-fonipa" value="/tua³³⁻²¹ t͡si⁴¹/"/>
correspond to
{"ipa": "/tua³³⁻¹¹ t͡si⁵³/", "raw_tags": ["General Taiwanese"], "tags": ["Min-Nan", "Hokkien", "Xiamen", "Quanzhou", "Zhangzhou", "IPA", "Taipei"]},
{"ipa": "/tua²²⁻²¹ t͡si⁵³/", "raw_tags": ["General Taiwanese"], "tags": ["Min-Nan", "Hokkien", "Xiamen", "Quanzhou", "Zhangzhou", "IPA"]},
{"ipa": "/tua⁴¹⁻²² t͡si⁵⁵⁴/", "raw_tags": ["General Taiwanese"], "tags": ["Min-Nan", "Hokkien", "Xiamen", "Quanzhou", "Zhangzhou", "IPA"]},
{"ipa": "/tua³³⁻²¹ t͡si⁴¹/", "raw_tags": ["General Taiwanese"], "tags": ["Min-Nan", "Hokkien", "Xiamen", "Quanzhou", "Zhangzhou", "IPA", "Kaohsiung"]},
which also indicate regional sub-dialects in Fujian that Unicode doesn't encode directly.
Rendered here from Wiktionary (ambiguous JSON? Maybe a kaikki extraction error?).
Unfortunately, due to geopolitical reasons, Hokkien isn't extensively encoded in Unicode. For now we can remedy this with a private-use scheme until Unicode catches up (e.g. nan-Latn-x-ortho-tailo and nan-Latn-x-ortho-psdb).
However, this isn't an argument against using BCP 47. You can't and shouldn't force users to specify the script, since sometimes the source is unknown or too niche. For example, how would I specify Webster's 1913 phonetic system? The entry for "pathos" is pā′thŏs. It's some kind of en(-Latn), but it isn't codified anywhere except in that dictionary.
The tag system should be best effort: dictionary authors should be able to specify as much detail as they want, and the BCP 47 system should allow graceful fallback rules (e.g. my system may not know how to handle nan-TW, but it can fall back to general nan rules).
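One way to get that graceful fallback is the truncation scheme from the RFC 4647 "Lookup" algorithm: drop subtags from the end one at a time, removing a dangling singleton (such as the `x` of a private-use sequence) together with the subtag it introduced. A minimal sketch, where `fallback_chain` is a hypothetical name and no validation is performed:

```rust
/// Returns the tag followed by its progressively less specific fallbacks,
/// most specific first, by truncating subtags from the end.
pub fn fallback_chain(tag: &str) -> Vec<String> {
    let mut subtags: Vec<&str> = tag.split('-').collect();
    let mut chain = Vec::new();
    while !subtags.is_empty() {
        chain.push(subtags.join("-"));
        subtags.pop();
        // A singleton left at the end (e.g. `x`) is meaningless on its own,
        // so it is removed along with the subtag that followed it.
        if subtags.last().map_or(false, |s| s.len() == 1) {
            subtags.pop();
        }
    }
    chain
}
```

A client would walk the chain and use the first tag it has rules for, so `nan-Latn-TW-fonipa` degrades to `nan-Latn-TW`, then `nan-Latn`, then plain `nan`.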
Perhaps we should have an optional note attribute? This could help distinguish formal from informal pronunciations and other details that aren't as easily representable in BCP 47 tags (we should still prefer BCP 47, as we can localize that).
Hmm, I'm coming around to the idea of having multiple IDs in place, with BCP 47 being the main ID. Maybe something like the following:
<pronunciation bcp47="nan-Latn-TW-fonipa" value="/tua³³⁻¹¹ t͡si⁵³/">
  <id qualifier="region" source="wikidata" value="Q1867">Taibei</id>
  <id qualifier="language" source="glottolang" value="taib1242">Taibei Hokkien</id>
</pronunciation>
<pronunciation bcp47="nan-Latn-fonipa" value="/tua²²⁻²¹ t͡si⁵³/">
  <id qualifier="region" source="wikidata" value="Q68744">Xiamen</id>
  <id qualifier="region" source="wikidata" value="Q68814">Zhangzhou</id>
  <id qualifier="language" source="wikidata" value="Q2705752">Xiamen</id>
  <id qualifier="language" source="wikidata" value="Q8070492">Zhangzhou</id>
  <id qualifier="language" source="glottolang" value="xiam1236">Xiamen</id>
  <id qualifier="language" source="glottolang" value="fuji1236">Zhangzhou</id>
</pronunciation>
Note that the text content of each id tag is a suggested rendering that clients may use to display the ID.
This also solves our previous issue of representing orthography systems, since any system described on Wikipedia also has a corresponding Wikidata ID, IIUC.
Thus, Korean Yale romanization is simply encoded as follows:
<entry term="영어">
  <ety>
    <pronunciation bcp47="ko-Latn" value="yenge">
      <id qualifier="transliteration" source="wikidata" value="Q16256856">Yale Romanization</id>
    </pronunciation>
  </ety>
</entry>
I'm envisioning this system as a supplement to the BCP 47 tag idea: whatever you cannot represent in the tag should be placed in these ids.
Hm, this is interesting! I definitely like the idea of an <id> tag to link out to additional resources, seeing as these could also be used for things like entries. Though, seeing as ODict blocks can technically accept id attributes (to denote unique entries within a dictionary), I worry these two concepts might get conflated. Curious what your thoughts are on that. I also think calling the attribute bcp47 is better than kind, because it's at least explicit about what the ID represents if we are not splitting it into separate fields. I can update my PR to reflect these changes when I have a spare moment.
On the topic of standardization, I'm also curious about your thoughts on this similar issue I just opened discussing UD: #1332