CLDR-17115 Update languages/codes
- Numerous changes based on following instructions in Update Language/Script/Region Subtags
- Error summary (running locally):
  - CLDR/TestLocale/testLanguageTagParserIsValid
  - CLDR/TestSupplementalInfo/TestMacrolanguages
  - CLDR/TestValidity/TestCompatibility
  - CLDR/TestValidity/TestLstrConsistency
CLDR-17115
- [ ] This PR completes the ticket.
ALLOW_MANY_COMMITS=true
@macchiati as you requested, I reverted scripts.xml in the last commit
Per discussion, I've reverted iso_3166_status.txt to main
It looks like there are a surprising number of errors. I think it is best for me to walk you through this, and you can capture these notes in the instructions.
It appears that ISO had an unexpected number of deprecations, so you're seeing more issues than we normally see.
For lines like the following:
Error: (TestLocale.java:921) Error: : ajp: expected "", got "Disallowed language=ajp, status=deprecated"
Error: (TestLocale.java:927) Error: : ajp_Arab_JO: expected "", got "Disallowed language=ajp, status=deprecated"
What is happening is that likelySubtags.xml is handling languages that are now deprecated. That is to be expected, because ISO does that occasionally, but because we added a lot of SIL language data, the number may be larger each year. To fix that, go to the file and delete the line where the deprecated language is handled, in this case:
<likelySubtag from="ajp" to="ajp_Arab_JO" origin="sil1"/>
TestMacrolanguages Error: (TestSupplementalInfo.java:1328) Error: Macrolanguage sa Sanskrit Historical
It looks like the classification changed in ISO. We still use 'sa', because the India government disagrees that it is only historical! Add 'sa' to the special cases in: if (language.equals("no") || language.equals("sh")) continue; // special cases
TestCompatibility Error: (TestValidity.java:284) Error: language:dzd:deprecated => regular // add to exception list (ALLOWED_UNDELETIONS) if really un-deprecated
Check the diff in the iso-639 files to verify that dzd is really de-deprecated. Then add dzd to ALLOWED_UNDELETIONS
The "ERROR:" values below in the listing all look like keyboard stuff; I don't think those are counted. I'll file a ticket for Steven to clean those up.
Good work. That verifies that it is indeed an intentional change.
On Thu, Mar 14, 2024 at 8:26 AM Tom Bishop @.***> wrote:
In common/validity/language.xml https://github.com/unicode-org/cldr/pull/3538#discussion_r1525078910:
baz bbz bcc bcl bgm bh bhk bic bij bjd bjq bkb blg bmy bpb btb btl bxk bxr bxx byy
cbe cbh cca ccq cdg cjr cka cld cmk cmn cnr coy cqu cug cum cwd
daf dap dgo dgu dha dhd dik diq dit djl dkl drh drr drw dud duj dwl dzd
daf dap dgo dgu dha dhd dik diq dit djl dkl drh drr drw dud duj dwl
"dzd" was removed here
Web search for "dzd deprecated" turns up this file:
https://www.iana.org/assignments/lang-subtags-templates/dzd-2023-03-17.txt
which reads as follows:
FOR ARCHIVING: Registration form for 'dzd'
LANGUAGE SUBTAG REGISTRATION FORM
1. Name of requester: Doug Ewell
2. E-mail address of requester: doug at ewellic.org
3. Record Requested:
   Type: language
   Subtag: dzd
   Description: Daza
4. Intended meaning of the subtag:
5. Reference to published description of the language (book or article):
6. Any other relevant information:
   This registration tracks a change made to ISO 639-3 effective 2023-01-20, adding the code element 'dzd' for Daza, which had been retired in 2015 as non-existent. The net effect of this registration is to remove the Deprecated value from this record.
   For more information on the ISO 639-3 change, refer to: https://iso639-3.sil.org/request/2022-027
@macchiati my latest commit fixes the problems with sa (Sanskrit) and dzd (Daza). It does not fix the problems with ajp and others in this output:
testLanguageTagParserIsValid {
Error: (TestLocale.java:921) : ajp: expected "", got "Disallowed language=ajp, status=deprecated"
Error: (TestLocale.java:927) : ajp_Arab_JO: expected "", got "Disallowed language=ajp, status=deprecated"
Error: (TestLocale.java:921) : kgm: expected "", got "Disallowed language=kgm, status=deprecated"
Error: (TestLocale.java:927) : kgm_Latn_BR: expected "", got "Disallowed language=kgm, status=deprecated"
Error: (TestLocale.java:921) : ksa: expected "", got "Disallowed language=ksa, status=deprecated"
Error: (TestLocale.java:927) : ksa_Latn_NG: expected "", got "Disallowed language=ksa, status=deprecated"
Error: (TestLocale.java:921) : nom: expected "", got "Disallowed language=nom, status=deprecated"
Error: (TestLocale.java:927) : nom_Latn_PE: expected "", got "Disallowed language=nom, status=deprecated"
Error: (TestLocale.java:921) : plj: expected "", got "Disallowed language=plj, status=deprecated"
Error: (TestLocale.java:927) : plj_Latn_NG: expected "", got "Disallowed language=plj, status=deprecated"
Error: (TestLocale.java:921) : prp: expected "", got "Disallowed language=prp, status=deprecated"
Error: (TestLocale.java:927) : prp_Gujr_IN: expected "", got "Disallowed language=prp, status=deprecated"
Error: (TestLocale.java:921) : slq: expected "", got "Disallowed language=slq, status=deprecated"
Error: (TestLocale.java:927) : slq_Arab_IR: expected "", got "Disallowed language=slq, status=deprecated"
Error: (TestLocale.java:921) : szd: expected "", got "Disallowed language=szd, status=deprecated"
Error: (TestLocale.java:927) : szd_Latn_MY: expected "", got "Disallowed language=szd, status=deprecated"
Error: (TestLocale.java:921) : tmk: expected "", got "Disallowed language=tmk, status=deprecated"
Error: (TestLocale.java:927) : tmk_Deva_NP: expected "", got "Disallowed language=tmk, status=deprecated"
Error: (TestLocale.java:921) : xss: expected "", got "Disallowed language=xss, status=deprecated"
Error: (TestLocale.java:927) : xss_Cyrl_RU: expected "", got "Disallowed language=xss, status=deprecated"
Error: (TestLocale.java:921) : zkb: expected "", got "Disallowed language=zkb, status=deprecated"
Error: (TestLocale.java:927) : zkb_Cyrl_RU: expected "", got "Disallowed language=zkb, status=deprecated"
Error: (TestLocale.java:921) : zua: expected "", got "Disallowed language=zua, status=deprecated"
Error: (TestLocale.java:927) : zua_Latn_NG: expected "", got "Disallowed language=zua, status=deprecated"
You addressed these errors in your last comment, but I still don't understand; they're different from the "sa" error.
"ajp" occurs in languageGroup.xml, languageInfo.xml, and likelySubtags.xml. Should it be deleted from languageInfo.xml, and/or likelySubtags.xml, and then should languageGroup.xml be regenerated?
Here is what to do in more detail.
Case 1, replaced by old:
Take ajp
Look at language-subtag-registry (the diff from the old one)
You see that ajp has 2 items added:
Deprecated: 2023-03-17
Preferred-Value: apc
That means that wherever ajp occurs, "apc" should be substituted. However, if you look at apc, it is not new. So the action is to delete ajp in those files where it occurs. Search the directory supplemental. You find:
languageGroup.xml
93: <languageGroup parent="sem">aao abh acm acq acy aeb aec agj aii ajp akk am …</languageGroup>
languageInfo.xml
170: <languageMatch desired="ajp" supported="ar" distance="10" oneway="true"/> <!-- South Levantine Arabic -->
likelySubtags.xml (6 matches)
2,883: <likelySubtag from="ajp" to="ajp_Arab_JO" origin="sil1"/> <!-- South Levantine Arabic ➡︎ South Levantine Arabic (Arabic, Jordan) -->
4,461: <likelySubtag from="gra" to="gra_Deva_IN" origin="sil1"/> <!-- Rajput Garasia ➡︎ Rajput Garasia (Devanagari, India) -->
4,462: <likelySubtag from="gra_Gujr" to="gra_Gujr_IN" origin="sil1"/> <!-- Rajput Garasia (Gujarati) ➡︎ Rajput Garasia (Gujarati, India) -->
In languageGroup.xml: If 'apc' didn't exist in that file, you would replace 'ajp' with 'apc'. Since it does, you just delete 'ajp' (leaving the rest of the line alone).
93: <languageGroup parent="sem">aao abh acm acq acy aeb aec agj aii akk am …</languageGroup>
Same in languageInfo.xml and likelySubtags.xml. 'apc' exists in each, so just delete the lines.
Suppose ajp were in supplementalData in the territory information (it isn't, so this is just an illustration!)
<territory type="PS" gdp="21220000000" literacyPercent="95.3" population="4818260"> <!--Palestinian Territories-->
<languagePopulation type="ar" populationPercent="100" officialStatus="official"/> <!--Arabic-->
<languagePopulation type="apc" populationPercent="87" references="R1173"/> <!--Levantine Arabic-->
<languagePopulation type="ajp" populationPercent="2" references="..."/> <!--South Levantine Arabic-->
In that case you would combine the two figures to get:
<languagePopulation type="apc" populationPercent="89" references="R1173"/> <!--Levantine Arabic-->
Use your judgment: sometimes language counts are doubled for bilingual speakers, so if it adds to a crazy amount, don't add it. (These figures are 'best available', so that's ok.)
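As a rough illustration of the combining rule above, here is a minimal sketch. The class and method names are hypothetical, and treating anything over 100% as the "crazy amount" is an assumption made for the example; the real decision is still a judgment call.

```java
final class PopulationMerge {
    /**
     * Adds the deprecated code's percentage to the preferred code's percentage.
     * Assumption for this sketch: if the total would exceed 100, the figures
     * probably double-count bilingual speakers, so the preferred figure is kept.
     */
    static double mergePercent(double preferredPercent, double deprecatedPercent) {
        double sum = preferredPercent + deprecatedPercent;
        return sum <= 100.0 ? sum : preferredPercent;
    }

    public static void main(String[] args) {
        System.out.println(mergePercent(87, 2)); // 89.0 (the apc + ajp illustration)
        System.out.println(mergePercent(99, 5)); // 99.0 (sum would exceed 100, so not added)
    }
}
```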
Case 2, no preferred
In this case, just drop the lines.
Case 3, split
Subtag: ksa
Description: Shuwa-Zamani
Added: 2009-07-29
Deprecated: 2023-03-17
Comments: see izm, rsw
Look at iso-639-3_Retirements.tab for ksa
You'll see "Split into [rsw] Rishiwa and [izm] Kizamani"
Take the first one, and treat this case like Case 1.
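The three cases could be told apart mechanically from a language-subtag-registry record, roughly as sketched below. This is not part of the CLDR tooling: the class, enum, and method names are hypothetical, and reading "Comments: see ..." as the marker of a split is an assumption drawn from the single ksa example, so it should still be checked against the iso-639-3_Retirements.tab diff.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class DeprecationCase {
    enum Kind { NOT_DEPRECATED, REPLACED, NO_PREFERRED, SPLIT }

    /** Parses "Key: value" lines of one registry record into a map. */
    static Map<String, String> parseRecord(String record) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String line : record.split("\\R")) {
            int colon = line.indexOf(':');
            if (colon > 0) {
                fields.put(line.substring(0, colon).trim(), line.substring(colon + 1).trim());
            }
        }
        return fields;
    }

    /** Case 1 = REPLACED, Case 2 = NO_PREFERRED, Case 3 = SPLIT. */
    static Kind classify(Map<String, String> fields) {
        if (!fields.containsKey("Deprecated")) {
            return Kind.NOT_DEPRECATED;
        }
        if (fields.containsKey("Preferred-Value")) {
            return Kind.REPLACED;
        }
        // Assumption: a "see x, y" comment signals a split into several codes.
        return fields.getOrDefault("Comments", "").startsWith("see ")
                ? Kind.SPLIT
                : Kind.NO_PREFERRED;
    }

    /** Returns the codes listed after "see" in the Comments field, in order. */
    static List<String> splitCandidates(Map<String, String> fields) {
        List<String> codes = new ArrayList<>();
        String comments = fields.getOrDefault("Comments", "");
        if (comments.startsWith("see ")) {
            for (String code : comments.substring(4).split(",")) {
                codes.add(code.trim());
            }
        }
        return codes;
    }

    public static void main(String[] args) {
        Map<String, String> ksa = parseRecord(
                "Subtag: ksa\nDescription: Shuwa-Zamani\nAdded: 2009-07-29\n"
                        + "Deprecated: 2023-03-17\nComments: see izm, rsw");
        System.out.println(classify(ksa) + " " + splitCandidates(ksa));
    }
}
```

The sketch only lists the split candidates; which one to use as the replacement is left to the person doing the update.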
@macchiati I've started to follow your directions for "ajp", ...
likelySubtags.xml says "Likely subtags data is generated programatically from CLDR's language/territory/population data using the GenerateMaximalLocales tool. Under normal circumstances, this file should not be patched by hand, as any changes made in that fashion may be lost."
So I tried to run GenerateMaximalLocales and got "IllegalArgumentException: Don't run this tool until it is fixed":
public static void main(String[] args) throws IOException {
    if (true) {
        throw new IllegalArgumentException("Don't run this tool until it is fixed");
    }
So I'll try hand-editing likelySubtags.xml anyway...
Right, we disabled the tool for now. It should be easy to regex-search for (ajp|...) to find all the lines, although you want to look at each one rather than automatically deleting.
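A minimal sketch of that search, for illustration only: it lists candidate lines for manual review and deletes nothing. The class and method names are hypothetical. The lookarounds are a design choice made here so that a code such as "ajp" is not matched inside a word like "Rajput", while still being found in "ajp_Arab_JO".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

final class DeprecatedCodeSearch {
    /** Returns "lineNumber: text" for each line mentioning one of the codes. */
    static List<String> findCandidates(List<String> lines, List<String> codes) {
        // Reject matches that have a letter directly before or after the code.
        Pattern pattern = Pattern.compile(
                "(?<![A-Za-z])(" + String.join("|", codes) + ")(?![A-Za-z])");
        List<String> hits = new ArrayList<>();
        for (int i = 0; i < lines.size(); i++) {
            if (pattern.matcher(lines.get(i)).find()) {
                hits.add((i + 1) + ": " + lines.get(i));
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
                "<likelySubtag from=\"ajp\" to=\"ajp_Arab_JO\" origin=\"sil1\"/>",
                "<likelySubtag from=\"gra\" to=\"gra_Deva_IN\" origin=\"sil1\"/> <!-- Rajput Garasia -->",
                "<languageMatch desired=\"ajp\" supported=\"ar\" distance=\"10\" oneway=\"true\"/>");
        // Lists the first and third sample lines; the "Rajput Garasia" line is skipped.
        findCandidates(lines, List.of("ajp", "ksa")).forEach(System.out::println);
    }
}
```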
@macchiati FYI you wrote that iso-639-3_Retirements.tab says "Split into [rsw] Rishiwa and [izm] Kizamani" but the version I'm seeing (in the branch for this ticket) doesn't say anything like that -- because that file is changed in this PR! So I need to look at the version of that file before this PR. Just something to be aware of when we update the instructions...
Right. What I do is look at the diffs in the PR.
BTW, as you go through this, please jot down in a doc or text file what you are doing, so that we can use that as a basis for updating the instructions.
@macchiati these two files disagree on the replacement for prp (Parsi), whether to change to gu or guj:
language-subtag-registry:
Type: language
Subtag: prp
Description: Parsi
Added: 2009-07-29
Deprecated: 2023-03-17
Preferred-Value: gu
iso-639-3_Retirements.tab: prp Parsi M guj 2023-01-20
Since your comments mainly refer to language-subtag-registry I'm guessing "gu", but it's just a wild guess so please confirm or correct!
Actually likelySubtags.xml already has
<likelySubtag from="gu" to="gu_Gujr_IN"/>
<!--{ Gujarati; ?; ? } => { Gujarati; Gujarati; India }-->
So I'm just deleting the prp line from that file
I think it should be "gu". "guj" is the ISO 639-3 equivalent of "gu". The ISO 639-1 (two-letter) code is preferred if it exists.
gu is the right choice. (guj is the 3-letter code, but BCP 47 uses the 2-letter code whenever one exists.)
Another disagreement, for szd -- replace with uki or umi?
iso-639-3_Retirements.tab: szd Seru M uki 2023-01-20
language-subtag-registry:
Type: language
Subtag: szd
Description: Seru
Added: 2009-07-29
Deprecated: 2023-03-17
Preferred-Value: umi
likelySubtags.xml has only umi, not uki; I'm just deleting szd from that file
When in doubt, go by the language subtag registry
I made another commit. Locally there's a new set of errors, which I'll work on next:
TestLstrConsistency {
Error: (TestValidity.java:537) Missing aliases for supplementalMetadata: 10
<languageAlias type="ajp" replacement="apc" reason="deprecated"/> <!-- South Levantine Arabic ⇒ Levantine Arabic -->
<languageAlias type="kgm" replacement="plu" reason="deprecated"/> <!-- Karipúna ⇒ Palikúr -->
<languageAlias type="nom" replacement="cbr" reason="deprecated"/> <!-- Nocamán ⇒ Cashibo-Cacataibo -->
<languageAlias type="pmk" replacement="crr" reason="deprecated"/> <!-- Pamlico ⇒ Carolina Algonquian -->
<languageAlias type="prp" replacement="gu" reason="deprecated"/> <!-- Parsi ⇒ Gujarati -->
<languageAlias type="szd" replacement="umi" reason="deprecated"/> <!-- Seru ⇒ Ukit -->
<languageAlias type="tmk" replacement="tdg" reason="deprecated"/> <!-- Northwestern Tamang ⇒ Western Tamang -->
<languageAlias type="tpw" replacement="tpn" reason="deprecated"/> <!-- Tupí ⇒ Tupinambá -->
<languageAlias type="xss" replacement="zko" reason="deprecated"/> <!-- Assan ⇒ Kott -->
<languageAlias type="zkb" replacement="kjh" reason="deprecated"/> <!-- Koibal ⇒ Khakas -->
Good of it to tell you exactly which lines to add!
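For the notes, the idea behind that message can be sketched as follows: every deprecated code with a Preferred-Value in the registry should have a matching languageAlias. This is only an illustration and not how TestLstrConsistency is actually implemented; the class name, method name, and both input collections are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

final class MissingAliases {
    /**
     * Builds a languageAlias line for every deprecated code that has a
     * preferred value but no existing alias. Output is sorted by code.
     */
    static List<String> missingAliasLines(
            Map<String, String> preferredByDeprecated, Set<String> existingAliasTypes) {
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, String> entry : new TreeMap<>(preferredByDeprecated).entrySet()) {
            if (!existingAliasTypes.contains(entry.getKey())) {
                lines.add(String.format(
                        "<languageAlias type=\"%s\" replacement=\"%s\" reason=\"deprecated\"/>",
                        entry.getKey(), entry.getValue()));
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        // Sample pairs taken from the error listing; the existing-alias set is made up.
        Map<String, String> registry = Map.of("ajp", "apc", "prp", "gu", "szd", "umi");
        missingAliasLines(registry, Set.of("ajp")).forEach(System.out::println);
    }
}
```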
Good of it to tell you exactly which lines to add!
Add where?
supplementalMetadata.xml?
Tests are passing!