cldr CLDR-17115 Update languages/codes

-Numerous changes based on following instructions in Update Language/Script/Region Subtags

-Error summary (running locally):

CLDR/TestLocale/testLanguageTagParserIsValid
CLDR/TestSupplementalInfo/TestMacrolanguages
CLDR/TestValidity/TestCompatibility
CLDR/TestValidity/TestLstrConsistency

CLDR-17115

[ ] This PR completes the ticket.

ALLOW_MANY_COMMITS=true

Feb 28 '24 15:02 btangmu

@macchiati as you requested, I reverted scripts.xml in the last commit

Feb 28 '24 18:02 btangmu

Per discussion, I've reverted iso_3166_status.txt to main

Mar 13 '24 17:03 btangmu

It looks like there are a surprising number of errors. I think it is best for me to walk you through this, and you can capture these notes in the instructions.

It appears that ISO had an unexpected number of deprecations, so you're seeing more issues than we normally see.

For lines like the following:

    Error: (TestLocale.java:921) Error: : ajp: expected "", got "Disallowed language=ajp, status=deprecated"
    Error: (TestLocale.java:927) Error: : ajp_Arab_JO: expected "", got "Disallowed language=ajp, status=deprecated"

What is happening is that likelySubtags.xml is handling languages that are now deprecated. That is to be expected, because ISO does that occasionally, but because we added a lot of SIL language data, the number may be larger each year. To fix that, go to the file and delete the line where the language is handled, in this case the line for ajp.

TestMacrolanguages

    Error: (TestSupplementalInfo.java:1328) Error: Macrolanguage sa Sanskrit Historical

It looks like the classification changed in ISO. We still use 'sa', because the Indian government disagrees that it is only historical! Add 'sa' to the special cases in:

    if (language.equals("no") || language.equals("sh")) continue; // special cases

TestCompatibility

    Error: (TestValidity.java:284) Error: language:dzd:deprecated => regular // add to exception list (ALLOWED_UNDELETIONS) if really un-deprecated

Check the diff in the iso-639 files to verify that dzd is really de-deprecated. Then add dzd to ALLOWED_UNDELETIONS.

The "ERROR:" values below in the listing all look like keyboard stuff; I don't think those are counted. I'll file a ticket for Steven to clean those up.

Mar 13 '24 18:03 macchiati

Good work. That verifies that it is indeed an intentional change.

On Thu, Mar 14, 2024 at 8:26 AM Tom Bishop @.***> wrote:

@.**** commented on this pull request.

In common/validity/language.xml https://github.com/unicode-org/cldr/pull/3538#discussion_r1525078910:
		baz bbz bcc bcl bgm bh bhk bic bij bjd bjq bkb blg bmy bpb btb btl bxk bxr bxx byy
  	cbe cbh cca ccq cdg cjr cka cld cmk cmn cnr coy cqu cug cum cwd
-	daf dap dgo dgu dha dhd dik diq dit djl dkl drh drr drw dud duj dwl dzd
+	daf dap dgo dgu dha dhd dik diq dit djl dkl drh drr drw dud duj dwl
"dzd" was removed here
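
The same check can be done mechanically. This is a minimal sketch, assuming the two lines above are the before and after versions of the deprecated list in language.xml; the function name `undeprecated` is made up for illustration and is not part of the CLDR tooling.

```python
def undeprecated(old_deprecated: str, new_deprecated: str) -> set:
    """Return codes that were in the old deprecated list but are not in the new one."""
    return set(old_deprecated.split()) - set(new_deprecated.split())


# The two lines quoted from the diff of common/validity/language.xml.
old_line = "daf dap dgo dgu dha dhd dik diq dit djl dkl drh drr drw dud duj dwl dzd"
new_line = "daf dap dgo dgu dha dhd dik diq dit djl dkl drh drr drw dud duj dwl"

print(undeprecated(old_line, new_line))  # {'dzd'}
```

Any code this reports still needs to be confirmed against the registry before it goes into ALLOWED_UNDELETIONS.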

Web search for "dzd deprecated" turns up this file:

https://www.iana.org/assignments/lang-subtags-templates/dzd-2023-03-17.txt

which reads as follows:

FOR ARCHIVING: Registration form for 'dzd'

LANGUAGE SUBTAG REGISTRATION FORM

1. Name of requester: Doug Ewell

2. E-mail address of requester: doug at ewellic.org

3. Record Requested:

Type: language
Subtag: dzd
Description: Daza

4. Intended meaning of the subtag:

5. Reference to published description of the language (book or article):

6. Any other relevant information:

This registration tracks a change made to ISO 639-3 effective 2023-01-20, adding the code element 'dzd' for Daza, which had been retired in 2015 as non-existent. The net effect of this registration is to remove the Deprecated value from this record.

For more information on the ISO 639-3 change, refer to: https://iso639-3.sil.org/request/2022-027

Mar 14 '24 16:03 macchiati

@macchiati my latest commit fixes the problems with sa (Sanskrit) and dzd (Daza). It does not fix the problems with ajp and others in this output:

    testLanguageTagParserIsValid {
      Error: (TestLocale.java:921) : ajp: expected "", got "Disallowed language=ajp, status=deprecated"
      Error: (TestLocale.java:927) : ajp_Arab_JO: expected "", got "Disallowed language=ajp, status=deprecated"
      Error: (TestLocale.java:921) : kgm: expected "", got "Disallowed language=kgm, status=deprecated"
      Error: (TestLocale.java:927) : kgm_Latn_BR: expected "", got "Disallowed language=kgm, status=deprecated"
      Error: (TestLocale.java:921) : ksa: expected "", got "Disallowed language=ksa, status=deprecated"
      Error: (TestLocale.java:927) : ksa_Latn_NG: expected "", got "Disallowed language=ksa, status=deprecated"
      Error: (TestLocale.java:921) : nom: expected "", got "Disallowed language=nom, status=deprecated"
      Error: (TestLocale.java:927) : nom_Latn_PE: expected "", got "Disallowed language=nom, status=deprecated"
      Error: (TestLocale.java:921) : plj: expected "", got "Disallowed language=plj, status=deprecated"
      Error: (TestLocale.java:927) : plj_Latn_NG: expected "", got "Disallowed language=plj, status=deprecated"
      Error: (TestLocale.java:921) : prp: expected "", got "Disallowed language=prp, status=deprecated"
      Error: (TestLocale.java:927) : prp_Gujr_IN: expected "", got "Disallowed language=prp, status=deprecated"
      Error: (TestLocale.java:921) : slq: expected "", got "Disallowed language=slq, status=deprecated"
      Error: (TestLocale.java:927) : slq_Arab_IR: expected "", got "Disallowed language=slq, status=deprecated"
      Error: (TestLocale.java:921) : szd: expected "", got "Disallowed language=szd, status=deprecated"
      Error: (TestLocale.java:927) : szd_Latn_MY: expected "", got "Disallowed language=szd, status=deprecated"
      Error: (TestLocale.java:921) : tmk: expected "", got "Disallowed language=tmk, status=deprecated"
      Error: (TestLocale.java:927) : tmk_Deva_NP: expected "", got "Disallowed language=tmk, status=deprecated"
      Error: (TestLocale.java:921) : xss: expected "", got "Disallowed language=xss, status=deprecated"
      Error: (TestLocale.java:927) : xss_Cyrl_RU: expected "", got "Disallowed language=xss, status=deprecated"
      Error: (TestLocale.java:921) : zkb: expected "", got "Disallowed language=zkb, status=deprecated"
      Error: (TestLocale.java:927) : zkb_Cyrl_RU: expected "", got "Disallowed language=zkb, status=deprecated"
      Error: (TestLocale.java:921) : zua: expected "", got "Disallowed language=zua, status=deprecated"
      Error: (TestLocale.java:927) : zua_Latn_NG: expected "", got "Disallowed language=zua, status=deprecated"

You addressed these errors in your last comment, but I still don't understand; they're different from the "sa" error.

"ajp" occurs in languageGroup.xml, languageInfo.xml, and likelySubtags.xml. Should it be deleted from languageInfo.xml, and/or likelySubtags.xml, and then should languageGroup.xml be regenerated?

Mar 15 '24 14:03 btangmu

Here is what to do in more detail.

Case 1, replaced by old:

Take ajp

Look at language-subtag-registry (the diff from the old one)

You see that ajp has 2 items added:

Deprecated: 2023-03-17
Preferred-Value: apc

That means that wherever ajp occurs, "apc" should be substituted. However, if you look at apc, it is not new. So the action is to delete ajp in those files where it occurs. Search the directory supplemental. You find:

languageGroup.xml
93: <languageGroup parent="sem">aao abh acm acq acy aeb aec agj aii ajp akk am …</languageGroup> 

languageInfo.xml
170: <languageMatch desired="ajp" supported="ar" distance="10" oneway="true"/> <!-- South Levantine Arabic --> 

likelySubtags.xml (6 matches)
2,883: <likelySubtag from="ajp" to="ajp_Arab_JO" origin="sil1"/> <!-- South Levantine Arabic ➡︎ South Levantine Arabic (Arabic, Jordan) --> 
4,461: <likelySubtag from="gra" to="gra_Deva_IN" origin="sil1"/> <!-- Rajput Garasia ➡︎ Rajput Garasia (Devanagari, India) --> 
4,462: <likelySubtag from="gra_Gujr" to="gra_Gujr_IN" origin="sil1"/> <!-- Rajput Garasia (Gujarati) ➡︎ Rajput Garasia (Gujarati, India) -->

In languageGroup: if 'apc' didn't exist in that file, you would replace 'ajp' with it. Since it does, you just delete 'ajp' (leaving the rest of the line alone).

93: <languageGroup parent="sem">aao abh acm acq acy aeb aec agj aii akk am …</languageGroup>

Same in languageInfo.xml and likelySubtags.xml. 'apc' exists in each, so just delete the lines.
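
The replace-or-delete rule for a languageGroup line can be sketched as follows. This is only an illustration: `retire_code` is a hypothetical helper, not an existing CLDR tool, and the member lists in the usage lines are shortened made-up examples rather than the real content of line 93.

```python
def retire_code(members: str, deprecated: str, preferred: str) -> str:
    """Apply the Case 1 rule to a space-separated list of language codes.

    If the preferred code is already present, drop the deprecated one;
    otherwise substitute the preferred code in its place.
    """
    codes = members.split()
    if deprecated not in codes:
        return members  # nothing to do
    if preferred in codes:
        codes.remove(deprecated)
    else:
        # In-place substitution; the list may need re-sorting afterwards.
        codes[codes.index(deprecated)] = preferred
    return " ".join(codes)


print(retire_code("aao abh ajp akk am apc", "ajp", "apc"))  # aao abh akk am apc
print(retire_code("aao abh ajp akk am", "ajp", "apc"))      # aao abh apc akk am
```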

Suppose it were in supplementalData in the territory information (it isn't, so this is just an illustration!)

<territory type="PS" gdp="21220000000" literacyPercent="95.3" population="4818260">	<!--Palestinian Territories-->
  <languagePopulation type="ar" populationPercent="100" officialStatus="official"/>	<!--Arabic-->
  <languagePopulation type="apc" populationPercent="87" references="R1173"/>	<!--Levantine Arabic-->
  <languagePopulation type="ajp" populationPercent="2" references="..."/>	<!--South Levantine Arabic-->

In that case you would combine the two figures to get:

  <languagePopulation type="apc" populationPercent="89" references="R1173"/>	<!--Levantine Arabic-->

Use your judgment: sometimes language counts are doubled for bilingual speakers, so if it adds to a crazy amount, don't add it. (These figures are 'best available', so that's ok.)
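
As a rough sketch of that arithmetic: the helper below adds the two percentages and falls back to the existing figure when the sum looks implausible. The name `combine_percent` and the 100% cutoff are assumptions for illustration; the real decision is a judgment call, as noted above.

```python
def combine_percent(preferred_pct: float, deprecated_pct: float, limit: float = 100.0) -> float:
    """Add the deprecated language's share to the preferred language's share.

    If the sum exceeds `limit` (an assumed sanity threshold), keep the
    preferred language's original figure instead of adding.
    """
    total = preferred_pct + deprecated_pct
    return total if total <= limit else preferred_pct


print(combine_percent(87, 2))   # 89, the apc + ajp illustration
print(combine_percent(87, 30))  # 87, the sum would exceed 100 so it is not added
```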

Case 2, no preferred

In this case, just drop the lines.

Case 3, split

Subtag: ksa
Description: Shuwa-Zamani
Added: 2009-07-29
Deprecated: 2023-03-17
Comments: see izm, rsw

Look at iso-639-3_Retirements.tab for ksa

You'll see "Split into [rsw] Rishiwa and [izm] Kizamani"

Take the first one, and treat this case like Case 1.
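
Picking "the first one" out of a split remedy can be sketched like this, assuming the remedy text always puts the codes in square brackets as in the ksa example. `first_split_target` is a hypothetical helper name.

```python
import re
from typing import Optional


def first_split_target(remedy: str) -> Optional[str]:
    """Return the first bracketed language code in a retirement remedy string."""
    codes = re.findall(r"\[([a-z]{2,3})\]", remedy)
    return codes[0] if codes else None


print(first_split_target("Split into [rsw] Rishiwa and [izm] Kizamani"))  # rsw
```

The result (rsw for ksa) is then handled like a Preferred-Value in Case 1.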

Mar 16 '24 14:03 macchiati

@macchiati I've started to follow your directions for "ajp", ...

likelySubtags.xml says "Likely subtags data is generated programatically from CLDR's language/territory/population data using the GenerateMaximalLocales tool. Under normal circumstances, this file should not be patched by hand, as any changes made in that fashion may be lost."

So I tried to run GenerateMaximalLocales and got "IllegalArgumentException: Don't run this tool until it is fixed":

    public static void main(String[] args) throws IOException {
        if (true) {
            throw new IllegalArgumentException("Don't run this tool until it is fixed");
        }

So I'll try hand-editing likelySubtags.xml anyway...

Mar 19 '24 14:03 btangmu

Right, we disabled the tool for now. It should be easy to regex-search for (ajp|...) to find all the lines, although you want to look at each one rather than automatically deleting.
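
One way to do that search while flagging lines that need a closer look is sketched below. A plain substring search for ajp also hits "Rajput" in the gra comments quoted earlier, so the sketch marks whether the code appears as a standalone subtag. The function name and the sample lines are illustrative assumptions, not existing tooling.

```python
import re


def find_candidates(lines, codes):
    """List (line number, is_standalone, text) for every line containing a code.

    is_standalone is False when the code only appears inside a longer word,
    for example "ajp" inside "Rajput".
    """
    alternation = "|".join(map(re.escape, codes))
    loose = re.compile(alternation)
    strict = re.compile(r"(?<![A-Za-z])(?:%s)(?![A-Za-z])" % alternation)
    hits = []
    for number, line in enumerate(lines, start=1):
        if loose.search(line):
            hits.append((number, bool(strict.search(line)), line.strip()))
    return hits


sample = [
    '<likelySubtag from="ajp" to="ajp_Arab_JO" origin="sil1"/>',
    '<likelySubtag from="gra" to="gra_Deva_IN" origin="sil1"/> <!-- Rajput Garasia -->',
    '<likelySubtag from="gu" to="gu_Gujr_IN"/>',
]
for number, standalone, text in find_candidates(sample, ["ajp"]):
    print(number, "review" if standalone else "probably a false positive", text)
```

The script only lists candidates; each line is still reviewed and edited by hand.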

Mar 19 '24 15:03 macchiati

@macchiati FYI you wrote that iso-639-3_Retirements.tab says "Split into [rsw] Rishiwa and [izm] Kizamani" but the version I'm seeing (in the branch for this ticket) doesn't say anything like that -- because that file is changed in this PR! So I need to look at the version of that file before this PR. Just something to be aware of when we update the instructions...

Mar 19 '24 15:03 btangmu

Right. What I do is look at the diffs in the PR.

BTW, as you go through this, please jot down in a doc or text file what you are doing, so that we can use that as a basis for updating the instructions.

Mar 19 '24 15:03 macchiati

@macchiati these two files disagree on the replacement for prp (Parsi), whether to change to gu or guj:

language-subtag-registry:

Type: language
Subtag: prp
Description: Parsi
Added: 2009-07-29
Deprecated: 2023-03-17
Preferred-Value: gu

iso-639-3_Retirements.tab: prp Parsi M guj 2023-01-20

Since your comments mainly refer to language-subtag-registry I'm guessing "gu", but it's just a wild guess so please confirm or correct!

Actually likelySubtags.xml already has

		<likelySubtag from="gu" to="gu_Gujr_IN"/>
		<!--{ Gujarati; ?; ? } => { Gujarati; Gujarati; India }-->

So I'm just deleting the prp line from that file

Mar 19 '24 16:03 btangmu

I think it should be "gu". "guj" is the ISO 639-3 equivalent of "gu". The ISO 639-1 (two-letter) code is preferred if it exists.

Mar 19 '24 17:03 DavidLRowe

gu is the right choice. (guj is the 3-letter code, but BCP 47 uses the 2-letter code whenever it exists.)
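
That preference can be sketched as a simple lookup. The table below is a hypothetical one-entry excerpt; only the guj/gu pair comes from this thread, and a real check would use the full registry.

```python
# Hypothetical excerpt of a 3-letter to 2-letter mapping; only this entry is from the thread.
ISO_639_3_TO_1 = {"guj": "gu"}


def bcp47_language(code: str) -> str:
    """Return the 2-letter code when one exists, otherwise the code unchanged."""
    return ISO_639_3_TO_1.get(code, code)


print(bcp47_language("guj"))  # gu
print(bcp47_language("umi"))  # umi (no 2-letter code in this excerpt)
```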

Mar 19 '24 17:03 macchiati

Another disagreement, for szd -- replace with uki or umi?

iso-639-3_Retirements.tab: szd Seru M uki 2023-01-20

language-subtag-registry:

Type: language
Subtag: szd
Description: Seru
Added: 2009-07-29
Deprecated: 2023-03-17
Preferred-Value: umi

likelySubtags.xml has only umi, not uki; I'm just deleting szd from that file

Mar 19 '24 17:03 btangmu

When in doubt, go by the language subtag registry

Mar 19 '24 17:03 macchiati

I made another commit. Locally there's a new set of errors, which I'll work on next:

    TestLstrConsistency {
      Error: (TestValidity.java:537) Missing aliases for supplementalMetadata: 10
<languageAlias type="ajp" replacement="apc" reason="deprecated"/> <!-- South Levantine Arabic ⇒ Levantine Arabic -->
<languageAlias type="kgm" replacement="plu" reason="deprecated"/> <!-- Karipúna ⇒ Palikúr -->
<languageAlias type="nom" replacement="cbr" reason="deprecated"/> <!-- Nocamán ⇒ Cashibo-Cacataibo -->
<languageAlias type="pmk" replacement="crr" reason="deprecated"/> <!-- Pamlico ⇒ Carolina Algonquian -->
<languageAlias type="prp" replacement="gu" reason="deprecated"/> <!-- Parsi ⇒ Gujarati -->
<languageAlias type="szd" replacement="umi" reason="deprecated"/> <!-- Seru ⇒ Ukit -->
<languageAlias type="tmk" replacement="tdg" reason="deprecated"/> <!-- Northwestern Tamang ⇒ Western Tamang -->
<languageAlias type="tpw" replacement="tpn" reason="deprecated"/> <!-- Tupí ⇒ Tupinambá -->
<languageAlias type="xss" replacement="zko" reason="deprecated"/> <!-- Assan ⇒ Kott -->
<languageAlias type="zkb" replacement="kjh" reason="deprecated"/> <!-- Koibal ⇒ Khakas -->
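
The alias lines above line up with the Deprecated and Preferred-Value fields in the registry records. The sketch below shows that relationship for a single record. It is a simplification: `parse_record` and `language_alias` are hypothetical helpers, and real registry records can have continuation lines and repeated fields that this does not handle.

```python
def parse_record(text: str) -> dict:
    """Parse one 'Key: value' registry record into a dict (last value wins)."""
    record = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(":")
        record[key.strip()] = value.strip()
    return record


def language_alias(record: dict):
    """Build a languageAlias element for a deprecated subtag with a preferred value."""
    if "Deprecated" not in record or "Preferred-Value" not in record:
        return None
    return '<languageAlias type="%s" replacement="%s" reason="deprecated"/>' % (
        record["Subtag"],
        record["Preferred-Value"],
    )


prp = parse_record("""
Type: language
Subtag: prp
Description: Parsi
Added: 2009-07-29
Deprecated: 2023-03-17
Preferred-Value: gu
""")
print(language_alias(prp))  # <languageAlias type="prp" replacement="gu" reason="deprecated"/>
```

A record with no Preferred-Value, such as ksa, yields None here, which is consistent with ksa not appearing in the list of missing aliases.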

Mar 19 '24 17:03 btangmu

Good of it to tell you exactly which lines to add!

Mar 19 '24 17:03 macchiati

Good of it to tell you exactly which lines to add!

Add where?

supplementalMetadata.xml?

Mar 19 '24 18:03 btangmu

Tests are passing!

Mar 19 '24 18:03 btangmu