Various Unicode BCP 47 locale identifiers issues

Open anba opened this issue 5 years ago • 11 comments

6.2 Language Tags

  • The link to UTS 35 should use https instead of http.
  • "[...] identifies locales using language tags as defined by IETF BCP 47 (RFCs 5646 and 4647 or their successors)" should be changed to refer to Unicode BCP 47 locale identifiers exclusively.
  • "Unicode BCP 47 Locale Identifiers that meet those validity criteria of Unicode Technical Standard 35, section 3.2 [....]" needs to be reworded, because "validity" can now be misunderstood to mean "validity" as specified in UTS 35 (cf. the "Validity / Comments" column in UTS 35).
    • IIUC "structurally valid" in ECMA-402 maps to "syntactically well-formed" in UTS 35.
  • "[...] without reference to the IANA Language Subtag Registry" may no longer be needed. Alternatively, it should say that ECMA-402 considers those language tags valid which match the syntax of Unicode BCP 47 locale identifiers, but that validating them against the Unicode validation data is not required. (For example, "aaj" is a valid language tag in ECMA-402 even though "aaj" is not included in https://unicode.org/repos/cldr/tags/latest/common/validity/language.xml.)
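
The difference between "matches the syntax" and "appears in the validation data" can be shown with a minimal sketch. It assumes the unicode_language_subtag syntax from UTS 35 (two to three, or five to eight, ASCII letters). The names `isWellFormedLanguageSubtag`, `isKnownLanguageSubtag`, and `KNOWN_LANGUAGES` are hypothetical, and the list is a tiny stand-in for the CLDR validity data, not real data.

```typescript
// Syntax of unicode_language_subtag in UTS 35: 2-3 or 5-8 ASCII letters.
const LANGUAGE_SUBTAG = /^(?:[a-z]{2,3}|[a-z]{5,8})$/i;

// Hypothetical, deliberately tiny stand-in for CLDR's validity/language.xml.
const KNOWN_LANGUAGES: string[] = ["de", "en", "ja", "sl"];

// Structural check only: no registry or validity data is consulted.
function isWellFormedLanguageSubtag(subtag: string): boolean {
  return LANGUAGE_SUBTAG.test(subtag);
}

// Stricter check against the stand-in list, which the bullet above
// suggests ECMA-402 should not require.
function isKnownLanguageSubtag(subtag: string): boolean {
  return (
    isWellFormedLanguageSubtag(subtag) &&
    KNOWN_LANGUAGES.indexOf(subtag.toLowerCase()) !== -1
  );
}
```

Under this reading, "aaj" passes the first check and fails only the second.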

6.2.1 Unicode Locale Extension Sequences


6.2.2 IsStructurallyValidLanguageTag

  • The next revision of UTS 35 will remove the ABNF grammar, so IsStructurallyValidLanguageTag will need to refer to the EBNF grammar. Ref: http://unicode.org/repos/cldr/trunk/specs/ldml/tr35.html#Unicode_language_identifier
  • It should be clarified whether or not "does not include duplicate variant subtags" also applies to variant subtags in tlang. For example is "en-t-en-emodeng-emodeng" valid or not?
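
To make the tlang question concrete, here is a rough sketch that pulls the variant subtags out of the tlang part of a "-t-" extension and checks them for repeats. It assumes a well-formed tag without a private-use section and uses the subtag syntax from UTS 35; `tlangVariants` and `hasDuplicates` are hypothetical helper names, not spec operations.

```typescript
// Variant syntax per UTS 35: 5-8 alphanumerics, or a digit plus 3 alphanumerics.
const TLANG_VARIANT = /^(?:[a-z0-9]{5,8}|[0-9][a-z0-9]{3})$/;
const TLANG_LANGUAGE = /^(?:[a-z]{2,3}|[a-z]{5,8})$/;
const TFIELD_KEY = /^[a-z][0-9]$/;

// Returns the variant subtags of the tlang inside a "-t-" extension, if any.
function tlangVariants(tag: string): string[] {
  const subtags = tag.toLowerCase().split("-");
  const variants: string[] = [];
  const start = subtags.indexOf("t", 1);
  if (start === -1) return variants;
  // A tlang, when present, begins with a language subtag right after "t".
  const language = subtags[start + 1];
  if (language === undefined || !TLANG_LANGUAGE.test(language)) return variants;
  for (let i = start + 2; i < subtags.length; i++) {
    const subtag = subtags[i];
    // A tfield key or another singleton ends the tlang.
    if (subtag.length === 1 || TFIELD_KEY.test(subtag)) break;
    // Script and region subtags do not match the variant syntax and are skipped.
    if (TLANG_VARIANT.test(subtag)) variants.push(subtag);
  }
  return variants;
}

function hasDuplicates(items: string[]): boolean {
  return items.some((item, index) => items.indexOf(item) !== index);
}
```

For "en-t-en-emodeng-emodeng" the helper finds "emodeng" twice, so a rule that extends the duplicate-variant restriction to tlang would reject the tag.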

6.2.3 CanonicalizeLanguageTag

  • The next revision of UTS 35 adds definitions for "canonical syntax" and "canonical form" of Unicode locale identifiers. Does it make sense to switch to using them? (Note: rev34 also has "canonical form", which is only a subset of "canonical syntax" from rev35.)
    • "BCP 47 Language Tag to Unicode BCP 47 Locale Identifier" in rev53 and rev54 does not replace variant subtags, which were replaced before the switch to Unicode BCP 47 locale ids. For example, IETF BCP 47 canonicalises "hy-arevmda" to "hyw", whereas "BCP 47 Language Tag to Unicode BCP 47 Locale Identifier" doesn't touch any variant subtags. Canonical Unicode locale identifiers in rev35 will support this canonicalisation. (But "ja-Latn-hepburn-heploc" is still not canonicalised to "ja-Latn-alalc97"; instead "ja-Latn-hepburn-alalc97" is used. Not sure if this is a bug or an unsupported canonicalisation mode in CLDR.)
    • "canonical syntax" reorders variant subtags in alphabetical order, which is not allowed per RFC 5646. For example "sl-rozaj-biske" is reordered to "sl-biske-rozaj" in UTS 35, but this actually invalidates the language tag per IANA, because the required prefix for "biske" is "sl-rozaj".
    • Unfortunately the "canonical form" in UTS 35 rev54 also adds many more canonicalisation requirements and I'm not sure these make sense for ECMA-402 (at least for the moment).
  • Are we required to normalise the case in the tlang extension? For example, should "en-t-en-us" be case-regularised to "en-t-en-US"?
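
The variant reordering mentioned under "canonical syntax" can be sketched as follows. This is only an illustration for simple language-script-region-variants tags, assuming the variant syntax from UTS 35; `sortVariants` is a hypothetical helper, not a spec operation, and it leaves extensions untouched.

```typescript
const VARIANT_SUBTAG = /^(?:[a-z0-9]{5,8}|[0-9][a-z0-9]{3})$/i;

// Sorts the variant subtags of a tag alphabetically and lowercases them,
// leaving the language, script, region, and any extensions in place.
function sortVariants(tag: string): string {
  const subtags = tag.split("-");
  // Skip the language (index 0) and any script/region subtags, but never
  // walk past a singleton into an extension.
  let first = 1;
  while (
    first < subtags.length &&
    subtags[first].length > 1 &&
    !VARIANT_SUBTAG.test(subtags[first])
  ) {
    first++;
  }
  // Collect the contiguous run of variant subtags.
  let end = first;
  while (end < subtags.length && VARIANT_SUBTAG.test(subtags[end])) end++;
  const sorted = subtags
    .slice(first, end)
    .map((subtag) => subtag.toLowerCase())
    .sort();
  return subtags.slice(0, first).concat(sorted, subtags.slice(end)).join("-");
}
```

With this ordering "sl-rozaj-biske" becomes "sl-biske-rozaj", which is exactly the case where "biske" ends up ahead of its registered prefix "sl-rozaj".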

anba avatar Mar 13 '19 16:03 anba

@FrankYFTang

sffc avatar Mar 14 '19 00:03 sffc

@anba Looking at the big picture, all the issues you mentioned seem reasonable to me. I suggest you create a PR based on what you stated above and we can review the wording of the changes together.

FrankYFTang avatar Mar 14 '19 01:03 FrankYFTang

  • [ ] It should be clarified whether or not "does not include duplicate variant subtags" also applies to variant subtags in tlang. For example is "en-t-en-emodeng-emodeng" valid or not?
  • [ ] The next revision of UTS 35 adds definitions for "canonical syntax" and "canonical form" of Unicode locale identifiers. Does it make sense to switch to using them?
  • [ ] Are we required to normalise the case in the tlang extension? For example, should "en-t-en-us" be case-regularised to "en-t-en-US"?

While I can provide changes for most of the other parts, these are questions we should work through in a discussion next Thursday. I don't have an immediate answer for any of them, at least.

I'll have a PR with the other parts, as I already did with the https parts (see #331).

leobalter avatar Mar 14 '19 20:03 leobalter

cc @FrankYFTang @zbraniecki to follow up and reference the public spec once it's published

leobalter avatar Mar 21 '19 16:03 leobalter

@FrankYFTang Have you followed up with Mark about this in the CLDR spec?

sffc avatar Apr 29 '19 17:04 sffc

The duplicated-variants restriction, and whether it ought to apply to tlang, is rearing its head in SpiderMonkey patchwork and reviewing at this point. Allowing tlang to contain duplicate variants, while unicode_language_id cannot contain them, is forcing the addition of an enum class DuplicateVariants { Allow, Reject } to our canonicalize-language-id operation, with corresponding complexity to reject duplicates only when DuplicateVariants::Reject is passed. This seems undesirable.

Either duplicates should be allowed in both productions (but canonicalization should remove all but one of each duplicate variant), or they should be allowed in neither. I don't remember why IsStructurallyValidLanguageTag includes a no-duplicate-variants restriction. Revision history on GitHub doesn't reveal a rationale for the choice.

If the reason for the restriction is sensible and good, I think we ought to apply it everywhere. But if it is questionable in any way, being slightly more liberal about allowing harmlessly-duplicate variants (but removing the duplication during canonicalizing) seems like the right approach.
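
The two uniform alternatives above (reject duplicates everywhere, or allow them everywhere and drop the repeats while canonicalizing) could look roughly like this. It is a sketch only and not SpiderMonkey's code: `DuplicateVariantPolicy` and `applyDuplicatePolicy` are hypothetical names, and the variants are assumed to be lowercased already.

```typescript
// Hypothetical policy switch, loosely modelled on the enum described above.
type DuplicateVariantPolicy = "allow" | "reject";

// Returns the variants to keep, or null when the policy rejects the input.
function applyDuplicatePolicy(
  variants: string[],
  policy: DuplicateVariantPolicy
): string[] | null {
  // Keep only the first occurrence of each variant.
  const unique = variants.filter(
    (variant, index) => variants.indexOf(variant) === index
  );
  if (unique.length === variants.length) return variants;
  // "allow" accepts the tag but removes the repeats during canonicalization;
  // "reject" treats the tag as structurally invalid.
  return policy === "allow" ? unique : null;
}
```

Applying one policy to both unicode_language_id and tlang would avoid threading the switch through the canonicalization code at all.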

jswalden avatar Feb 07 '20 19:02 jswalden

The duplicate variant restriction may come from BCP 47, §2.2.5, item 5:

The same variant subtag MUST NOT be used more than once within a language tag.

  • For example, the tag "de-DE-1901-1901" is not valid.

anba avatar Feb 07 '20 23:02 anba

Hmm, okay. That seems pretty clear and direct about invalidity. I can't think of a serious case for not applying that to tlang as well -- anything that actually wanted to interpret transform extensions, well, it's going to have to apply that restriction internally, right?

jswalden avatar Feb 08 '20 00:02 jswalden

BCP 47, § 2.2.9 is probably a better reference point, because it also contains the other restrictions present in IsStructurallyValidLanguageTag.

6.2.2 IsStructurallyValidLanguageTag:

The IsStructurallyValidLanguageTag abstract operation verifies that the locale argument (which must be a String value)

  • represents a well-formed "Unicode BCP 47 Locale Identifier" as specified in Unicode Technical Standard 35 section 3.2, or successor,
  • does not include duplicate variant subtags, and
  • does not include duplicate singleton subtags.

and BCP 47, § 2.2.9:

A tag is considered "valid" if it satisfies these conditions:

  • The tag is well-formed.
  • Either the tag is in the list of grandfathered tags or all of its primary language, extended language, script, region, and variant subtags appear in the IANA Language Subtag Registry as of the particular registry date.
  • There are no duplicate variant subtags.
  • There are no duplicate singleton (extension) subtags.

(The second bullet point isn't present in ECMA-402, because it'd require shipping an up-to-date language tag registry.)
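
The duplicate-singleton condition shared by both lists can be sketched as below. The sketch assumes the tag is already well-formed, so the only one-character subtags before a private-use "x" are extension singletons; `extensionSingletons` and `hasDuplicateSingletons` are hypothetical helper names.

```typescript
// Collects extension singletons, stopping at the private-use marker "x",
// after which one-character subtags are ordinary private-use content.
function extensionSingletons(tag: string): string[] {
  const subtags = tag.toLowerCase().split("-");
  const singletons: string[] = [];
  for (let i = 1; i < subtags.length; i++) {
    if (subtags[i] === "x") break;
    if (subtags[i].length === 1) singletons.push(subtags[i]);
  }
  return singletons;
}

// True when the same extension singleton appears more than once.
function hasDuplicateSingletons(tag: string): boolean {
  const singletons = extensionSingletons(tag);
  return singletons.some(
    (singleton, index) => singletons.indexOf(singleton) !== index
  );
}
```

A tag such as "en-u-ca-gregory-u-nu-latn" repeats the "u" singleton and would fail this condition, with no registry lookup involved.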

anba avatar Feb 10 '20 10:02 anba

We discussed this today and concluded that the duplicate-variant concern does not have to be resolved immediately, and that it's okay if a published ECMA-402 edition ends up lagging the "living standard" spec.

I'll look into creating a PR to additionally forbid duplicate variants in tlang.

jswalden avatar Feb 27 '20 20:02 jswalden

@ben-allen to evaluate which, if any, of the items in the OP still need to be addressed.

sffc avatar Sep 18 '23 23:09 sffc

All of the above appear to be resolved by the following commits:

commit 1e5df59e7b6ee6fe549dec2429dcb71e19b0e368
Author: Leo Balter
Date:   Thu Mar 14 16:06:55 2019 -0400

    Normative: Apply recommended updates for BCP 47 Locale Identifiers

    Ref #330

and

commit 378ba6f03aa36e2d4fa70c8e087bdb99e6ed1b20
Author: Jeff Walden
Date:   Wed Feb 17 17:06:22 2021 -0800

    Do not allow duplicate variants within the tlang component of a transformed content extension. (#429)

I believe this one should be closed.

ben-allen avatar May 02 '24 13:05 ben-allen

Closed because all but one of the bullet points have been addressed in PRs from 2019 and 2021. The remaining bullet point, on sl-rozaj-biske being reordered to sl-biske-rozaj against RFC 5646 rules, was resolved by the removal of RFC 5646 from the normative references. See:

commit 90bd833eda51047ce9b40c73ee753a2a1a08f971 (HEAD)
Author: André Bargull
Date:   Mon Mar 16 02:28:15 2020 -0700

    Editorial: Replace more BCP 47 language tag with Unicode BCP 47 locale identifier

    Also remove the reference to BCP 47 RFCs in the normative references section.

ben-allen avatar May 09 '24 21:05 ben-allen