ecma402
Various Unicode BCP 47 locale identifier issues
- The link to UTS 35 should use https instead of http.
- "[...] identifies locales using language tags as defined by IETF BCP 47 (RFCs 5646 and 4647 or their successors)" should be changed to refer to Unicode BCP 47 locale identifiers exclusively.
- "Unicode BCP 47 Locale Identifiers that meet those validity criteria of Unicode Technical Standard 35, section 3.2 [....]" needs to be reworded, because "validity" can now be misunderstood to mean "validity" as specified in UTS 35 (cf. the "Validity / Comments" column in UTS 35).
- IIUC "structurally valid" in ECMA-402 maps to "syntactically well-formed" in UTS 35.
- "[...] without reference to the IANA Language Subtag Registry" may no longer be needed, or rather it should say that ECMA-402 considers those language tags valid which match the syntax of Unicode BCP 47 locale identifiers, but that validating them against the Unicode validity data is not required. (For example, "aaj" is a valid language tag in ECMA-402 even though "aaj" is not included in https://unicode.org/repos/cldr/tags/latest/common/validity/language.xml.)
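To make the distinction between "well-formed" and "listed in the validity data" concrete, here is a minimal Python sketch. The regular expression follows the unicode_language_subtag syntax in UTS 35 (2-3 or 5-8 ASCII letters). The KNOWN_LANGUAGES set and both function names are hypothetical stand-ins for illustration; the set is not the real contents of language.xml.

```python
import re

# unicode_language_subtag per UTS 35: 2-3 or 5-8 ASCII letters.
LANGUAGE_SUBTAG = re.compile(r"[A-Za-z]{2,3}|[A-Za-z]{5,8}")

# Hypothetical stand-in for validity data; NOT the real language.xml contents.
KNOWN_LANGUAGES = frozenset({"de", "en", "ja", "sl"})


def is_well_formed_language(subtag: str) -> bool:
    """Purely syntactic check, the only kind discussed for ECMA-402 here."""
    return LANGUAGE_SUBTAG.fullmatch(subtag) is not None


def is_in_validity_data(subtag: str) -> bool:
    """Lookup against the (hypothetical) validity data set."""
    return subtag.lower() in KNOWN_LANGUAGES


# "aaj" passes the syntactic check without being in the data set.
print(is_well_formed_language("aaj"), is_in_validity_data("aaj"))
```

Under this reading, an implementation only needs the first function; the second would require shipping and updating a data file.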
6.2.1 Unicode Locale Extension Sequences
- The definition should be changed to refer to unicode_locale_extensions from https://unicode.org/reports/tr35/#Unicode_locale_identifier.
6.2.2 IsStructurallyValidLanguageTag
- The next revision of UTS 35 will remove the ABNF grammar, so IsStructurallyValidLanguageTag will need to refer to the EBNF grammar. Ref: http://unicode.org/repos/cldr/trunk/specs/ldml/tr35.html#Unicode_language_identifier
- It should be clarified whether or not "does not include duplicate variant subtags" also applies to variant subtags in tlang. For example, is "en-t-en-emodeng-emodeng" valid or not?
- The next revision of UTS 35 adds definitions for "canonical syntax" and "canonical form" of Unicode locale identifiers. Does it make sense to switch to using them? (Note: rev34 also has "canonical form", which is only a subset of "canonical syntax" from rev35.)
- "BCP 47 Language Tag to Unicode BCP 47 Locale Identifier" in rev53 and rev54 does not replace variant subtags, which were replaced before the switch to Unicode BCP 47 locale ids. For example, IETF BCP 47 language tag canonicalisation maps "hy-arevmda" to "hyw", whereas "BCP 47 Language Tag to Unicode BCP 47 Locale Identifier" doesn't touch any variant subtags. Canonical Unicode locale identifiers in rev35 will support this canonicalisation. (But "ja-Latn-hepburn-heploc" is still not canonicalised to "ja-Latn-alalc97"; instead "ja-Latn-hepburn-alalc97" is used. Not sure if this is a bug or an unsupported canonicalisation mode in CLDR?)
- "canonical syntax" reorders variant subtags in alphabetical order, which is not allowed per RFC 5646. For example "sl-rozaj-biske" is reordered to "sl-biske-rozaj" in UTS 35, but this actually invalidates the language tag per IANA, because the required prefix for "biske" is "sl-rozaj".
- Unfortunately the "canonical form" in UTS 35 rev54 also adds many more canonicalisation requirements and I'm not sure these make sense for ECMA-402 (at least for the moment).
- Do we need to normalise the case in the tlang extension? For example, should "en-t-en-us" be case-regularised to "en-t-en-US"?
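The conflict between alphabetical variant ordering and IANA prefix requirements can be illustrated with a small Python sketch. It assumes the input is a well-formed language identifier without extensions. The function names and the one-entry REQUIRED_PREFIX table are hypothetical; the table only encodes the "biske" requirement mentioned in the list above, and the prefix test is a simplification of RFC 5646 matching.

```python
import re

SCRIPT = re.compile(r"[A-Za-z]{4}")
REGION = re.compile(r"[A-Za-z]{2}|[0-9]{3}")
VARIANT = re.compile(r"[0-9A-Za-z]{5,8}|[0-9][0-9A-Za-z]{3}")

# Hypothetical one-entry table: the IANA prefix for "biske" is "sl-rozaj".
REQUIRED_PREFIX = {"biske": "sl-rozaj"}


def sort_variants(language_id: str) -> str:
    """Reorder variant subtags alphabetically (assumes no extensions)."""
    subtags = language_id.split("-")
    i = 1  # index 0 is the language subtag
    if i < len(subtags) and SCRIPT.fullmatch(subtags[i]):
        i += 1
    if i < len(subtags) and REGION.fullmatch(subtags[i]):
        i += 1
    head, variants = subtags[:i], subtags[i:]
    if not all(VARIANT.fullmatch(v) for v in variants):
        raise ValueError("not a well-formed language identifier")
    return "-".join(head + sorted(v.lower() for v in variants))


def respects_prefix(tag: str) -> bool:
    """Simplified check that each listed variant follows its required prefix."""
    subtags = tag.lower().split("-")
    for index, subtag in enumerate(subtags):
        prefix = REQUIRED_PREFIX.get(subtag)
        if prefix is not None:
            preceding = "-".join(subtags[:index])
            if preceding != prefix and not preceding.startswith(prefix + "-"):
                return False
    return True


reordered = sort_variants("sl-rozaj-biske")
print(reordered, respects_prefix("sl-rozaj-biske"), respects_prefix(reordered))
```

The original tag satisfies the prefix requirement, while the alphabetically reordered one does not, which is the concern raised for "sl-rozaj-biske".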
@FrankYFTang
@anba Looking at the big picture, all the issues you mentioned seem reasonable to me. I suggest you create a PR based on what you stated above so we can review the wording of the changes together.
- [ ] It should be clarified whether or not "does not include duplicate variant subtags" also applies to variant subtags in tlang. For example is "en-t-en-emodeng-emodeng" valid or not?
- [ ] The next revision of UTS 35 adds definitions for "canonical syntax" and "canonical form" of Unicode locale identifiers. Does it make sense to switch to using them?
- [ ] Do we need to normalise the case in the tlang extension? For example, should "en-t-en-us" be case-regularised to "en-t-en-US"?
While I can provide changes for most of the other parts, these are questions we should work through in a discussion next Thursday. I don't have an immediate answer for these, at least.
I'll have a PR with the other parts, as I already did with the https parts (see #331).
cc @FrankYFTang @zbraniecki to follow up and reference the public spec once it's published
@FrankYFTang Have you followed up with Mark about this in the CLDR spec?
The duplicated-variants restriction, and whether it ought to apply to tlang, is rearing its head in SpiderMonkey patch work and review at this point. Allowing tlang to contain duplicate variants, while unicode_language_id cannot contain them, is forcing the addition of an enum class DuplicateVariants { Allow, Reject } to our canonicalize-language-id operation, with corresponding complexity to only reject duplicates when DuplicateVariants::Reject is passed. This seems undesirable.
Either duplicates should be allowed in both productions (but canonicalization should remove all but one of each duplicate variant), or they should be allowed in neither. I don't remember why IsStructurallyValidLanguageTag includes a no-duplicate-variants restriction. Revision history on GitHub doesn't reveal a rationale for the choice.
If the reason for the restriction is sensible and good, I think we ought to apply it everywhere. But if it is questionable in any way, being slightly more liberal about allowing harmlessly duplicated variants (but removing the duplication during canonicalization) seems like the right approach.
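The shape of the two-mode flag described above can be sketched in Python. This is an illustration only and not SpiderMonkey's actual code: the enum mirrors the DuplicateVariants { Allow, Reject } name from the comment, while canonicalize_variants and its behaviour in the Allow mode (keeping the first occurrence of each variant) are assumptions.

```python
from enum import Enum


class DuplicateVariants(Enum):
    ALLOW = "allow"    # e.g. for tlang, if duplicates were permitted there
    REJECT = "reject"  # e.g. for unicode_language_id


def canonicalize_variants(variants, duplicates):
    """Lower-case variants; reject or drop duplicates depending on the mode."""
    lowered = [v.lower() for v in variants]
    # dict.fromkeys keeps the first occurrence of each variant, in order.
    unique = list(dict.fromkeys(lowered))
    if duplicates is DuplicateVariants.REJECT and len(unique) != len(lowered):
        raise ValueError("duplicate variant subtag")
    return unique


# Allow mode silently drops the repeated variant.
print(canonicalize_variants(["emodeng", "emodeng"], DuplicateVariants.ALLOW))
```

Every caller has to pick a mode, which is the extra complexity the comment objects to; applying one rule to both productions would remove the parameter entirely.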
The duplicate variant restriction may come from BCP 47, §2.2.5, item 5:
The same variant subtag MUST NOT be used more than once within a language tag.
- For example, the tag "de-DE-1901-1901" is not valid.
Hmm, okay. That seems pretty clear and direct about invalidity. I can't think of a serious case for not applying that to tlang as well -- anything that actually wanted to interpret transform extensions, well, it's going to have to apply that restriction internally, right?
BCP 47, § 2.2.9 is probably a better reference point, because it also contains the other restrictions present in IsStructurallyValidLanguageTag.
6.2.2 IsStructurallyValidLanguageTag:
The IsStructurallyValidLanguageTag abstract operation verifies that the locale argument (which must be a String value)
- represents a well-formed "Unicode BCP 47 Locale Identifier" as specified in Unicode Technical Standard 35 section 3.2, or successor,
- does not include duplicate variant subtags, and
- does not include duplicate singleton subtags.
and BCP 47, § 2.2.9:
A tag is considered "valid" if it satisfies these conditions:
- The tag is well-formed.
- Either the tag is in the list of grandfathered tags or all of its primary language, extended language, script, region, and variant subtags appear in the IANA Language Subtag Registry as of the particular registry date.
- There are no duplicate variant subtags.
- There are no duplicate singleton (extension) subtags.
(The second bullet point isn't present in ECMA-402, because it'd require shipping an up-to-date language tag registry.)
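The two duplicate restrictions, extended to tlang as proposed in this thread, could be checked roughly as follows. This Python sketch assumes the tag has already been verified as well-formed, and all function names are hypothetical. It treats any one-character subtag before a private-use "x" as a singleton and treats the subtags after "t" up to the first tkey (a letter followed by a digit) as tlang.

```python
import re

VARIANT = re.compile(r"[0-9a-z]{5,8}|[0-9][0-9a-z]{3}")
TKEY = re.compile(r"[a-z][0-9]")


def split_sections(tag):
    """Return (language_id_subtags, [(singleton, subtags), ...]); skips private use."""
    language_id, extensions = [], []
    current = language_id
    for subtag in tag.lower().split("-"):
        if len(subtag) == 1:
            if subtag == "x":
                break
            current = []
            extensions.append((subtag, current))
        else:
            current.append(subtag)
    return language_id, extensions


def variants_of(language_subtags):
    # Skip the language subtag; script and region never match VARIANT.
    return [s for s in language_subtags[1:] if VARIANT.fullmatch(s)]


def tlang_of(t_subtags):
    tlang = []
    for subtag in t_subtags:
        if TKEY.fullmatch(subtag):
            break  # first tkey ends the tlang part
        tlang.append(subtag)
    return tlang


def has_repeat(items):
    return len(set(items)) != len(items)


def has_forbidden_duplicates(tag):
    language_id, extensions = split_sections(tag)
    if has_repeat([s for s, _ in extensions]) or has_repeat(variants_of(language_id)):
        return True
    return any(s == "t" and has_repeat(variants_of(tlang_of(body)))
               for s, body in extensions)


print(has_forbidden_duplicates("de-DE-1901-1901"))
print(has_forbidden_duplicates("en-t-en-emodeng-emodeng"))
```

Using the same variants_of helper for both the main language identifier and tlang is what makes a separate allow/reject mode unnecessary in this sketch.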
We discussed this today and concluded that the duplicate-variant concern does not have to be resolved immediately, and if a published ECMA-402 edition ends up lagging the "living standard" spec, that's okay.
I'll look into creating a PR to additionally forbid duplicate variants in tlang.
@ben-allen to evaluate which, if any, of the items in the OP still need to be addressed.
All of the above appear to be resolved by the following commits:
commit 1e5df59e7b6ee6fe549dec2429dcb71e19b0e368
Author: Leo Balter <[email protected]>
Date: Thu Mar 14 16:06:55 2019 -0400
Normative: Apply recommended updates for BCP 47 Locale Identifiers
Ref #330
and
commit 378ba6f03aa36e2d4fa70c8e087bdb99e6ed1b20
Author: Jeff Walden <[email protected]>
Date: Wed Feb 17 17:06:22 2021 -0800
Do not allow duplicate variants within the tlang component of a transformed content extension. (#429)
I believe this one should be closed.
Closed because all but one of the bullet points have been addressed in PRs from 2019 and 2021. The remaining bullet point, on sl-rozaj-biske being reordered to sl-biske-rozaj against RFC 5646 rules, was resolved by the removal of RFC 5646 from the normative references. See:
commit 90bd833eda51047ce9b40c73ee753a2a1a08f971 (HEAD)
Author: André Bargull <[email protected]>
Date: Mon Mar 16 02:28:15 2020 -0700
Editorial: Replace more BCP 47 language tag with Unicode BCP 47 locale identifier
Also remove the reference to BCP 47 RFCs in the normative references section.