ecma402
Why is there no `Intl.Locale.prototype.variants`?
Why is there no Intl.Locale.prototype.variants? There are getters for language, region and script but I saw no information about the reason variants is missing or shouldn't be there as well.
@zbraniecki Thoughts on this?
TG2 discussion: https://github.com/tc39/ecma402/blob/main/meetings/notes-2024-11-25.md#why-is-there-no-intllocaleprototypevariants-900
There were questions about motivation (most use cases for variants are better served by a corresponding Unicode extension keyword), as well as the shape of this getter (does it return a list? is the list sorted? or does it return a string with multiple subtags?)
I see, thanks for having the discussion!
Since when is the variants part in CLDR "deprecated" though?
I sadly can't remember my exact use-case and should have included it in my original post, but I think it was about two things:
- Identifying different German orthographies (variants 1901, 1996)
- Identifying Pe̍h-ōe-jī orthography/romanization (variant pehoeji)
The latter got added to the IANA language subtag registry in March this year. I know that isn't CLDR, but I was under the impression that this file is the "source of truth" for registered language subtags used in CLDR and everything else.
I also don't see any kind of "deprecation" of variants here: https://www.unicode.org/reports/tr35
Regarding the type of an eventual variants, I don't see any issue with using an array or even a set here.
I don't know what
The Japanese one from one to two, it’s complicated
is referencing and I don't see the difficulty of parsing variants. They can only ever be alphanumeric strings of 5-8 characters (or 4 characters when starting with a digit, like 1901), and they can only be followed by extensions and private use tags, so what's wrong with .split("-")? 😆
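For reference, a minimal sketch of the variant subtag shape from the BCP 47 grammar (5 to 8 alphanumerics, or a digit followed by 3 alphanumerics). The helper name isVariantSubtag is made up for illustration; it only checks the shape of a single subtag, not whether it is registered or where it sits in the tag.

```javascript
// Hypothetical helper: checks only the *shape* of one subtag per the BCP 47 ABNF
//   variant = 5*8alphanum / (DIGIT 3alphanum)
const isVariantSubtag = (subtag) =>
  /^(?:[0-9a-z]{5,8}|[0-9][0-9a-z]{3})$/i.test(subtag);

console.log(isVariantSubtag("pehoeji")); // true
console.log(isVariantSubtag("1996")); // true (4 characters, starts with a digit)
console.log(isVariantSubtag("Latn")); // false (4 letters is a script subtag)
console.log(isVariantSubtag("DE")); // false
```

Shape alone is not enough once a singleton such as -u- or -x- has been seen, because subtags of the same length after it are not variants, so position in the tag still matters.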
While we're at it, I don't see a reason why extensions and private-uses also aren't getters, but I guess that's a different story.
A little more context: I think people on the call were referring to variants as "legacy" or "deprecated" because of the following issues:
- Some variants are better represented as extension keywords.
  - Example: valencia is better as -u-sd-esvc (see https://www.unicode.org/cldr/charts/44/supplemental/territory_subdivisions.html#esvc)
  - Example: pinyin is better as -t-i0-pinyin (see https://github.com/unicode-org/cldr/blob/33a95a266905f494cc7a912749024f2dbb989de8/common/bcp47/transform_ime.xml#L16C16-L16C22)
- LDML says that variants are supposed to be in alphabetical order, which doesn't make sense with certain IANA subtags.
  - Example: "sl-IT-rozaj-biske-1994" would be canonicalized to something like "sl-IT-1994-biske-rozaj" even though the IANA subtag registry says it should be "rozaj-biske"
In other words, the comments from the discussion were based on the point of view that variants are basically a grab bag of things that would be better expressed as more specific locale extensions.
Personally, I still think variants are motivated because they remain the standard way of expressing orthographies. Something like el-polyton is a good, modern language identifier that I don't believe has another representation with extension keywords.
Regarding the type of an eventual variants, I don't see any issue with using an array or even a set here.
Returning an ECMAScript Set is an interesting proposition since it avoids the point of contention on whose ordering to use (IANA's or Unicode's).
Returning an ECMAScript Set is an interesting proposition since it avoids the point of contention on whose ordering to use (IANA's or Unicode's).
I don't think it would, because ECMAScript Set instances are deterministically ordered.
Yeah, I mainly suggested the Set because it follows the other rule of variants (or any "multi-tag"): uniqueness. But honestly, for most users it would probably be unexpected to get a Set here in comparison to the rest of JS.
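To illustrate the two points about Set made above (deterministic order and uniqueness), a small sketch. The repeated "biske" entry is only there to show the deduplication; it is not meant as a valid variant list.

```javascript
// A Set iterates in insertion order and silently drops repeated values.
const variants = new Set(["rozaj", "biske", "1994", "biske"]);

console.log([...variants]); // [ 'rozaj', 'biske', '1994' ]
console.log(variants.has("1994")); // true
```

So a Set would still expose whichever order the implementation inserted the variants in; it does not sidestep the ordering question.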
Regarding the ordering of variants: I don't really think the array or set needs to be ordered in any specific way other than "the same as supplied". Both CLDR and IANA, if I understand correctly, just define a recommended or canonical way to order them in the context of language subtags, not in the context of JavaScript arrays. And AFAIK implementers need to be able to understand any ordering.
One could even argue that it's more expected if the ordering is the same as the user specified it, even if it's "wrong".
So in general, the ordering, of all things, shouldn't be the blocker here.
BCP47 suggests that the order in the original language tag carries meaning, with earlier subtags subordinating later ones. Specifically, item 6 under section 4.1 (Choice of Language Tag) says:
Use variant subtags sparingly and in the correct order. Most
variant subtags have one or more 'Prefix' fields in the registry
that express the list of subtags with which they are appropriate.
Variants SHOULD only be used with subtags that appear in one of
these 'Prefix' fields. If a variant lists a second variant in
one of its 'Prefix' fields, the first variant SHOULD appear
directly after the second variant in any language tag where both
occur. General purpose variants (those with no 'Prefix' fields
at all) SHOULD appear after any other variant subtags. Order any
remaining variants by placing the most significant subtag first.
If none of the subtags is more significant or no relationship can
be determined, alphabetize the subtags. Because variants are
very specialized, using many of them together generally makes the
tag so narrow as to override the additional precision gained.
Putting the subtags into another order interferes with
interoperability, as well as the overall interpretation of the
tag.
This means that the order should be preserved when there are two or more (and presuming, for the moment, that the tag's author has paid attention to the details in the registry as well as the text just above).
Unicode/CLDR says some different things about the ordering. In practice, the variants are only useful in very specific applications, most of which have nothing to do with locales.
In either case, the original order affects tag matching using one of the fallback schemes, and so should probably be preserved by Intl.Locale against possible future need (a canonicalization operation can be applied separately)
Unfortunately I believe the ordering is one of the main issues that needs to be resolved. We have two specs, IETF and UTS 35, which disagree on the ordering (preserved or alphabetical). ECMA-402 mostly follows Unicode's reckoning of locale identifiers, so it would follow that variants should be alphabetical. However, variants are most useful in IETF's reckoning, where order is preserved.
What currently happens with variant ordering in Intl.Locale.prototype.toString? Can we follow that?
What currently happens with variant ordering in Intl.Locale.prototype.toString? Can we follow that?
The current reality in Chrome at least is this:
(new Intl.Locale("de-bcdefg-abcdefg-12345-1000000-1996")).toString() === "de-1000000-12345-1996-abcdefg-bcdefg"
(new Intl.Locale("sl-IT-rozaj-biske-1994")).toString() === "sl-IT-1994-biske-rozaj"
So it's basically just alphabetic with no special numeric handling ("1000000" ≺ "12345" ∧ "12345" ≺ "abcdefg" ∧ "abcdefg" ≺ "bcdefg").
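The same ordering can be reproduced with a plain default sort, which compares strings by UTF-16 code units. This is only a sketch of the observed behaviour, not how any engine implements it; sortVariants is a made-up name and the input is assumed to be lowercase already.

```javascript
// Default Array.prototype.sort compares strings by UTF-16 code units,
// so digits sort before letters and "1000000" sorts before "12345".
const sortVariants = (variants) => [...variants].sort();

const sorted = sortVariants(["bcdefg", "abcdefg", "12345", "1000000", "1996"]);
console.log(sorted.join("-")); // "1000000-12345-1996-abcdefg-bcdefg"
```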
I couldn't find anything about ordering here or anywhere else in ECMA402, so I guess Intl.Locale.prototype.toString does not (yet) define any ordering? Sorry if I have overlooked something.
Another resource that says basically the same as that BCP47 section and what @sffc has said: https://www.w3.org/International/questions/qa-choosing-language-tags#variants
Both, that BCP47 section and that W3 link, claim that the ordering of variants helps with interoperability but don't get more specific, so I'm really unsure if this is actually the case. Like, is there any application out there that would completely break down if I give it a sl-IT-1994-biske-rozaj instead of a sl-IT-rozaj-biske-1994?
Either way, I get that the ordering is important, within language tag strings. But this would be addressed by fixing Intl.Locale.prototype.toString, no? I personally still don't see how this is related to what Intl.Locale.prototype.variants in JavaScript should look like, since that would be ideally an array or set or whatever that one can then use to loop over or check for specific variants. The hierarchical nature of some variants doesn't and shouldn't, in my opinion, mean that any specific ordering is expected by the user in a JavaScript context.
// Note: locale.variants is the hypothetical getter under discussion.
const describeBookLanguage = (bookName, locale) => {
  const prefix = `${bookName} is written in`;
  let languageLabel = "Sanskrit";
  if (locale.language === "sa" && locale.variants.length !== 0) {
    // don't care about the order here
    if (locale.variants.includes("itihasa")) {
      languageLabel = `Epic ${languageLabel}`;
    }
    if (locale.variants.includes("bauddha")) {
      languageLabel = `Buddhist Hybrid ${languageLabel}`;
      // "Buddhist Hybrid Epic Sanskrit" is technically possible here but probably not real
    }
  }
  else if (locale.language === "cls") {
    languageLabel = `Classical ${languageLabel}`;
  }
  else if (locale.language === "vsn") {
    languageLabel = `Vedic ${languageLabel}`;
  }
  return `${prefix} ${languageLabel}.`;
};
Or is it expected that something like the below should also work?
const firstPart = "sl-IT";
const secondPart = "rozaj-biske-1994";
const localeString1 = `${firstPart}-${secondPart}`; // "sl-IT-rozaj-biske-1994";
const locale = new Intl.Locale(localeString1);
const localeString2 = locale.toString(); // "sl-IT-1994-biske-rozaj";
const variants = locale.variants; // ["1994", "biske", "rozaj"]
const localeString3 = `${firstPart}-${variants.join("-")}`; // "sl-IT-1994-biske-rozaj"
const allSame = localeString1 === localeString2 && localeString2 === localeString3; // false, but should this be true?
The below would also be an issue if one expects a specific ordering of variants, but again, I don't think that expectation exists.
const slovenianVariantDescriptionParts = new Map([
  ["rozaj", "Resian"],
  ["biske", ", San Giorgio dialect"],
  ["lipaw", ", Lipovaz dialect"],
  ["njiva", ", Gniva dialect"],
  ["osojs", ", Oseacco dialect"],
  ["solba", ", Stolvizza dialect"],
  ["1994", ", in standardized 1994 orthography"]
]);
const describeSlovenianLanguageUsed = (locale) => {
  if (locale.language !== "sl") {
    throw new Error("Not Slovenian");
  }
  if (locale.variants.length === 0 || !locale.variants.includes("rozaj")) {
    return "Slovenian";
  }
  return locale.variants
    .map(variant => slovenianVariantDescriptionParts.get(variant))
    .join("");
  // depending on the order of variants, this could result in:
  // - "Resian, San Giorgio dialect, in standardized 1994 orthography" ✅
  // - ", Gniva dialect, in standardized 1994 orthographyResian" ❌
  // - ", in standardized 1994 orthographyResian" ❌
  // - ", Stolvizza dialectResian" ❌
};
I also still don't agree with this sentiment:
In practice, the variants are only useful in very specific applications, most of which have nothing to do with locales.
Again, maybe I'm misunderstanding something, but en-basiceng, de-1901, sgn-ase-blasl, sa-itihasa and el-polyton all seem like valid and not too niche uses of variants. And even if these are considered niche or "specific", I don't think the goal of i18n/l10n is to only consider commonly used and general cases. 😜
I'm not saying this is the most important thing ever, but I also wouldn't disregard variants as something "deprecated" or "only useful in very specific applications, most of which have nothing to do with locales".
So all in all, I think the ordering of Intl.Locale.prototype.variants or rather Intl.Locale.prototype.getVariants simply shouldn't be defined as long as the ordering of variants in Intl.Locale.prototype.toString also isn't defined. In case it either is defined and I just overlooked it, or it absolutely needs to be specified, I think following suit with getTimeZones and getCollations would be fine, which would simply be alphabetic order (or "lexicographic code unit order" to be precise).
For reference, how the ordering issue has been "solved" in a past issue: https://github.com/tc39/ecma402/issues/330#issuecomment-2103421993
Sorry for the triple comment, but here is a quote from UTS 35, which ECMA402 follows, as far as I understand now:
NOTE: Some people may wonder why CLDR uses alphabetical order for variants, rather than the ordering in Section 4.1 of BCP 47. Here are the considerations that lead to that decision:
- The ordering in Section 4.1 is recommended, but not required for conformance. In particular, use of and ordering by Prefix is recommended but not required.
- Moreover, Section 4.5 states that “If more than one variant appears within a tag, processors MAY reorder the variants to obtain better matching behavior or more consistent presentation.”
- The best practices for internationalization have moved well beyond some of the guidelines and recommendations in BCP 47, especially for language matching and language fallback.
- Robust implementations will accept the variants in any order, just as they accept extensions in any order.
- A canonical order allows for determination of identity of identifiers via string comparison.
- The ordering in Section 4.1 does not result in a determinant order for canonicalization, since the mechanism for determining “importance” is not specified: ca-valencia-fonipa and ca-fonipa-valencia could both be ‘canonical’ variants of one another.
- Pure alphabetical order is determinant and simple to implement while the ordering in Section 4.1 is indeterminant, more complex, and provides no significant benefit in modern applications.
So all in all, I think the ordering of Intl.Locale.prototype.variants or rather Intl.Locale.prototype.getVariants simply shouldn't be defined as long as the ordering of variants in Intl.Locale.prototype.toString also isn't defined. In case it either is defined and I just overlooked it
For the record, it is defined:
- The Intl.Locale constructor returns a locale object whose [[Locale]] slot is set to the [[locale]] field of the Record returned from MakeLocaleRecord(tag, opt, localeExtensionKeys).
- MakeLocaleRecord returns a result Record whose [[locale]] field is the value returned from either CanonicalizeUnicodeLocaleId(locale) or InsertUnicodeExtensionAndCanonicalize(locale, attributes, keywords) (the latter of which always returns the result of an internal use of the former).
- CanonicalizeUnicodeLocaleId returns a String that starts with "the String value resulting from performing the algorithm to transform locale to canonical form per Unicode Technical Standard #35 Part 1 Core, Annex C LocaleId Canonicalization".
- UTS 35, as noted above, starts with a Canonicalizing Syntax step that includes "Put any variants into alphabetical order (eg, en-fonipa-scouse, not en-scouse-fonipa)".
- Intl.Locale.prototype.toString just returns the [canonicalized] contents of its receiver's [[Locale]] slot.
Ok, so .variants returning an array with the variants in UTS 35 order would be most consistent with the rest of the spec, and I think functions such as describeSlovenianLanguageUsed could reconstruct what they need from the variants, even if they are in UTS 35 order. Is that accurate?
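As a rough sketch of that reconstruction: given variants in UTS 35 (alphabetical) order, a caller that cares about the BCP 47 ordering could re-sort them against its own priority list. The preferredOrder list and the reorderVariants helper are assumptions made up for this example; nothing like them is part of Intl.Locale.

```javascript
// Assumed preferred order for Resian variants ("rozaj" first, then a dialect,
// then the orthography). Illustration only, not taken from any spec.
const preferredOrder = ["rozaj", "biske", "lipaw", "njiva", "osojs", "solba", "1994"];

// Hypothetical helper: reorders known variants by preferredOrder and keeps
// unknown variants at the end, in their incoming order (sort is stable).
const reorderVariants = (variants) => {
  const rank = (variant) => {
    const index = preferredOrder.indexOf(variant);
    return index === -1 ? preferredOrder.length : index;
  };
  return [...variants].sort((a, b) => rank(a) - rank(b));
};

console.log(reorderVariants(["1994", "biske", "rozaj"])); // [ 'rozaj', 'biske', '1994' ]
```

With that step in front, the map-and-join in describeSlovenianLanguageUsed would no longer depend on the order in which the variants are exposed.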
@sffc That seems logical to me, with the only addition that it might need to be .getVariants instead.
Discussion today identified another design question: representation as a primitive dash-separated string with a simple variants getter (presumably to be split for inspection), or as a fresh array of strings from getVariants() (presumably to be joined for comparison).
There was also a request for motivation, and the most widely-used example I can think of is the Wikipedia IPA template, which adds markup indicating the "fonipa" variant and can be used for styling/hooking advanced functionality/etc.
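To make the two candidate shapes concrete, here is a small comparison using stand-in values. Neither a variants getter nor getVariants() exists on Intl.Locale at the time of this discussion, so both constants below are hypothetical results rather than real API output.

```javascript
// Stand-ins for the canonical variants of "sl-IT-rozaj-biske-1994".
const asString = "1994-biske-rozaj"; // what a hypothetical `variants` getter might return
const asArray = ["1994", "biske", "rozaj"]; // what a hypothetical `getVariants()` might return

// Inspection: the string has to be split first.
console.log(asString.split("-").includes("rozaj")); // true
console.log(asArray.includes("rozaj")); // true

// Comparison: primitives compare with ===, arrays have to be joined (or walked).
console.log(asString === "1994-biske-rozaj"); // true
console.log(asArray.join("-") === asString); // true

// Edge case: an empty string does not split into an empty array.
console.log("".split("-")); // [ '' ]
```

The last line suggests that a string-valued getter would need a defined answer for locales without variants (for example undefined), since splitting an empty string yields one empty element.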
Discussion today identified another design question: representation as a primitive dash-separated string with a simple variants getter (presumably to be split for inspection), or as a fresh array of strings from getVariants() (presumably to be joined for comparison).
There was also a request for motivation, and the most widely-used example I can think of is the Wikipedia IPA template, which adds markup indicating the "fonipa" variant and can be used for styling/hooking advanced functionality/etc.
Can you link to the discussion? "Deprecating" variants would tend to seriously marginalize language communities.
Notes will be added to meetings soon, but no one is talking about deprecating variants—just about whether or not to expose them via a dedicated interface of Intl.Locale instances, the way that language/script/region and baseName (the full unicode_language_id including them) already are.
Notes will be added to meetings soon, but no one is talking about deprecating variants—just about whether or not to expose them via a dedicated interface of Intl.Locale instances, the way that language/script/region and baseName (the full unicode_language_id including them) already are.
Thanks. That would still be problematic and incomplete
Some variants are better represented as extension keywords. Example: valencia is better as -u-sd-esvc
Some, but not all. A variant may not fit into an exact geopolitical boundary. Also while variant implementation is extant but problematic, such extensions have fewer, if any, implementations.
no one is talking about deprecating variants—just about whether or not to expose them via a dedicated interface of Intl.Locale instances, the way that language/script/region and baseName (the full unicode_language_id including them) already are.
Thanks. That would still be problematic and incomplete
It is the current state of affairs; cf. MDN documentation and the spec text. The best way to get variants from a locale identifier right now is manual parsing, e.g.
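// Strips language, script and region from the canonical baseName;
// whatever remains is the variant list (undefined if there is none).
const variantsFromLocaleId = localeId =>
  (
    new Intl.Locale(localeId).baseName.replace(
      /^[a-z]+(-([a-z]{4}|[a-z]{2}|[0-9]{3})\b)*-?/i,
      '',
    ) || undefined
  )?.split('-');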
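A short usage sketch of that workaround, with the function repeated so the snippet runs on its own. It assumes a runtime with Intl.Locale (for example a recent Node.js or browser). For multi-variant identifiers the result follows whatever order baseName reports, which per the Chrome observations earlier in this thread is alphabetical.

```javascript
// Same manual-parsing workaround as above, repeated to keep this runnable.
const variantsFromLocaleId = localeId =>
  (
    new Intl.Locale(localeId).baseName.replace(
      /^[a-z]+(-([a-z]{4}|[a-z]{2}|[0-9]{3})\b)*-?/i,
      '',
    ) || undefined
  )?.split('-');

console.log(variantsFromLocaleId('de')); // undefined (no variants)
console.log(variantsFromLocaleId('zh-Hant-TW')); // undefined (script and region only)
console.log(variantsFromLocaleId('de-1996')); // [ '1996' ]
console.log(variantsFromLocaleId('el-polyton')); // [ 'polyton' ]

// Multi-variant case: order is whatever the engine's canonical baseName uses.
console.log(variantsFromLocaleId('sl-IT-rozaj-biske-1994'));
```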