Why is there no `Intl.Locale.prototype.variants`?

Open nnmrts opened this issue 1 year ago • 1 comments

Why is there no Intl.Locale.prototype.variants? There are getters for language, region and script but I saw no information about the reason variants is missing or shouldn't be there as well.

nnmrts avatar Jun 18 '24 18:06 nnmrts

@zbraniecki Thoughts on this?

sffc avatar Jul 18 '24 00:07 sffc

TG2 discussion: https://github.com/tc39/ecma402/blob/main/meetings/notes-2024-11-25.md#why-is-there-no-intllocaleprototypevariants-900

There were questions about motivation (most use cases for variants are better served by a corresponding Unicode extension keyword), as well as the shape of this getter (does it return a list? is the list sorted? or does it return a string with multiple subtags?)

sffc avatar Nov 25 '24 18:11 sffc

I see, thanks for having the discussion!

Since when is the variants part in CLDR "deprecated" though?

I sadly can't remember my exact use-case and should have included it in my original post, but I think it was about two things:

  • Identifying different German orthographies (variants 1901, 1996)
  • Identifying Pe̍h-ōe-jī orthography/romanization (variant pehoeji)

The latter got added to the IANA language subtag registry in March this year. I know that isn't CLDR, but I was under the impression that this file is the "source of truth" for registered language subtags used in CLDR and everything else.

I also don't see any kind of "deprecation" of variants here: https://www.unicode.org/reports/tr35

Regarding the type of an eventual variants, I don't see any issue with using an array or even a set here.

I don't know what

The Japanese one from one to two, it’s complicated

is referencing and I don't see the difficulty of parsing variants. They can only ever be alphanumeric strings of 5 to 8 characters (or 4 characters when they start with a digit, like 1901), and they can only be followed by extensions and private use tags, so what's wrong with .split("-")? 😆

While we're at it, I don't see a reason why extensions and private-uses also aren't getters, but I guess that's a different story.

nnmrts avatar Nov 26 '24 09:11 nnmrts

A little more context: I think people on the call were referring to variants as "legacy" or "deprecated" because of the following issues:

  1. Some variants are better represented as extension keywords.
    • Example: valencia is better as -u-sd-esvc (see https://www.unicode.org/cldr/charts/44/supplemental/territory_subdivisions.html#esvc)
    • Example: pinyin is better as -t-i0-pinyin (see https://github.com/unicode-org/cldr/blob/33a95a266905f494cc7a912749024f2dbb989de8/common/bcp47/transform_ime.xml#L16C16-L16C22)
  2. LDML says that variants are supposed to be in alphabetical order, which doesn't make sense with certain IANA subtags
    • Example: "sl-IT-rozaj-biske-1994" would be canonicalized to something like "sl-IT-1994-biske-rozaj" even though the IANA subtag registry says it should be "rozaj-biske"

In other words, the comments from the discussion were based on the point of view that variants are basically a grab bag of things that would be better expressed as more specific locale extensions.

Personally, I still think variants are motivated because they remain the standard way of expressing orthographies. Something like el-polyton is a good, modern language identifier that I don't believe has another representation with extension keywords.

Regarding the type of an eventual variants, I don't see any issue with using an array or even a set here.

Returning an ECMAScript Set is an interesting proposition since it avoids the point of contention on whose ordering to use (IANA's or Unicode's).

sffc avatar Nov 28 '24 00:11 sffc

Returning an ECMAScript Set is an interesting proposition since it avoids the point of contention on whose ordering to use (IANA's or Unicode's).

I don't think it would, because ECMAScript Set instances are deterministically ordered.
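To illustrate, here is a minimal sketch using only standard Set behavior (the two variant lists are illustrative inputs, not the output of any existing API). A Set iterates in insertion order, so whichever ordering is used to populate it remains observable:

```javascript
// Same members, inserted in IANA-style order and in alphabetical order.
const ianaOrder = new Set(["rozaj", "biske", "1994"]);
const alphabeticalOrder = new Set(["1994", "biske", "rozaj"]);

// Iteration follows insertion order, so the two Sets are distinguishable.
const fromIana = [...ianaOrder]; // ["rozaj", "biske", "1994"]
const fromAlphabetical = [...alphabeticalOrder]; // ["1994", "biske", "rozaj"]
```

So a Set would still force a choice of ordering; it would only add a uniqueness guarantee.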

gibson042 avatar Dec 03 '24 20:12 gibson042

Yeah, I mainly suggested the Set because it follows the other rule of variants (or any "multi-tag"): uniqueness. But honestly, for most users it would probably be unexpected to get a Set here in comparison to the rest of JS.

nnmrts avatar Dec 04 '24 10:12 nnmrts

Regarding the ordering of variants: I don't really think the array or set needs to be ordered in any specific way other than "the same as supplied". Both CLDR and IANA, if I understand correctly, just define a recommended or canonical way to order them in the context of language subtags, not in the context of JavaScript arrays. And AFAIK implementers need to be able to understand any ordering.

One could even argue that it's more expected if the ordering is the same as the user specified it, even if it's "wrong".

So in general, the ordering, of all things, shouldn't be the blocker here.

nnmrts avatar Dec 04 '24 10:12 nnmrts

BCP47 suggests that the order in the original language tag carries meaning, with earlier subtags subordinating later ones. Specifically, item 6 under section 4.1 (Choice of Language Tag) says:

       Use variant subtags sparingly and in the correct order.  Most
       variant subtags have one or more 'Prefix' fields in the registry
       that express the list of subtags with which they are appropriate.
       Variants SHOULD only be used with subtags that appear in one of
       these 'Prefix' fields.  If a variant lists a second variant in
       one of its 'Prefix' fields, the first variant SHOULD appear
       directly after the second variant in any language tag where both
       occur.  General purpose variants (those with no 'Prefix' fields
       at all) SHOULD appear after any other variant subtags.  Order any
       remaining variants by placing the most significant subtag first.
       If none of the subtags is more significant or no relationship can
       be determined, alphabetize the subtags.  Because variants are
       very specialized, using many of them together generally makes the
       tag so narrow as to override the additional precision gained.
       Putting the subtags into another order interferes with
       interoperability, as well as the overall interpretation of the
       tag.

This means that the order should be preserved when there are two or more (and presuming, for the moment, that the tag's author has paid attention to the details in the registry as well as the text just above).

Unicode/CLDR says some different things about the ordering. In practice, the variants are only useful in very specific applications, most of which have nothing to do with locales.

In either case, the original order affects tag matching using one of the fallback schemes, and so should probably be preserved by Intl.Locale against possible future need (a canonicalization operation can be applied separately)

aphillips avatar Dec 04 '24 21:12 aphillips

Unfortunately I believe the ordering is one of the main issues that needs to be resolved. We have two specs, IETF and UTS 35, which disagree on the ordering (preserved or alphabetical). ECMA-402 mostly follows Unicode's reckoning of locale identifiers, so it would follow that variants should be alphabetical. However, variants are most useful in IETF's reckoning, where order is preserved.

What currently happens with variant ordering in Intl.Locale.prototype.toString? Can we follow that?

sffc avatar Dec 04 '24 23:12 sffc

What currently happens with variant ordering in Intl.Locale.prototype.toString? Can we follow that?

The current reality in Chrome at least is this:

(new Intl.Locale("de-bcdefg-abcdefg-12345-1000000-1996")).toString() === "de-1000000-12345-1996-abcdefg-bcdefg"
(new Intl.Locale("sl-IT-rozaj-biske-1994")).toString() === "sl-IT-1994-biske-rozaj"

So it's basically just alphabetic with no special numeric handling ("1000000" ≺ "12345" ∧ "12345" ≺ "abcdefg" ∧ "abcdefg" ≺ "bcdefg").
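As a rough cross-check, the same order falls out of a default Array.prototype.sort, which compares strings by UTF-16 code units. This is only a sketch that reuses the made-up variants from the first example; it does not claim to be what engines do internally:

```javascript
// The variants of the first example, in the order they were supplied.
const suppliedVariants = ["bcdefg", "abcdefg", "12345", "1000000", "1996"];

// Default sort compares code units: digits (0x30-0x39) sort before
// lowercase letters (0x61-0x7A), and "1000000" precedes "12345"
// because "0" < "2" at the second position.
const sortedVariants = [...suppliedVariants].sort();
// ["1000000", "12345", "1996", "abcdefg", "bcdefg"]

const rejoined = `de-${sortedVariants.join("-")}`;
// "de-1000000-12345-1996-abcdefg-bcdefg"
```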

I couldn't find anything about ordering here or anywhere else in ECMA402, so I guess Intl.Locale.prototype.toString does not (yet) define any ordering? Sorry if I have overlooked something.


Another resource that says basically the same as that BCP47 section and what @sffc has said: https://www.w3.org/International/questions/qa-choosing-language-tags#variants

Both that BCP47 section and that W3 link claim that the ordering of variants helps with interoperability, but neither gets more specific, so I'm really unsure if this is actually the case. Like, is there any application out there that would completely break down if I give it a sl-IT-1994-biske-rozaj instead of a sl-IT-rozaj-biske-1994?

Either way, I get that the ordering is important, within language tag strings. But this would be addressed by fixing Intl.Locale.prototype.toString, no? I personally still don't see how this is related to what Intl.Locale.prototype.variants in JavaScript should look like, since that would be ideally an array or set or whatever that one can then use to loop over or check for specific variants. The hierarchical nature of some variants doesn't and shouldn't, in my opinion, mean that any specific ordering is expected by the user in a JavaScript context.

const describeBookLanguage = (bookName, locale) => {
  const prefix = `${bookName} is written in`;

  let languageLabel = "Sanskrit";

  if (locale.language === "sa" && locale.variants.length !== 0) {

    // don't care about the order here
    if (locale.variants.includes("itihasa")) {
      languageLabel = `Epic ${languageLabel}`;
    }

    if (locale.variants.includes("bauddha")) {
      languageLabel = `Buddhist Hybrid ${languageLabel}`;

      // "Buddhist Hybrid Epic Sanskrit" is technically possible here but probably not real
    }
  }
  else if (locale.language === "cls") {
    languageLabel = `Classical ${languageLabel}`;
  }
  else if (locale.language === "vsn") {
    languageLabel = `Vedic ${languageLabel}`;
  }

  return `${prefix} ${languageLabel}.`;
}

Or is it expected that something like the below should also work?

const firstPart = "sl-IT";
const secondPart = "rozaj-biske-1994";

const localeString1 = `${firstPart}-${secondPart}`; // "sl-IT-rozaj-biske-1994";

const locale = new Intl.Locale(localeString1);

const localeString2 = locale.toString(); // "sl-IT-1994-biske-rozaj";

const variants = locale.variants; // ["1994", "biske", "rozaj"]

const localeString3 = `${firstPart}-${variants.join("-")}`; // "sl-IT-1994-biske-rozaj"

const allSame = localeString1 === localeString2 && localeString2 === localeString3; // false, but should this be true?

The below would also be an issue if one expects a specific ordering of variants, but again, I don't think that expectation exists.

const slovenianVariantDescriptionParts = new Map([
  ["rozaj", "Resian"],
  ["biske", ", San Giorgio dialect"],
  ["lipaw", ", Lipovaz dialect"],
  ["njiva", ", Gniva dialect"],
  ["osojs", ", Oseacco dialect"],
  ["solba", ", Stolvizza dialect"],
  ["1994", ", in standardized 1994 orthography"]
]);

const describeSlovenianLanguageUsed = (locale) => {
  if (locale.language !== "sl") {
    throw new Error("Not Slovenian");
  }

  if (locale.variants.length === 0 || !locale.variants.includes("rozaj")) {
    return "Slovenian";
  }

  return locale.variants
    .map(variant => slovenianVariantDescriptionParts.get(variant))
    .join("");

  // depending on the order of variants, this could result in:
  // - "Resian, San Giorgio dialect, in standardized 1994 orthography" ✅
  // - ", Gniva dialect, in standardized 1994 orthographyResian" ❌
  // - ", in standardized 1994 orthographyResian" ❌
  // - ", Stolvizza dialectResian" ❌
};

I also still don't agree with this sentiment:

In practice, the variants are only useful in very specific applications, most of which have nothing to do with locales.

Again, maybe I'm misunderstanding something, but en-basiceng, de-1901, sgn-ase-blasl, sa-itihasa and el-polyton all seem like valid and not too niche uses of variants. And even if these are considered niche or "specific", I don't think the goal of i18n/l10n is to only consider commonly used and general cases. 😜

I'm not saying this is the most important thing ever, but I also wouldn't disregard variants as something "deprecated" or "only useful in very specific applications, most of which have nothing to do with locales".


So all in all, I think the ordering of Intl.Locale.prototype.variants or rather Intl.Locale.prototype.getVariants simply shouldn't be defined as long as the ordering of variants in Intl.Locale.prototype.toString also isn't defined. In case it either is defined and I just overlooked it, or it absolutely needs to be specified, I think following suit with getTimeZones and getCollations would be fine, which would simply be alphabetic order (or "lexicographic code unit order" to be precise).

nnmrts avatar Dec 05 '24 02:12 nnmrts

For reference, how the ordering issue has been "solved" in a past issue: https://github.com/tc39/ecma402/issues/330#issuecomment-2103421993

nnmrts avatar Dec 05 '24 04:12 nnmrts

Sorry for the triple comment, but here is a quote from UTS 35, which ECMA402 follows as far as I understand now:

NOTE: Some people may wonder why CLDR uses alphabetical order for variants, rather than the ordering in Section 4.1 of BCP 47. Here are the considerations that lead to that decision:

  • The ordering in Section 4.1 is recommended, but not required for conformance. In particular, use of and ordering by Prefix is recommended but not required.
  • Moreover, Section 4.5 states that “If more than one variant appears within a tag, processors MAY reorder the variants to obtain better matching behavior or more consistent presentation.”
  • The best practices for internationalization have moved well beyond some of the guidelines and recommendations in BCP 47, especially for language matching and language fallback.
  • Robust implementations will accept the variants in any order, just as they accept extensions in any order.
  • A canonical order allows for determination of identity of identifiers via string comparison.
  • The ordering in Section 4.1 does not result in a determinant order for canonicalization, since the mechanism for determining “importance” is not specified: ca-valencia-fonipa and ca-fonipa-valencia could both be ‘canonical’ variants of one another.
  • Pure alphabetical order is determinant and simple to implement while the ordering in Section 4.1 is indeterminant, more complex, and provides no significant benefit in modern applications.

nnmrts avatar Dec 05 '24 04:12 nnmrts

So all in all, I think the ordering of Intl.Locale.prototype.variants or rather Intl.Locale.prototype.getVariants simply shouldn't be defined as long as the ordering of variants in Intl.Locale.prototype.toString also isn't defined. In case it either is defined and I just overlooked it

For the record, it is defined:

  1. The Intl.Locale constructor returns a locale object whose [[Locale]] slot is set to the [[locale]] field of the Record returned from MakeLocaleRecord(tag, opt, localeExtensionKeys).
    1. MakeLocaleRecord returns a result Record whose [[locale]] field is the value returned from either CanonicalizeUnicodeLocaleId(locale) or InsertUnicodeExtensionAndCanonicalize(locale, attributes, keywords) (the latter of which always returns the result of an internal use of the former).
    2. CanonicalizeUnicodeLocaleId returns a String that starts with "the String value resulting from performing the algorithm to transform locale to canonical form per Unicode Technical Standard #35 Part 1 Core, Annex C LocaleId Canonicalization".
    3. UTS 35, as noted above, starts with a Canonicalizing Syntax step that includes "Put any variants into alphabetical order (eg, en-fonipa-scouse, not en-scouse-fonipa)".
  2. Intl.Locale.prototype.toString just returns the [canonicalized] contents of its receiver's [[Locale]] slot.

gibson042 avatar Dec 05 '24 16:12 gibson042

Ok, so .variants returning an array with the variants in UTS 35 order would be most consistent with the rest of the spec, and I think functions such as describeSlovenianLanguageUsed could reconstruct what they need from the variants, even if they are in UTS 35 order. Is that accurate?
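A minimal sketch of such a reconstruction, assuming the variants arrive as a plain array in any order. The describeResian helper and its input are illustrative only, since no variants accessor exists today. The idea is to iterate over the lookup table, whose insertion order encodes the desired hierarchy, instead of over the variants themselves:

```javascript
// Insertion order of this Map defines the output order (prefix first).
const resianDescriptionParts = new Map([
  ["rozaj", "Resian"],
  ["biske", ", San Giorgio dialect"],
  ["1994", ", in standardized 1994 orthography"],
]);

// Hypothetical helper: `variants` is any array of variant subtags.
const describeResian = (variants) => {
  const present = new Set(variants);

  return [...resianDescriptionParts]
    .filter(([variant]) => present.has(variant))
    .map(([, part]) => part)
    .join("");
};

// Same description regardless of the order the variants are supplied in.
const fromAlphabeticalInput = describeResian(["1994", "biske", "rozaj"]);
const fromIanaInput = describeResian(["rozaj", "biske", "1994"]);
// Both: "Resian, San Giorgio dialect, in standardized 1994 orthography"
```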

sffc avatar Dec 05 '24 20:12 sffc

@sffc That seems logical to me, with the only addition that it might need to be .getVariants instead.

nnmrts avatar Dec 06 '24 01:12 nnmrts

Discussion today identified another design question: representation as a primitive dash-separated string with a simple variants getter (presumably to be split for inspection), or as a fresh array of strings from getVariants() (presumably to be joined for comparison).
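The trade-off can be sketched with plain values standing in for the two shapes (neither accessor exists yet, so the string and the array below are illustrative only):

```javascript
// Option A: a getter returning one dash-separated string.
const variantsAsString = "1994-biske-rozaj";
// Inspecting individual variants requires a split first.
const hasRozaj = variantsAsString.split("-").includes("rozaj"); // true

// Option B: a method returning a fresh array on every call.
const variantsAsArray = ["1994", "biske", "rozaj"];
// Arrays compare by identity, so comparing requires a join first.
const sameByIdentity = variantsAsArray === ["1994", "biske", "rozaj"]; // false
const sameByJoin = variantsAsArray.join("-") === variantsAsString; // true
```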

There was also a request for motivation, and the most widely-used example I can think of is the Wikipedia IPA template, which adds markup indicating the "fonipa" variant and can be used for styling/hooking advanced functionality/etc.

gibson042 avatar Mar 07 '25 00:03 gibson042

Discussion today identified another design question: representation as a primitive dash-separated string with a simple variants getter (presumably to be split for inspection), or as a fresh array of strings from getVariants() (presumably to be joined for comparison).

There was also a request for motivation, and the most widely-used example I can think of is the Wikipedia IPA template, which adds markup indicating the "fonipa" variant and can be used for styling/hooking advanced functionality/etc.

Can you link to the discussion? "Deprecating" variants would tend to seriously marginalize language communities.

srl295 avatar Mar 07 '25 03:03 srl295

Notes will be added to meetings soon, but no one is talking about deprecating variants—just about whether or not to expose them via a dedicated interface of Intl.Locale instances, the way that language/script/region and baseName (the full unicode_language_id including them) already are.

gibson042 avatar Mar 07 '25 18:03 gibson042

Notes will be added to meetings soon, but no one is talking about deprecating variants—just about whether or not to expose them via a dedicated interface of Intl.Locale instances, the way that language/script/region and baseName (the full unicode_language_id including them) already are.

Thanks. That would still be problematic and incomplete

Some variants are better represented as extension keywords. Example: valencia is better as -u-sd-esvc

Some, but not all. A variant may not fit into an exact geopolitical boundary. Also while variant implementation is extant but problematic, such extensions have fewer, if any, implementations.

srl295 avatar Mar 07 '25 18:03 srl295

no one is talking about deprecating variants—just about whether or not to expose them via a dedicated interface of Intl.Locale instances, the way that language/script/region and baseName (the full unicode_language_id including them) already are.

Thanks. That would still be problematic and incomplete

It is the current state of affairs; cf. MDN documentation and the spec text. The best way to get variants from a locale identifier right now is manual parsing, e.g.

const variantsFromLocaleId = localeId =>
  (
    new Intl.Locale(localeId).baseName.replace(
      /^[a-z]+(-([a-z]{4}|[a-z]{2}|[0-9]{3})\b)*-?/i,
      '',
    ) || undefined
  )?.split('-');
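An alternative sketch of the same manual parsing, assuming only the subtag shapes from the grammar: a variant is 5 to 8 alphanumerics, or 4 characters starting with a digit, which never collides with script (4 letters) or region (2 letters or 3 digits). The name variantsBySubtagShape is hypothetical, and unlike the snippet above it returns an empty array rather than undefined when there are no variants:

```javascript
// Shape of a variant subtag: 5-8 alphanumerics, or a digit plus 3 alphanumerics.
const VARIANT_SHAPE = /^(?:[0-9a-z]{5,8}|[0-9][0-9a-z]{3})$/i;

// Hypothetical helper, not part of Intl.
const variantsBySubtagShape = (localeId) =>
  new Intl.Locale(localeId).baseName
    .split("-")
    // Skip the language subtag, which may itself be 5-8 letters long.
    .slice(1)
    .filter((subtag) => VARIANT_SHAPE.test(subtag));

// Usage; the order is whatever order baseName reports.
const resianVariants = variantsBySubtagShape("sl-IT-rozaj-biske-1994");
const noVariants = variantsBySubtagShape("en-US"); // []
```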

gibson042 avatar Mar 07 '25 18:03 gibson042