ecma402

Section 9.1 AvailableLocales shouldn't require base language if script is present

Open sffc opened this issue 11 months ago • 4 comments

It seems like it should be allowed for AvailableLocales to support zh-Hant but not zh, since zh implies zh-Hans. However, the spec currently states:

Additionally, for each element with more than one subtag, it must also include a less narrow language tag with the same language subtag and a strict subset of the same following subtags (i.e., omitting one or more) to serve as a potential fallback from ResolveLocale.
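
To make the quoted requirement concrete, here is a minimal sketch of a check for it. The function name `missingFallbacks` is hypothetical, the tags are assumed to be already canonicalized, and extension and private-use sequences are ignored, so this is an illustration of the rule rather than spec text.

```javascript
// Returns the elements of availableLocales that lack the "less narrow" tag
// required by the quoted paragraph. Simplified: assumes canonicalized tags
// without extension or private-use sequences.
function missingFallbacks(availableLocales) {
  const tags = availableLocales.map((tag) => tag.split("-"));
  const missing = [];
  for (const [language, ...rest] of tags) {
    if (rest.length === 0) continue; // single-subtag elements need no fallback
    const hasFallback = tags.some(
      ([otherLanguage, ...otherRest]) =>
        otherLanguage === language &&
        otherRest.length < rest.length && // strict subset: at least one subtag omitted
        otherRest.every((subtag) => rest.includes(subtag))
    );
    if (!hasFallback) missing.push([language, ...rest].join("-"));
  }
  return missing;
}

console.log(missingFallbacks(["zh-Hant"]));               // ["zh-Hant"]
console.log(missingFallbacks(["zh", "zh-Hant"]));         // []
console.log(missingFallbacks(["zh-Hant", "zh-Hant-TW"])); // ["zh-Hant"]
```

Under this reading, a list containing zh-Hant without zh is rejected, which is the case this issue is about.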

sffc, Dec 10 '24 01:12

since zh implies zh-Hans

Is there any spec text supporting this claim? I would consider it perfectly reasonable for an implementation that has data for "zh-Hant" but not "zh-Hans" to use the former in service of requested locale "zh", and any application that specifically warrants "zh-Hans" should be specific.

On the other hand, it would seem bizarre and in violation of the spirit (if not also the letter) of BCP 47 to support narrow data in absence of covering broad data. Some excerpts:

  • «A language tag is composed from a sequence of one or more "subtags", each of which refines or narrows the range of language identified by the overall tag»
  • «In the lookup scheme, the language range is progressively truncated from the end until a matching language tag is located. Single letter or digit subtags (including both the letter 'x', which introduces private-use sequences, and the subtags that introduce extensions) are removed at the same time as their closest trailing subtag.»
  • «For example, a user who reads both Simplified and Traditional Chinese, but who prefers Simplified, might use the range "zh" for filtering (matching all items that user can read) but "zh-Hans" for lookup (making sure that user gets the preferred form if it's available, but the fallback to "zh" will still work)»
  • «Whether a subtag adds distinguishing value can depend on the context of the request… If the user cannot be sure which scheme is being used (or if more than one might be applied to a given request), the user SHOULD specify the most specific (largest number of subtags) range first and then supply shorter prefixes later in the list to ensure that filtering returns a complete set of tags.»

I don't think that's invalidated by Unicode likelySubtags logic, which improves "best case" results but should not preempt such "worst case" scenarios.
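
The truncation described in the RFC 4647 excerpt above can be sketched as follows. The helper name `lookupCandidates` is hypothetical, and the sketch covers only the progressive truncation step, not matching against available tags or any default value.

```javascript
// Lists the tags a lookup would try, from most to least specific.
function lookupCandidates(range) {
  const subtags = range.split("-");
  const candidates = [];
  while (subtags.length > 0) {
    candidates.push(subtags.join("-"));
    subtags.pop(); // remove the trailing subtag
    // A single letter or digit subtag is removed together with the
    // subtag that followed it, so it never ends a candidate.
    while (subtags.length > 0 && subtags[subtags.length - 1].length === 1) {
      subtags.pop();
    }
  }
  return candidates;
}

console.log(lookupCandidates("zh-Hant-CN-x-private1-private2"));
// ["zh-Hant-CN-x-private1-private2", "zh-Hant-CN-x-private1", "zh-Hant-CN", "zh-Hant", "zh"]
```

Note that "zh" is always the last candidate for any range starting with that language subtag, which is why lookup expects broad data to exist wherever narrow data does.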

gibson042, Dec 19 '24 20:12

TG2 discussion: https://github.com/tc39/ecma402/blob/main/meetings/notes-2024-12-19.md#section-91-availablelocales-shouldnt-require-base-language-if-script-is-present-947

sffc, Dec 21 '24 05:12

CC @eemeli

I think my main problem here is that the spec requires zh to be supported if zh-Hant is supported, even if the user agent does not believe that this is an appropriate fallback. A user agent can and should be allowed to strictly follow "likely subtags" locale matching.

sffc, Jan 22 '25 23:01

even if the user agent does not believe that this is an appropriate fallback

I don't agree with the characterization of "zh" as a fallback for "zh-Hant"... it's a generalization that uses the same language but lacks any specification of script, and therefore encompasses "Hant" but also "Hans" (and for that matter, "Hani" and "Hanb" and "Latn" as well).

A user agent can and should be allowed to strictly follow "likely subtags" locale matching.

Well, lookup by prefix is pretty strongly baked in via localeMatcher: "lookup" and required by RFC 4647 section 3.4, and disregarding it would seem to necessarily be backwards-incompatible. I don't consider such changes lightly, so could you provide a concrete example (i.e., using ECMA-402 APIs) of how it would be beneficial?
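
For reference, opting in to prefix lookup through ECMA-402 APIs looks roughly like this. The requested tag is only an illustration, and the results depend on the implementation's locale data, so no outputs are shown.

```javascript
// Explicitly request the "lookup" matcher instead of the default "best fit".
const requested = ["zh-Hant-TW"];
const options = { localeMatcher: "lookup" };

// Which of the requested tags the implementation claims to support.
const supported = Intl.DateTimeFormat.supportedLocalesOf(requested, options);

// Which locale a formatter actually resolves to for the same request.
const resolved = new Intl.DateTimeFormat(requested, options).resolvedOptions().locale;

console.log(supported, resolved);
```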

gibson042, Jan 23 '25 07:01

This question is about LookupMatchingLocaleByBestFit vs LookupMatchingLocaleByPrefix, I think.

Both ICU4C and ICU4X, in their default "best fit" fallback, consider zh-TW to fall back to zh-Hant and then to root. zh is never considered as part of fallback. Similarly, sr-ME falls back to sr-Latn and then to root, not to sr, which is Cyrillic.

There are numerous other examples of parent overrides, such as en-GB inheriting from en-001 before it inherits from en.
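
As a rough sketch of how such parent overrides change the fallback chain: the table below is hypothetical and built only from the examples in this comment (it is not real CLDR or ICU data), "und" stands in for root, and the zh-TW step is collapsed into a single entry for brevity.

```javascript
// Hypothetical parent table; entries mirror the examples above only.
const PARENT_OVERRIDES = new Map([
  ["zh-TW", "zh-Hant"],
  ["zh-Hant", "und"],
  ["sr-ME", "sr-Latn"],
  ["sr-Latn", "und"],
  ["en-GB", "en-001"],
  ["en-001", "en"],
]);

// Follows an explicit parent when one exists, otherwise truncates the last
// subtag, and stops at "und" (root).
function fallbackChain(locale) {
  const chain = [locale];
  let current = locale;
  while (current !== "und") {
    const override = PARENT_OVERRIDES.get(current);
    if (override !== undefined) {
      current = override;
    } else if (current.includes("-")) {
      current = current.slice(0, current.lastIndexOf("-"));
    } else {
      current = "und";
    }
    chain.push(current);
  }
  return chain;
}

console.log(fallbackChain("zh-TW")); // ["zh-TW", "zh-Hant", "und"]
console.log(fallbackChain("sr-ME")); // ["sr-ME", "sr-Latn", "und"]
console.log(fallbackChain("en-GB")); // ["en-GB", "en-001", "en", "und"]
```

With this kind of table, neither zh nor sr ever appears in the chains for zh-TW and sr-ME, unlike pure prefix truncation.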

According to the current spec, though, including zh-TW means the engine must also include zh. So, what do we put in zh in an engine that wishes to support zh-TW but doesn't want to ship zh-Hans data?

  1. Put Hant data into zh. This is problematic because zh is well-known throughout the industry to mean zh-Hans, and if zh-Hans data were on the system, then zh would start meaning Hans.
  2. Put root data into zh. This is problematic because ECMA-402 should signal that zh is unsupported rather than implicitly filling in root data.

Does this help clarify @gibson042?

sffc, Aug 13 '25 22:08

This question is about LookupMatchingLocaleByBestFit vs LookupMatchingLocaleByPrefix, I think.

Both ICU4C and ICU4X, in their default "best fit" fallback, consider zh-TW to fall back to zh-Hant and then to root. zh is never considered as part of fallback. Similarly, sr-ME falls back to sr-Latn and then to root, not to sr, which is Cyrillic.

There are numerous other examples of parent overrides, such as en-GB inheriting from en-001 before it inherits from en.

Yes, you're just reiterating "best fit".

According to the current spec, though, including zh-TW means the engine must also include zh. So, what do we put in zh in an engine that wishes to support zh-TW but doesn't want to ship zh-Hans data?

  1. Put Hant data into zh. This is problematic because zh is well-known throughout the industry to mean zh-Hans, and if zh-Hans data were on the system, then zh would start meaning Hans.

That's in line with what I said in https://github.com/tc39/ecma402/issues/947#issuecomment-2555737527 :

I would consider it perfectly reasonable for an implementation that has data for "zh-Hant" but not "zh-Hans" to use the former in service of requested locale "zh", and any application that specifically warrants "zh-Hans" should be specific.

(although IMO implementations should be strongly discouraged from shipping such gaps, and in practice I doubt it's a real issue)

  2. Put root data into zh. This is problematic because ECMA-402 should signal that zh is unsupported rather than implicitly filling in root data.

By what mechanism would ECMA-402 signal lack of support, and for what purpose? Remember, this only applies when code explicitly opts in to "lookup" matching, which specifically works by truncation.

gibson042, Aug 14 '25 01:08

Privacy/fingerprinting concerns aside, an engine might choose to add languages at runtime. This came up in the case of the text translation API being proposed in W3C. The definition of zh should not depend on whether Hant or Hans happened to have been loaded first.

sffc, Aug 14 '25 07:08

Observably adding languages within the lifetime of an agent must be prohibited for the same reasons that observably altering time zone data is prohibited (cf. Use of the IANA Time Zone Database and GetAvailableNamedTimeZoneIdentifier). If we need to clarify that in e.g. Internal slots of Service Constructors, I'm happy to do so.

gibson042, Aug 19 '25 02:08

@gibson042 Do you mean that we ought to tighten further the paragraph recently added to the Implementation Dependencies section?

In browser implementations the initial set of locales, currencies, calendars, numbering systems, and other enumerable items visible to a particular origin must be the same for all users sharing the same user agent string (engine and platform version). Furthermore, dynamic changes to these sets must not result in users becoming distinguishable from each other. This constraint is imposed to reduce the fingerprinting risk inherent in internationalization, and may be relaxed in future revisions. As a result of this constraint, the first time a browser implementation that allows on-demand locale installation receives a request from a particular origin that could require installing a new locale, it must not reveal whether or not that locale is already installed.

eemeli, Aug 19 '25 04:08

Huh? We've had extensive discussion about engines adding locales at runtime. That's very much in the scope of what a compliant ECMA-402 implementation could or should be able to do, even if none do it right now.

sffc, Aug 19 '25 06:08

It seems unrealistic for an implementation that is under substantial interop constraints to ship without zh-Hans data and without the capability of loading it at run-time in a way that doesn't affect what zh means. An embedded system that needs to support zh-Hant but does not have space for zh-Hans data is going to do what it needs to do. In that sense, I think it's not a particularly good use of TG time to tweak spec language on that point.

However, in the Gecko bug to migrate to the ICU4X collator, @anba pointed out that reporting the ICU4X-internal outcome of ICU4X's collation data resolution would not be 402-compliant.

Even ignoring the issue of whether there's a list of languages for which the root is attested as valid such that the resolved locale for en is en as opposed to root or und, 402 seems to require the resolved locale for zh-HK to be zh-HK as opposed to something else. The resolved collation locale for zh-HK inside ICU4X is und-Hani/stroke, so / vs. -u-co- aside, the internal modeling of the situation doesn't retain any of zh, HK, or even Hant.

At this point, what can be reported by 402 APIs is most likely to be more of a Web compat research matter than a matter of what would be good i18n design. (Currently, browsers say (new Intl.Collator(["en-US"])).resolvedOptions().locale == "en-US" but (new Intl.Collator(["en-IN"])).resolvedOptions().locale == "en" even though both use root data, and (new Intl.Collator(["ar-SA"])).resolvedOptions().locale == "ar-SA" but (new Intl.Collator(["ar-TN"])).resolvedOptions().locale == "ar" even though there is no SA data separate from ar. Does the Web depend on this? Does ICU4C ship this oddity for Web compat or for compat with some other legacy ICU clients?)
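
The observations above can be re-checked in any engine with a small snippet like the following. The tag list is taken from the examples in the previous paragraph; the results depend on the engine and its ICU data, so no expected output is given here.

```javascript
// Collect the resolved collation locale for each requested tag.
const requestedTags = ["en-US", "en-IN", "ar-SA", "ar-TN"];
const resolvedByTag = Object.fromEntries(
  requestedTags.map((tag) => [tag, new Intl.Collator([tag]).resolvedOptions().locale])
);

console.log(resolvedByTag);
```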

For now, my working assumption is that SpiderMonkey is going to retain its own 402-compliant locale resolution code, outside ICU4C and ICU4X. That code has the notion of available non-collation locales and the notion of available collation locales, and those lists are informed enough of ICU4C and ICU4X capabilities that performing locale resolution on them outside ICU4C or ICU4X results in a "resolved locale" that yields reasonable behavior when used as input to ICU4C's or ICU4X's locale resolution.

ICU4X's collation locale resolution working on script rather than language for the collations that come from CLDR's zh.xml makes it relevant to ask if SpiderMonkey's available locale list for collation should include some non-zh language codes for languages that use either Hans or Hant script if SpiderMonkey migrates to ICU4X. For example, should yue be there?

hsivonen, Aug 19 '25 10:08