icu4x ICU4X objects that try_new with a provider should store and expose the resolved locale

trafficstars

ECMA-402 requires various objects to be able to expose the resolved options. Common across different types is the resolved locale.

For ECMA-402 compat, we should make various ICU4X objects call take_metadata_and_payload instead of take_payload when loading their primary provider-backed payload and store the DataLocale from the metadata. We should then have a convention across ICU4X for retrieving that Locale from the ICU4X object.

The finer points of DataLocale vs. Locale are unclear to me, so I'm not sure if the convention should be fn resolved_locale(&self) -> &DataLocale allowing the application to call .into_locale() or fn resolved_locale(&self) -> Locale.

Aug 22 '23 12:08 hsivonen

What is the "primary provider-backed payload"? It's not well defined in all cases.

The whole concept of the "resolved locale" is fraught and it's not up to ICU4X to perpetuate it. #58 has suggestions for how to solve the 402 problem in user land.

Aug 22 '23 18:08 sffc

What is the "primary provider-backed payload"? It's not well defined in all cases.

For the collator, it would be the payload for CollationMetadataV1Marker.

The whole concept of the "resolved locale" is fraught and it's not up to ICU4X to perpetuate it. https://github.com/unicode-org/icu4x/issues/58 has suggestions for how to solve the 402 problem in user land.

My understanding of provider internals is insufficient for understanding what exactly #58 proposes for the ECMA-402 resolved locale info, especially in the case where the glue code opted to implement "best fit" by delegating to ICU4X's lookup mechanism.

Should an application do what Boa does, and try to load data payloads ahead of actual ICU4X object instantiation with the assumption that the preflight with match the actual instantiation or should the application load whatever is considered the primary payload lazily after the fact if the code calling ECMA-402 APIs requests the resolved options?

Either way, which key should the glue code query for if the primary payload for a given ICU4X object isn't well defined in all cases?

Aug 23 '23 06:08 hsivonen

Either way, which key should the glue code query for if the primary payload for a given ICU4X object isn't well defined in all cases?

I think it would be fine if ICU4X suggested which key to use for the resolved locale in cases required by 402

Should an application do what Boa does, and try to load data payloads ahead of actual ICU4X object instantiation with the assumption that the preflight with match the actual instantiation or should the application load whatever is considered the primary payload lazily after the fact if the code calling ECMA-402 APIs requests the resolved options?

Neither. The data provider should be instrumented to get the resolved locale out of the DataResponseMetadata.

Aug 23 '23 20:08 sffc

Either way, which key should the glue code query for if the primary payload for a given ICU4X object isn't well defined in all cases?

I think it would be fine if ICU4X suggested which key to use for the resolved locale in cases required by 402

Does it make sense to merely suggest it as opposed to providing a concrete crate for it with documentation that the crate is only provided for 402 compat? Or providing a Cargo option to enable such code in each component directly? I wouldn't mind if the option was named to discourage use along the lines of enable_conceptually_questionable_resolved_locale_for_ecma_402_compat_only.

Should an application do what Boa does, and try to load data payloads ahead of actual ICU4X object instantiation with the assumption that the preflight with match the actual instantiation or should the application load whatever is considered the primary payload lazily after the fact if the code calling ECMA-402 APIs requests the resolved options?

Neither. The data provider should be instrumented to get the resolved locale out of the DataResponseMetadata.

The example seems to preclude the use of the baked-mode constructors that don't take a provider argument, which is unfortunate. I'm not sure, but my initial reaction is that I'd rather do a duplicative lookup if the JS app looks at the resolved options than defeat the baked code path for object construction.

Aug 24 '23 07:08 hsivonen

Would it be incorrect to always return the requested locale as the resolved locale?

Aug 24 '23 08:08 robertbastian

Would it be incorrect to always return the requested locale as the resolved locale?

That question can be understood in at least three senses: 1) what the caller wants to know if they care to actually inspect the resolved locale, 2) what's Web-compatible, 3) what fits within the spec's notion of implementation-defined.

In sense 1, incorrect. (Most notably, if the requested locale has a non-language component and the resolved locale does not retain that component, this shows that the implementation's data does not explicitly alter the main flavor of the language in a way that the component would change. Is this actionable information for the caller? Perhaps not.)

In sense 2, probably not Web-compatible considering that things that deviate from what major browsers do tends not to be Web-compatible but maybe Web-compatible in the sense that the information isn't really that actionable anyway.

In sense 3, maybe not strictly incorrect if you read all implementation-defined behavior as not required to even make sense and the observer not getting an infinite number of observations.

Aug 24 '23 09:08 hsivonen

Returning the requested locale as the resolved locale gives the same result as preresolving locales at datagen time. If we don't have es-419 data and we fall back to es, I don't think the information whether the data is actually es-419 is actionable.

The bigger problem is falling back to und. Maybe we can have a flag in locid_transform that changes fallback behaviour to return an error when und is reached. This would need to be key-dependent, as we know for collation, for example, fallback to und is fine.

I agree that a solution that is compatible with compiled data would be preferrable.

Aug 24 '23 10:08 robertbastian

I didn't look at the source, but I think Intl.Segmenter, which is a rather special case in its relationship to locale data, in Chrome fakes its resolved locale roughly by 1) considering languages that CLDR knows about in some sense as supported and 2) retaining the region if the region is CLDR-known to be associated with the language (sv-FI retains the region, but fi-SE does not).

Aug 24 '23 10:08 hsivonen

Do you know any concrete uses of this information, which would break if we deviate? I don't want to let Chrome dictate how standards should be interpreted.

Aug 24 '23 10:08 robertbastian

From a very quick look at GitHub search, I see one use case beyond test cases and debug logging:

Determining the host locale by executing Intl.DateTimeFormat().resolvedOptions().locale per StackOverflow and various other teaching materials.

So perhaps just echoing back the requested locale could work and not break the Web.

Of course, this only looks at the case where .locale is appended directly to .resolvedOptions() instead of there being an intermediate variable.

Aug 24 '23 10:08 hsivonen

Yeah, echoing back the requested locale probably works. I think the most useful piece of information you can get is which one out of a list of locales you got. For example, if the locales requested were ["ff", "fr", "ar"], it is potentially useful to know that "fr" was chosen out of that list, so that you can for example render other components in that language.

Aug 24 '23 15:08 sffc

Yeah, echoing back the requested locale probably works.

Should ECMA-402 change to require this?

Aug 25 '23 09:08 hsivonen

@zbraniecki Thoughts on the above?

Aug 25 '23 17:08 sffc

The web reality is that it will return the closest locale the engine had data for:

(new Intl.DateTimeFormat("es-FR")).resolvedOptions().locale == "es"

Aug 25 '23 20:08 zbraniecki

The key question is whether ICU4X should push the first implementation to ship ICU4X-backed ECMA-402 to the Web to bear the cost of finding out if deviating from the current Web reality is Web-compatible.

Given that ICU4X is supposed to work as an ECMA-402 back end, it would be rather odd for ICU4X to resist being able to implement what ECMA-402 currently says in a similar way to how deployed implementations do it.

Perhaps echoing back the requested locale would work, but is e.g. Chrome willing to try it out to see if it's Web-compatible?

I suggest doing what I originally requested but behind a repulsively-named Cargo option. That is, if conceptually_questionable_resolved_locale_for_ecma_402_compat_only is enabled, each ICU4X objects that has try_new with locale would gain a field for the resolved locale, a getter for that field, and would store the locale from the appropriate payload metadata in that field in the code that loads the data.

Aug 30 '23 13:08 hsivonen

@sffc first problem: it's unclear how clients use the resolved locale. It might be useful to figure out whether you fell back to und or to determine which locale from a list got used.
@sffc second problem: locale you end up falling back to is completely arbitrary, based on data. If you request e.g. hi-Latn it may fall back to something like English, which is valid, even if it looks like you did a fallback to the host locale or elsewhere. To get the ECMA/Chrome behavior you would need to run datagen in hybrid mode so we can remember how things resolved.
@markusicu - I feel sorry for everyone who has to deal with this because of ECMA. We've dealt with this in ICU4C/ICU4J and also have struggled with it. It's messy and confusing and unsatisfying. In ICU we have an API that lets you get "requested", "valid", and "actual" locale. The "requested" locale, we deleted a while ago. "Valid" and "actual" have been draft for a long time because we couldn't figure out what it should really do. "Actual" should say where the data comes from since the ICU model is equivalent to the CLDR model. "de-AT" might get data from "de" or from root. The messy thing of course is that services load data from multiple sources, like datetime format. For collator, there's often a clear answer. But if you have an alphabetic index, there might be data coming from multiple sources. You can choose one or the other but it doesn't always make sense.
@hsivonen - Currently ECMA-402 does not let you know that you fall back to root because you fall back to the host locale first. You can't tell if you got the host locale because of fallback or because the requested locale resolves to the same as the host locale. The use case as far as I can tell is one stack overflow question where you can read the host locale. But as an engine developer it is risky to do something different than the browser with the larger market share. So considering that ICU4X has ECMA-402 as a design goal, it seems like it would make sense to have a cargo option to opt in.
@markusicu - Can the ECMA API be deprecated?
@Manishearth - The tricky thing is that what you really care about is, did fallback go past some threshold? The problematic threshold is both usecase and key dependent. hi-Latn falling back to en maybe is okay for symbols but not patterns. Maybe en-IN is okay but not en-001. You could come up with a strategy for plurals saying, it's okay for it to fall back across countries but not across languages otherwise. For datetime symbols, you could have fallback multiple times because there are multiple keys, theoretically capable of multiple answers, barring runtime fallback mode. If you really want the correct answer, you need hybrid mode, not runtime fallback mode. The fact that it's key-dependent means I don't see a way of doing this well, and that's even before deciding the threshold for each use case. The main use case seems to be, if fallback went too far, you need to load more data. But "going too far" is not something we can define, it's use case dependent and key dependent and a mess.
@Manishearth - can you link to the stackovverflow q?
@robertbastian - https://stackoverflow.com/questions/673905/how-to-determine-users-locale-within-browser
@robertbastian - We support this in ICU4X if you have custom data. This is the ECMA wrapper's problem. I also understand the web compat risk. Can we work with Chrome to change the behavior and change the ECMA-402 spec to clarify what the resolved locale actually means? Because we as the ICU4X and Unicode team can't come up with use cases for this.
@markusicu - the use case of inspecting the getLocale() is better served via an input that says what fallback is acceptable. we have started adding that in some ICU APIs ("noSubstitute" boolean), basically saying whether it's ok to fall back to the default locale.
@markusicu - if people care about fallback-too-far, then the data provider should signal that, rather than the object built from it
@sffc +1 we should clarify on 402 side. But we can already do this in ICU4X, have a tutorial for this. You gen in hybrid mode, and then pick a key and use this tutorial to access the resolved locale for that key. Tricky API, but consistent with our design. I do not like having a Cargo feature for this, because storing a locale will increase the size of all our internal formatters. We're actively trying to reduce global feature flags, because they are very infectious in some build systems.
@hsivonen: If we want to use the compiled data APIs we would have to do a second lookup. I'd prefer adding an API and field behind features [...]
@sffc: You can do this without a second lookup
@Manishearth - The design is incomplete because there's not a single answer
@hsivonen - I want something that's close enough for what Chrome does for 402
@Manishearth - That sounds too niche; we should give you the tools to build it yourself. Shane's tutorial shows how to do it. We could design an FFI or non-FFI API for this that stores every key that was resolved. That's an okay data structure to design since it's a fancier version of what's in the tutorial. We could stick it in the adapters. Browser implementations could use that. So I like a path where we build a data structure...
@sffc: I like that proposal. About what Henri said about not being able to use the compiled data APIs: it makes me a little sad, these APIs are designed as a low-friction entry point, anyone who wants to do beyond-bare-bones things should use the full feature set we have. We've added compiled data APIs not for power users like Firefox. I don't think it's in scope to add a feature, it's a slippery slope to add a flag to use this with compiled data. I think we need to keep compiled data extremely clean and simple, we can build more abstractions outside compiled data.
@robertbastian - compiled_data uses runtime fallback, so we can't support this anyway because you need data built in hybrid mode. I don't even want to expose this as an adapter, it feels like an antipattern to look at the resolved locale, and users should not be doing this. There will probably be at most two clients who will use ICU4X to implement ECMA402, and they should use custom optimized wrappers
@hsivonen - If Chrome were to take the position that this stuff is silly, let's echo back the requested locale, that would be great. Putting in the spec that it is deprecated doesn't show whether echoing the request back breaks the Web.
@Manishearth - We have more than 2 use cases. There are many data loading strategies that need something like this. If you're doing runtime data loading, you may need a catalog of all the data you have. In Dart Intl, there was some talk on runtime data loading like that.
@sffc: we can probably update the 402 spec to echo the normalized resolved locale and have Frank implement this in Chrome.
@robertbastian - I still think the standalone adapters API is not necessary; it's very simple to do this yourself and whatever we do will not work for everyone (mutex or not, what exactly to store, etc.).
@Manishearth/@sffc - FFI.
@robertbastian - Okay, we could add it to FFI but not Rust.
@hsivonen - Do we know how much Boa wants to match Node/Chrome behavior?
@sffc - Maybe we should tag Jedel on the issue.

Aug 31 '23 18:08 sffc

@jedel1043 , see above:

Do we know how much Boa wants to match Node/Chrome behavior?

Sep 01 '23 08:09 hsivonen

@jedel1043 , see above:

Do we know how much Boa wants to match Node/Chrome behavior?

Personally, I think it's alright if Boa has to deviate from V8 in order to offer better locale results.

Sep 01 '23 08:09 jedel1043

Based on https://github.com/tc39/ecma402/issues/830#issuecomment-1711965221, there could a way for datagen to store the list of locales for which it generated data and then some API to access that list. Needs design work.

Discuss with:

@sffc
@robertbastian
@hsivonen

Optional:

@Manishearth
@zbraniecki

Oct 19 '23 17:10 sffc

Can we merge the discussion into #58? It seems to be going the same direction

Feb 28 '24 15:02 robertbastian

In favor of merging.

Feb 28 '24 16:02 Manishearth

Not exactly the same set of questions. This thread is about ResolvedLocale and #58 is about SupportedLocales. The solutions may overlap.

Feb 29 '24 06:02 sffc

In #58 we have concluded that exposing a set of supported locales is not feasible, but determining the resolved locale is. You literally have a draft ResolvedLocalesAdapter for that issue, so I really struggle to understand what the difference is.

Feb 29 '24 08:02 robertbastian

I changed https://github.com/unicode-org/icu4x/pull/4607 to be closing this issue.

Feb 29 '24 21:02 sffc

@hsivonen - In Segmenter and Collator, locales often fall back to root. Is this solution for APIs that process or consume natural language content? ICU4C retains knowledge of which languages it explicitly knew about but which have root as valid. One bug is that Sanskrit is listed as known, but that is a bug since the Sanskrit file is draft, so the data still falls back to root. For Segmenter, in Chrome and Firefox, supported-locale queries are completely bogus: it has nothing to do with segmentation. It queries some data like, does CLDR know about this language? I think it checks DateTime formatting. For APIs that process language, are there more cases where this is interesting than Collator?
@sffc - Do we retain the data about whether the language is supported in Collator?
@hsivonen - Not right now. We would need to add that to icuexportdata. But, it would be okay to manually maintain the list because it is unlikely to change. Actually that list could just go into the ECMA-402 implementation.
@sffc - My takeaway is that the model we've designed for resolved/supported locales should be good for the formatter use case, and it is generalizable enough to support the language processing APIs.
@hsivonen - If we have a list for Collator, we could look at the data files once, write it down, and not make genrb poke at the APIs.

Mar 12 '24 14:03 sffc

Also see comment from @mihnita in https://github.com/unicode-org/icu4x/issues/2237#issuecomment-1201557171

Jun 27 '24 12:06 sffc

2.0 blocking question: does baked data do this? From preliminary tests there might a non-trivial size impact, I'll get some better numbers.

Jun 27 '24 12:06 robertbastian

I still think retain-base-languages should fix this, but currently it does not fill in the missing locales:

$ cargo run -p icu4x-datagen -- --deduplication retain-base-languages --markers collator/data@1 collator/meta@1 --locales full --format dir --out /tmp/collator_retain -W
warning: ignoring `resolver` config table without `-Zmsrv-policy`
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.11s
     Running `target/debug/icu4x-datagen --deduplication retain-base-languages --markers 'collator/data@1' 'collator/meta@1' --locales full --format dir --out /tmp/collator_retain -W`
2024-08-08T18:07:18.825Z INFO  [icu_provider_export::export_impl] Datagen configured with deduplication retaining base languages, and these locales: ["<all>"]
2024-08-08T18:07:19.387Z INFO  [icu_provider_export::export_impl] Generated marker collator/meta@1 (0.561s, 'si/dict' in 1.829ms, flushed in 4.183µs)
2024-08-08T18:07:19.529Z INFO  [icu_provider_export::export_impl] Generated marker collator/data@1 (0.704s, 'zh/stroke' in 0.352s, flushed in 4.723µs)
$ ls /tmp/collator_retain/collator/meta@1/
af.json  br.json       dict              et.json       gu.json   ig.json  kok.json  ml.json  pa.json   sk.json       te.json   unihan
am.json  bs-Cyrl.json  dsb.json          fa.json       ha.json   is.json  ku.json   mn.json  phonebk   sl.json       th.json   ur.json
ar.json  bs.json       ee.json           ff-Adlm.json  haw.json  ja.json  ky.json   mr.json  phonetic  smn.json      tk.json   uz.json
as.json  ceb.json      el.json           fi.json       he.json   ka.json  lkt.json  mt.json  pl.json   sq.json       to.json   vi.json
az.json  chr.json      emoji             fil.json      hi.json   kk.json  ln.json   my.json  ps.json   sr.json       trad      wo.json
be.json  compat        en-US-posix.json  fo.json       hr.json   kl.json  lo.json   ne.json  ro.json   sr-Latn.json  tr.json   yi.json
bg.json  cs.json       eo.json           fr-CA.json    hsb.json  km.json  lt.json   no.json  ru.json   stroke        ug.json   yo.json
bn.json  cy.json       eor               fy.json       hu.json   kn.json  lv.json   om.json  se.json   sv.json       uk.json   zh.json
bo.json  da.json       es.json           gl.json       hy.json   ko.json  mk.json   or.json  si.json   ta.json       und.json  zhuyin

Aug 08 '24 18:08 sffc

@hsivonen Assumed the thing that @sffc and @robertbastian agreed to would solve this, but seen no attention payed to since
@sffc We have since implemented SupportedLocale and ResolveLocale (https://github.com/unicode-org/icu4x/issues/58 resolved but not closed)
@hsivonen Main webcompat issue is that the web expects certain locales that ICU4C knows are "known" locales (en, fr) even though they fall back to root. We could hack around this with somewhere to stick the list of languages that ICU4C knows fall back to root (english, french, dutch, portuguese, italian, etc.), with the exception of sanskrit. Web exposed behavior can be to pretend that they resolve to themselves.
@sffc If I run icu4x-datagen with new flags: --deduplication retain-base-languages --markers collator/data@1 --locales full --format dir, languages missing are?
- @hsivonen It doesn't have en.json, fr.json, etc. (even with collator/meta@1 it still doesn't retain base languages)
@sffc - I think I see why. retain-base-languages doesn't include languages unknown to data, such as Klingon tlh. As far as collator data is concerned, en is like Klingon because there is no data for it. We could change retain-base-languages behavior, or we could add en to the ICU Export Data.
@Manishearth - It also sounds that there is a workaround for browsers that will work. I would like if ICU4X has the 402 weirdness available in some way.
@sffc - It sounds like this doesn't block 2.0 from an API standpoint, and we have paths forward.

Aug 08 '24 18:08 sffc

icu4x icu4x copied to clipboard

ICU4X objects that try_new with a provider should store and expose the resolved locale

icu4x
icu4x copied to clipboard