icu4x
icu4x copied to clipboard
ICU4X objects that try_new with a provider should store and expose the resolved locale
ECMA-402 requires various objects to be able to expose the resolved options. Common across different types is the resolved locale.
For ECMA-402 compat, we should make various ICU4X objects call take_metadata_and_payload instead of take_payload when loading their primary provider-backed payload and store the DataLocale from the metadata. We should then have a convention across ICU4X for retrieving that Locale from the ICU4X object.
The finer points of DataLocale vs. Locale are unclear to me, so I'm not sure if the convention should be fn resolved_locale(&self) -> &DataLocale allowing the application to call .into_locale() or fn resolved_locale(&self) -> Locale.
What is the "primary provider-backed payload"? It's not well defined in all cases.
The whole concept of the "resolved locale" is fraught and it's not up to ICU4X to perpetuate it. #58 has suggestions for how to solve the 402 problem in user land.
What is the "primary provider-backed payload"? It's not well defined in all cases.
For the collator, it would be the payload for CollationMetadataV1Marker.
The whole concept of the "resolved locale" is fraught and it's not up to ICU4X to perpetuate it. https://github.com/unicode-org/icu4x/issues/58 has suggestions for how to solve the 402 problem in user land.
My understanding of provider internals is insufficient for understanding what exactly #58 proposes for the ECMA-402 resolved locale info, especially in the case where the glue code opted to implement "best fit" by delegating to ICU4X's lookup mechanism.
Should an application do what Boa does, and try to load data payloads ahead of actual ICU4X object instantiation with the assumption that the preflight with match the actual instantiation or should the application load whatever is considered the primary payload lazily after the fact if the code calling ECMA-402 APIs requests the resolved options?
Either way, which key should the glue code query for if the primary payload for a given ICU4X object isn't well defined in all cases?
Either way, which key should the glue code query for if the primary payload for a given ICU4X object isn't well defined in all cases?
I think it would be fine if ICU4X suggested which key to use for the resolved locale in cases required by 402
Should an application do what Boa does, and try to load data payloads ahead of actual ICU4X object instantiation with the assumption that the preflight with match the actual instantiation or should the application load whatever is considered the primary payload lazily after the fact if the code calling ECMA-402 APIs requests the resolved options?
Neither. The data provider should be instrumented to get the resolved locale out of the DataResponseMetadata.
Either way, which key should the glue code query for if the primary payload for a given ICU4X object isn't well defined in all cases?
I think it would be fine if ICU4X suggested which key to use for the resolved locale in cases required by 402
Does it make sense to merely suggest it as opposed to providing a concrete crate for it with documentation that the crate is only provided for 402 compat? Or providing a Cargo option to enable such code in each component directly? I wouldn't mind if the option was named to discourage use along the lines of enable_conceptually_questionable_resolved_locale_for_ecma_402_compat_only.
Should an application do what Boa does, and try to load data payloads ahead of actual ICU4X object instantiation with the assumption that the preflight with match the actual instantiation or should the application load whatever is considered the primary payload lazily after the fact if the code calling ECMA-402 APIs requests the resolved options?
Neither. The data provider should be instrumented to get the resolved locale out of the DataResponseMetadata.
The example seems to preclude the use of the baked-mode constructors that don't take a provider argument, which is unfortunate. I'm not sure, but my initial reaction is that I'd rather do a duplicative lookup if the JS app looks at the resolved options than defeat the baked code path for object construction.
Would it be incorrect to always return the requested locale as the resolved locale?
Would it be incorrect to always return the requested locale as the resolved locale?
That question can be understood in at least three senses: 1) what the caller wants to know if they care to actually inspect the resolved locale, 2) what's Web-compatible, 3) what fits within the spec's notion of implementation-defined.
In sense 1, incorrect. (Most notably, if the requested locale has a non-language component and the resolved locale does not retain that component, this shows that the implementation's data does not explicitly alter the main flavor of the language in a way that the component would change. Is this actionable information for the caller? Perhaps not.)
In sense 2, probably not Web-compatible considering that things that deviate from what major browsers do tends not to be Web-compatible but maybe Web-compatible in the sense that the information isn't really that actionable anyway.
In sense 3, maybe not strictly incorrect if you read all implementation-defined behavior as not required to even make sense and the observer not getting an infinite number of observations.
Returning the requested locale as the resolved locale gives the same result as preresolving locales at datagen time. If we don't have es-419 data and we fall back to es, I don't think the information whether the data is actually es-419 is actionable.
The bigger problem is falling back to und. Maybe we can have a flag in locid_transform that changes fallback behaviour to return an error when und is reached. This would need to be key-dependent, as we know for collation, for example, fallback to und is fine.
I agree that a solution that is compatible with compiled data would be preferrable.
I didn't look at the source, but I think Intl.Segmenter, which is a rather special case in its relationship to locale data, in Chrome fakes its resolved locale roughly by 1) considering languages that CLDR knows about in some sense as supported and 2) retaining the region if the region is CLDR-known to be associated with the language (sv-FI retains the region, but fi-SE does not).
Do you know any concrete uses of this information, which would break if we deviate? I don't want to let Chrome dictate how standards should be interpreted.
From a very quick look at GitHub search, I see one use case beyond test cases and debug logging:
Determining the host locale by executing Intl.DateTimeFormat().resolvedOptions().locale per StackOverflow and various other teaching materials.
So perhaps just echoing back the requested locale could work and not break the Web.
Of course, this only looks at the case where .locale is appended directly to .resolvedOptions() instead of there being an intermediate variable.
Yeah, echoing back the requested locale probably works. I think the most useful piece of information you can get is which one out of a list of locales you got. For example, if the locales requested were ["ff", "fr", "ar"], it is potentially useful to know that "fr" was chosen out of that list, so that you can for example render other components in that language.
Yeah, echoing back the requested locale probably works.
Should ECMA-402 change to require this?
@zbraniecki Thoughts on the above?
The web reality is that it will return the closest locale the engine had data for:
(new Intl.DateTimeFormat("es-FR")).resolvedOptions().locale == "es"
The key question is whether ICU4X should push the first implementation to ship ICU4X-backed ECMA-402 to the Web to bear the cost of finding out if deviating from the current Web reality is Web-compatible.
Given that ICU4X is supposed to work as an ECMA-402 back end, it would be rather odd for ICU4X to resist being able to implement what ECMA-402 currently says in a similar way to how deployed implementations do it.
Perhaps echoing back the requested locale would work, but is e.g. Chrome willing to try it out to see if it's Web-compatible?
I suggest doing what I originally requested but behind a repulsively-named Cargo option. That is, if conceptually_questionable_resolved_locale_for_ecma_402_compat_only is enabled, each ICU4X objects that has try_new with locale would gain a field for the resolved locale, a getter for that field, and would store the locale from the appropriate payload metadata in that field in the code that loads the data.
- @sffc first problem: it's unclear how clients use the resolved locale. It might be useful to figure out whether you fell back to
undor to determine which locale from a list got used. - @sffc second problem: locale you end up falling back to is completely arbitrary, based on data. If you request e.g. hi-Latn it may fall back to something like English, which is valid, even if it looks like you did a fallback to the host locale or elsewhere. To get the ECMA/Chrome behavior you would need to run datagen in hybrid mode so we can remember how things resolved.
- @markusicu - I feel sorry for everyone who has to deal with this because of ECMA. We've dealt with this in ICU4C/ICU4J and also have struggled with it. It's messy and confusing and unsatisfying. In ICU we have an API that lets you get "requested", "valid", and "actual" locale. The "requested" locale, we deleted a while ago. "Valid" and "actual" have been draft for a long time because we couldn't figure out what it should really do. "Actual" should say where the data comes from since the ICU model is equivalent to the CLDR model. "de-AT" might get data from "de" or from root. The messy thing of course is that services load data from multiple sources, like datetime format. For collator, there's often a clear answer. But if you have an alphabetic index, there might be data coming from multiple sources. You can choose one or the other but it doesn't always make sense.
- @hsivonen - Currently ECMA-402 does not let you know that you fall back to root because you fall back to the host locale first. You can't tell if you got the host locale because of fallback or because the requested locale resolves to the same as the host locale. The use case as far as I can tell is one stack overflow question where you can read the host locale. But as an engine developer it is risky to do something different than the browser with the larger market share. So considering that ICU4X has ECMA-402 as a design goal, it seems like it would make sense to have a cargo option to opt in.
- @markusicu - Can the ECMA API be deprecated?
- @Manishearth - The tricky thing is that what you really care about is, did fallback go past some threshold? The problematic threshold is both usecase and key dependent.
hi-Latnfalling back toenmaybe is okay for symbols but not patterns. Maybeen-INis okay but noten-001. You could come up with a strategy for plurals saying, it's okay for it to fall back across countries but not across languages otherwise. For datetime symbols, you could have fallback multiple times because there are multiple keys, theoretically capable of multiple answers, barring runtime fallback mode. If you really want the correct answer, you need hybrid mode, not runtime fallback mode. The fact that it's key-dependent means I don't see a way of doing this well, and that's even before deciding the threshold for each use case. The main use case seems to be, if fallback went too far, you need to load more data. But "going too far" is not something we can define, it's use case dependent and key dependent and a mess. - @Manishearth - can you link to the stackovverflow q?
- @robertbastian - https://stackoverflow.com/questions/673905/how-to-determine-users-locale-within-browser
- @robertbastian - We support this in ICU4X if you have custom data. This is the ECMA wrapper's problem. I also understand the web compat risk. Can we work with Chrome to change the behavior and change the ECMA-402 spec to clarify what the resolved locale actually means? Because we as the ICU4X and Unicode team can't come up with use cases for this.
- @markusicu - the use case of inspecting the getLocale() is better served via an input that says what fallback is acceptable. we have started adding that in some ICU APIs ("noSubstitute" boolean), basically saying whether it's ok to fall back to the default locale.
- @markusicu - if people care about fallback-too-far, then the data provider should signal that, rather than the object built from it
- @sffc +1 we should clarify on 402 side. But we can already do this in ICU4X, have a tutorial for this. You gen in hybrid mode, and then pick a key and use this tutorial to access the resolved locale for that key. Tricky API, but consistent with our design. I do not like having a Cargo feature for this, because storing a locale will increase the size of all our internal formatters. We're actively trying to reduce global feature flags, because they are very infectious in some build systems.
- @hsivonen: If we want to use the compiled data APIs we would have to do a second lookup. I'd prefer adding an API and field behind features [...]
- @sffc: You can do this without a second lookup
- @Manishearth - The design is incomplete because there's not a single answer
- @hsivonen - I want something that's close enough for what Chrome does for 402
- @Manishearth - That sounds too niche; we should give you the tools to build it yourself. Shane's tutorial shows how to do it. We could design an FFI or non-FFI API for this that stores every key that was resolved. That's an okay data structure to design since it's a fancier version of what's in the tutorial. We could stick it in the adapters. Browser implementations could use that. So I like a path where we build a data structure...
- @sffc: I like that proposal. About what Henri said about not being able to use the compiled data APIs: it makes me a little sad, these APIs are designed as a low-friction entry point, anyone who wants to do beyond-bare-bones things should use the full feature set we have. We've added compiled data APIs not for power users like Firefox. I don't think it's in scope to add a feature, it's a slippery slope to add a flag to use this with compiled data. I think we need to keep compiled data extremely clean and simple, we can build more abstractions outside compiled data.
- @robertbastian - compiled_data uses runtime fallback, so we can't support this anyway because you need data built in hybrid mode. I don't even want to expose this as an adapter, it feels like an antipattern to look at the resolved locale, and users should not be doing this. There will probably be at most two clients who will use ICU4X to implement ECMA402, and they should use custom optimized wrappers
- @hsivonen - If Chrome were to take the position that this stuff is silly, let's echo back the requested locale, that would be great. Putting in the spec that it is deprecated doesn't show whether echoing the request back breaks the Web.
- @Manishearth - We have more than 2 use cases. There are many data loading strategies that need something like this. If you're doing runtime data loading, you may need a catalog of all the data you have. In Dart Intl, there was some talk on runtime data loading like that.
- @sffc: we can probably update the 402 spec to echo the normalized resolved locale and have Frank implement this in Chrome.
- @robertbastian - I still think the standalone adapters API is not necessary; it's very simple to do this yourself and whatever we do will not work for everyone (mutex or not, what exactly to store, etc.).
- @Manishearth/@sffc - FFI.
- @robertbastian - Okay, we could add it to FFI but not Rust.
- @hsivonen - Do we know how much Boa wants to match Node/Chrome behavior?
- @sffc - Maybe we should tag Jedel on the issue.
@jedel1043 , see above:
Do we know how much Boa wants to match Node/Chrome behavior?
@jedel1043 , see above:
Do we know how much Boa wants to match Node/Chrome behavior?
Personally, I think it's alright if Boa has to deviate from V8 in order to offer better locale results.
Based on https://github.com/tc39/ecma402/issues/830#issuecomment-1711965221, there could a way for datagen to store the list of locales for which it generated data and then some API to access that list. Needs design work.
Discuss with:
- @sffc
- @robertbastian
- @hsivonen
Optional:
- @Manishearth
- @zbraniecki
Can we merge the discussion into #58? It seems to be going the same direction
In favor of merging.
Not exactly the same set of questions. This thread is about ResolvedLocale and #58 is about SupportedLocales. The solutions may overlap.
In #58 we have concluded that exposing a set of supported locales is not feasible, but determining the resolved locale is. You literally have a draft ResolvedLocalesAdapter for that issue, so I really struggle to understand what the difference is.
I changed https://github.com/unicode-org/icu4x/pull/4607 to be closing this issue.
- @hsivonen - In Segmenter and Collator, locales often fall back to root. Is this solution for APIs that process or consume natural language content? ICU4C retains knowledge of which languages it explicitly knew about but which have root as valid. One bug is that Sanskrit is listed as known, but that is a bug since the Sanskrit file is draft, so the data still falls back to root. For Segmenter, in Chrome and Firefox, supported-locale queries are completely bogus: it has nothing to do with segmentation. It queries some data like, does CLDR know about this language? I think it checks DateTime formatting. For APIs that process language, are there more cases where this is interesting than Collator?
- @sffc - Do we retain the data about whether the language is supported in Collator?
- @hsivonen - Not right now. We would need to add that to icuexportdata. But, it would be okay to manually maintain the list because it is unlikely to change. Actually that list could just go into the ECMA-402 implementation.
- @sffc - My takeaway is that the model we've designed for resolved/supported locales should be good for the formatter use case, and it is generalizable enough to support the language processing APIs.
- @hsivonen - If we have a list for Collator, we could look at the data files once, write it down, and not make genrb poke at the APIs.
Also see comment from @mihnita in https://github.com/unicode-org/icu4x/issues/2237#issuecomment-1201557171
2.0 blocking question: does baked data do this? From preliminary tests there might a non-trivial size impact, I'll get some better numbers.
I still think retain-base-languages should fix this, but currently it does not fill in the missing locales:
$ cargo run -p icu4x-datagen -- --deduplication retain-base-languages --markers collator/data@1 collator/meta@1 --locales full --format dir --out /tmp/collator_retain -W
warning: ignoring `resolver` config table without `-Zmsrv-policy`
Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.11s
Running `target/debug/icu4x-datagen --deduplication retain-base-languages --markers 'collator/data@1' 'collator/meta@1' --locales full --format dir --out /tmp/collator_retain -W`
2024-08-08T18:07:18.825Z INFO [icu_provider_export::export_impl] Datagen configured with deduplication retaining base languages, and these locales: ["<all>"]
2024-08-08T18:07:19.387Z INFO [icu_provider_export::export_impl] Generated marker collator/meta@1 (0.561s, 'si/dict' in 1.829ms, flushed in 4.183µs)
2024-08-08T18:07:19.529Z INFO [icu_provider_export::export_impl] Generated marker collator/data@1 (0.704s, 'zh/stroke' in 0.352s, flushed in 4.723µs)
$ ls /tmp/collator_retain/collator/meta@1/
af.json br.json dict et.json gu.json ig.json kok.json ml.json pa.json sk.json te.json unihan
am.json bs-Cyrl.json dsb.json fa.json ha.json is.json ku.json mn.json phonebk sl.json th.json ur.json
ar.json bs.json ee.json ff-Adlm.json haw.json ja.json ky.json mr.json phonetic smn.json tk.json uz.json
as.json ceb.json el.json fi.json he.json ka.json lkt.json mt.json pl.json sq.json to.json vi.json
az.json chr.json emoji fil.json hi.json kk.json ln.json my.json ps.json sr.json trad wo.json
be.json compat en-US-posix.json fo.json hr.json kl.json lo.json ne.json ro.json sr-Latn.json tr.json yi.json
bg.json cs.json eo.json fr-CA.json hsb.json km.json lt.json no.json ru.json stroke ug.json yo.json
bn.json cy.json eor fy.json hu.json kn.json lv.json om.json se.json sv.json uk.json zh.json
bo.json da.json es.json gl.json hy.json ko.json mk.json or.json si.json ta.json und.json zhuyin
- @hsivonen Assumed the thing that @sffc and @robertbastian agreed to would solve this, but seen no attention payed to since
- @sffc We have since implemented SupportedLocale and ResolveLocale (https://github.com/unicode-org/icu4x/issues/58 resolved but not closed)
- @hsivonen Main webcompat issue is that the web expects certain locales that ICU4C knows are "known" locales (en, fr) even though they fall back to root. We could hack around this with somewhere to stick the list of languages that ICU4C knows fall back to root (english, french, dutch, portuguese, italian, etc.), with the exception of sanskrit. Web exposed behavior can be to pretend that they resolve to themselves.
- @sffc If I run icu4x-datagen with new flags:
--deduplication retain-base-languages --markers collator/data@1 --locales full --format dir, languages missing are?- @hsivonen It doesn't have en.json, fr.json, etc. (even with
collator/meta@1it still doesn't retain base languages)
- @hsivonen It doesn't have en.json, fr.json, etc. (even with
- @sffc - I think I see why.
retain-base-languagesdoesn't include languages unknown to data, such as Klingontlh. As far as collator data is concerned,enis like Klingon because there is no data for it. We could changeretain-base-languagesbehavior, or we could addento the ICU Export Data. - @Manishearth - It also sounds that there is a workaround for browsers that will work. I would like if ICU4X has the 402 weirdness available in some way.
- @sffc - It sounds like this doesn't block 2.0 from an API standpoint, and we have paths forward.