icu4x
Design how we want to resolve preferences
This issue is about how we resolve Unicode extension keywords, such as -u-hc, -u-fw, -u-nu, -u-co, and -u-ca, i.e. icu::locale::preferences enums, against the language, script, and region.
What we currently do
We currently do something very different in each case.
Hour Cycle
https://github.com/unicode-org/icu4x/blob/175a3167bdf37f48b4ffc52f0474905a95128514/components/datetime/src/raw/neo.rs#L258
- If we have an explicit hour cycle: when loading the time pattern, use the marker attribute h (12-hour) or h0 (24-hour).
- If we don't have an explicit hour cycle, or if we couldn't load data for the explicit hour cycle: when loading the time pattern, use the marker attribute j, which means "default for locale".
Example data:
https://github.com/unicode-org/icu4x/tree/icu%402.0.0-beta2/provider/source/data/debug/datetime/TimeNeoSkeletonPatternsV1
Then, the hour cycle behavior is simply based on whether the loaded pattern has an H or an h (or K or k).
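As a rough illustration of the two steps above (choosing the marker attribute, then reading the hour cycle off the loaded pattern), here is a minimal Rust sketch. The names ExplicitHours, attribute_candidates, HourCycle, and hour_cycle_of_pattern are hypothetical and are not ICU4X APIs. The attribute strings are simplified to the h, h0, and j described above, and the symbol-to-cycle mapping assumes the usual CLDR pattern symbols.

```rust
/// Hypothetical model of an explicit hour-cycle preference, reduced to the
/// two cases named in the text: 12-hour and 24-hour.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ExplicitHours {
    Twelve,
    TwentyFour,
}

/// Marker attributes to try, in order. An explicit preference is tried
/// first; `j` ("default for locale") is always the last resort.
pub fn attribute_candidates(explicit: Option<ExplicitHours>) -> Vec<&'static str> {
    match explicit {
        Some(ExplicitHours::Twelve) => vec!["h", "j"],
        Some(ExplicitHours::TwentyFour) => vec!["h0", "j"],
        None => vec!["j"],
    }
}

/// Hour cycles, named after the `-u-hc` keyword values.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum HourCycle {
    H11, // pattern symbol `K`, hours 0-11
    H12, // pattern symbol `h`, hours 1-12
    H23, // pattern symbol `H`, hours 0-23
    H24, // pattern symbol `k`, hours 1-24
}

/// Returns the hour cycle implied by the first hour symbol in `pattern`,
/// ignoring characters inside single-quoted literal text.
pub fn hour_cycle_of_pattern(pattern: &str) -> Option<HourCycle> {
    let mut in_quote = false;
    for c in pattern.chars() {
        match c {
            '\'' => in_quote = !in_quote,
            _ if in_quote => {}
            'K' => return Some(HourCycle::H11),
            'h' => return Some(HourCycle::H12),
            'H' => return Some(HourCycle::H23),
            'k' => return Some(HourCycle::H24),
            _ => {}
        }
    }
    None
}
```

With this shape, the resolved hour cycle is a property of whichever pattern was actually loaded, so a failed explicit lookup that falls back to j still yields a consistent answer.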
Note: most locales duplicate data between the explicit and implicit data marker attributes. (The data infra deduplicates them later.)
Numbering System
https://github.com/unicode-org/icu4x/blob/450bb445e32149e2825a055b5ebd8490b4b6694b/components/decimal/src/lib.rs#L193
- If we have an explicit numbering system: use it directly as the marker attribute.
- If we don't have an explicit numbering system, or if we couldn't load data for the explicit numbering system: do not use a marker attribute at all.
Example data:
https://github.com/unicode-org/icu4x/blob/icu%402.0.0-beta2/provider/source/data/debug/decimal/DecimalSymbolsV2/th.json
Then, the loaded data struct contains the identity of the numbering system in its data payload.
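The load-with-fallback behavior described above could be sketched roughly as follows. This is an illustration only: DecimalSymbols, MockProvider, load_symbols, and sample_provider are hypothetical names, the HashMap stands in for a real data provider, and the sample entries are made up for the example rather than copied from the linked data.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the decimal symbols payload. As in the text,
/// the payload itself records which numbering system it describes.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct DecimalSymbols {
    pub numbering_system: String,
    pub decimal_separator: char,
}

/// Hypothetical in-memory provider keyed by (locale, marker attribute).
/// `None` stands for "no marker attribute".
pub type MockProvider = HashMap<(String, Option<String>), DecimalSymbols>;

/// Try the explicit numbering system as the marker attribute first; if that
/// entry is missing (or nothing explicit was requested), load without one.
pub fn load_symbols<'a>(
    provider: &'a MockProvider,
    locale: &str,
    explicit_nu: Option<&str>,
) -> Option<&'a DecimalSymbols> {
    if let Some(nu) = explicit_nu {
        let key = (locale.to_string(), Some(nu.to_string()));
        if let Some(found) = provider.get(&key) {
            return Some(found);
        }
    }
    provider.get(&(locale.to_string(), None))
}

/// Illustrative sample data (assumed values, not real datagen output).
pub fn sample_provider() -> MockProvider {
    let mut provider = MockProvider::new();
    provider.insert(
        ("th".to_string(), None),
        DecimalSymbols { numbering_system: "latn".to_string(), decimal_separator: '.' },
    );
    provider.insert(
        ("th".to_string(), Some("thai".to_string())),
        DecimalSymbols { numbering_system: "thai".to_string(), decimal_separator: '.' },
    );
    provider
}
```

The caller never computes the resolved numbering system itself; it reads numbering_system from whichever payload the lookup returned.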
First Day of the Week
https://github.com/unicode-org/icu4x/blob/450bb445e32149e2825a055b5ebd8490b4b6694b/components/calendar/src/week_of.rs#L53
Here, we don't even read -u-fw (bug!) and always load the payload. Example data:
https://github.com/unicode-org/icu4x/tree/icu%402.0.0-beta2/provider/source/data/debug/calendar/CalendarWeekV2
Note that this is a region-based data marker.
Calendar
https://github.com/unicode-org/icu4x/blob/450bb445e32149e2825a055b5ebd8490b4b6694b/components/calendar/src/any_calendar.rs#L1062
The resolution logic is hard-coded (incorrectly: https://github.com/unicode-org/icu4x/pull/6325). There is a comment saying that we should eventually make it data-driven.
Collator
https://github.com/unicode-org/icu4x/blob/450bb445e32149e2825a055b5ebd8490b4b6694b/components/collator/src/comparison.rs#L152
Similar to the numbering system, we first attempt to load with a marker attribute equal to the explicit collation type, and if that fails, we load with no marker attribute. Presumably we can then determine which flavor of collation we resolved to.
Example data:
https://github.com/unicode-org/icu4x/tree/icu%402.0.0-beta2/provider/source/data/debug/collator/CollationMetadataV1
Discussion
Numbering System, Hour Cycle, and Collator are all basically the same: load the payload that you were going to load anyway to do the thing you need the attribute for, and figure out the preference from that. If you need any other special logic to correctly resolve the preference, do it at datagen time and emit the correct set of locales.
This strategy doesn't really work for the calendar, because there isn't a single payload we always load. Each calendar needs a different payload, or no payload at all.
Another key difference is that the calendar system is region-based, whereas most of these preferences (except first day of week) are language-based. (There is an ongoing discussion involving @richgillam and others about moving more of these preferences to be region-based.)
What should we do moving forward?
I'm not here to express an opinion. I am here to present my research and solicit other people's opinions.
@Manishearth @zbraniecki @robertbastian
The closest way to model the calendar like the other preferences would be to add a new data marker called something like CalendarPreferenceV1, which contains a single string equal to the preferred calendar. It would be a region-based string, and any likely subtags would be resolved at datagen time.
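To make that option concrete, here is a minimal sketch of what such a marker's payload and lookup might look like. Everything here is an assumption for discussion: CalendarPreferenceV1 does not exist yet, RegionData is a HashMap standing in for a region-keyed provider, and build_region_data and resolve_calendar are hypothetical helpers. The sketch also assumes the region has already been resolved (for example via likely subtags at datagen time) and that "001" carries the world default.

```rust
use std::collections::HashMap;

/// Hypothetical payload: a single string equal to the preferred calendar.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct CalendarPreferenceV1 {
    pub preferred_calendar: String,
}

/// Hypothetical stand-in for a region-keyed data provider.
pub type RegionData = HashMap<String, CalendarPreferenceV1>;

/// Hypothetical datagen step: turn (region, calendar) rows into payloads.
pub fn build_region_data(rows: &[(&str, &str)]) -> RegionData {
    rows.iter()
        .map(|row| {
            (
                row.0.to_string(),
                CalendarPreferenceV1 { preferred_calendar: row.1.to_string() },
            )
        })
        .collect()
}

/// Runtime resolution: an explicit `-u-ca` value wins; otherwise look up
/// the region, then fall back to the world entry "001".
pub fn resolve_calendar(
    data: &RegionData,
    explicit_ca: Option<&str>,
    region: Option<&str>,
) -> Option<String> {
    if let Some(ca) = explicit_ca {
        return Some(ca.to_string());
    }
    region
        .and_then(|r| data.get(r))
        .or_else(|| data.get("001"))
        .map(|p| p.preferred_calendar.clone())
}
```

The cost of this design is one extra data lookup on a hot path before the calendar-specific payload (if any) can be loaded.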
Or, we can keep it hard-coded, which is probably more efficient given that this is a fairly hot code path, and just test it better.
There are advantages and disadvantages to both approaches.
I think hardcoding it is probably better, and we should test it. We can test it against CLDR data as well.
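A sketch of what "hardcode it and test it against data" might look like, under stated assumptions: default_calendar_for_region is a hypothetical function with a few illustrative arms (it is not the logic in any_calendar.rs), and REFERENCE_ROWS is a hand-written stand-in for rows that a real test would read from CLDR.

```rust
/// Hypothetical hard-coded resolution with illustrative arms only.
pub fn default_calendar_for_region(region: &str) -> &'static str {
    match region {
        "TH" => "buddhist",
        "AF" | "IR" => "persian",
        _ => "gregorian",
    }
}

/// Stand-in reference rows (region, preferred calendar). A real test would
/// load these from CLDR instead of listing them by hand.
pub const REFERENCE_ROWS: &[(&str, &str)] = &[
    ("TH", "buddhist"),
    ("AF", "persian"),
    ("IR", "persian"),
    ("US", "gregorian"),
];

/// Returns the regions where the hard-coded answer disagrees with the
/// reference data. An empty result means the two are in sync.
pub fn mismatches(reference: &[(&str, &str)]) -> Vec<String> {
    reference
        .iter()
        .filter(|row| default_calendar_for_region(row.0) != row.1)
        .map(|row| row.0.to_string())
        .collect()
}
```

A test asserting that mismatches returns an empty list would then fail whenever the reference data adds or changes a region that the match statement does not cover.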
Putting in 2.0-stretch in case we want to change anything. (A likely conclusion is that "status quo is fine")
- @sffc Should we try to centralize this logic in icu_locale? It's what I originally had in mind early in the project.
- @Manishearth I don't see much value in centralizing the logic. These preferences are defined in BCP-47, but that is the extent of their similarities. I think a use-case-specific solution, like we currently have, is fine.
- @sffc I wish we had at least some principles here; it seems we don't really have any.
- @Manishearth I think we have principles because we arrived at a principled thing, which is the data model for these things. It has to do with what is efficient data-wise. We put these into either attributes or markers, which we decide based on how we want people slicing these.
- @sffc Okay, so if we put them into markers, then we default to code-based fallback, like we do for calendar? And otherwise we use provider-based fallback?
- @Manishearth I was thinking more about the data layout.
- @Manishearth If we want to add CalendarPreferenceV1, we can probably do it in 2.x. And if we can't, it's probably not that important.
- @sffc I think the next time we introduce a new component that uses a new Unicode extension keyword, we should revisit this issue and perhaps write down the principles.