Design how we want to resolve preferences

Open sffc opened this issue 8 months ago • 4 comments

This issue is about how we resolve Unicode extension keywords, such as -u-hc, -u-fw, -u-nu, -u-co, and -u-ca, i.e. icu::locale::preferences enums, against the language, script, and region.

What we currently do

We currently do something different in each of the five cases below.

Hour Cycle

https://github.com/unicode-org/icu4x/blob/175a3167bdf37f48b4ffc52f0474905a95128514/components/datetime/src/raw/neo.rs#L258

  1. If we have an explicit hour cycle: when loading the time pattern, use the marker attribute h (12-hour) or h0 (24-hour).
  2. If we don't have an explicit hour cycle, or if we couldn't load data for the explicit hour cycle: when loading the time pattern, use the marker attribute j, which means "default for locale".
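A minimal sketch of the two steps above. The `PatternStore` map is a hypothetical stand-in for the data provider and `HourCyclePref` is a hypothetical enum; neither is the actual ICU4X API. Only the attribute names `h`, `h0`, and `j` come from the description above.

```rust
use std::collections::HashMap;

/// Explicit hour-cycle preference, e.g. derived from `-u-hc` (hypothetical type).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum HourCyclePref {
    H12,
    H24,
}

/// Hypothetical stand-in for a data provider:
/// maps (locale, marker attribute) to a time pattern.
pub type PatternStore = HashMap<(&'static str, &'static str), &'static str>;

/// Returns the marker attribute that was actually used, plus the loaded pattern.
pub fn load_time_pattern(
    store: &PatternStore,
    locale: &'static str,
    explicit: Option<HourCyclePref>,
) -> Option<(&'static str, &'static str)> {
    // Step 1: with an explicit hour cycle, try the `h` / `h0` attribute first.
    if let Some(pref) = explicit {
        let attr = match pref {
            HourCyclePref::H12 => "h",
            HourCyclePref::H24 => "h0",
        };
        if let Some(pattern) = store.get(&(locale, attr)) {
            return Some((attr, *pattern));
        }
    }
    // Step 2: no explicit preference, or no data for it: use `j` (locale default).
    store.get(&(locale, "j")).map(|pattern| ("j", *pattern))
}
```

With this shape, a locale that only ships `j` data still resolves when the caller asks for an explicit hour cycle; it just resolves to the locale default.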

Example data:

https://github.com/unicode-org/icu4x/tree/icu%402.0.0-beta2/provider/source/data/debug/datetime/TimeNeoSkeletonPatternsV1

Then, the hour cycle behavior is simply based on whether the loaded pattern has an H or an h (or K or k).
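To illustrate reading the hour cycle back out of the loaded pattern, here is a rough sketch that scans a pattern string for the first unquoted hour field. The function and enum names are hypothetical, and the real code works on parsed patterns rather than raw strings. The symbol mapping (`K` = 0-11, `h` = 1-12, `H` = 0-23, `k` = 1-24) follows the LDML date field symbols.

```rust
/// Resolved hour cycle, named after the `-u-hc` values (hypothetical type).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum HourCycle {
    H11,
    H12,
    H23,
    H24,
}

/// Returns the hour cycle implied by the first unquoted hour symbol in `pattern`,
/// or `None` if the pattern has no hour field.
pub fn hour_cycle_of_pattern(pattern: &str) -> Option<HourCycle> {
    let mut in_quotes = false;
    for c in pattern.chars() {
        match c {
            // Single quotes delimit literal text; `''` toggles twice, which is harmless here.
            '\'' => in_quotes = !in_quotes,
            _ if in_quotes => {}
            'K' => return Some(HourCycle::H11),
            'h' => return Some(HourCycle::H12),
            'H' => return Some(HourCycle::H23),
            'k' => return Some(HourCycle::H24),
            _ => {}
        }
    }
    None
}
```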

Note: most locales duplicate data between the explicit and implicit data marker attributes. (The data infra deduplicates them later.)

Numbering System

https://github.com/unicode-org/icu4x/blob/450bb445e32149e2825a055b5ebd8490b4b6694b/components/decimal/src/lib.rs#L193

  1. If we have an explicit numbering system: use it directly as the marker attribute.
  2. If we don't have an explicit numbering system, or if we couldn't load data for the explicit numbering system: do not use a marker attribute at all.

Example data:

https://github.com/unicode-org/icu4x/blob/icu%402.0.0-beta2/provider/source/data/debug/decimal/DecimalSymbolsV2/th.json

Then, the loaded data struct contains the identity of the numbering system in its data payload.
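A sketch of that flow, assuming a hypothetical `SymbolStore` map in place of the data provider and a simplified `DecimalSymbols` payload (the real struct carries much more). The point is that the caller learns the resolved numbering system from the payload, whichever request succeeded.

```rust
use std::collections::HashMap;

/// Simplified, hypothetical payload: it carries the identity of its numbering system.
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct DecimalSymbols {
    pub numbering_system: &'static str,
    pub zero_digit: char,
}

/// Hypothetical stand-in for a data provider:
/// maps (locale, optional marker attribute) to a payload.
pub type SymbolStore = HashMap<(&'static str, Option<&'static str>), DecimalSymbols>;

pub fn load_decimal_symbols<'a>(
    store: &'a SymbolStore,
    locale: &'static str,
    explicit_nu: Option<&'static str>,
) -> Option<&'a DecimalSymbols> {
    // Step 1: use the explicit numbering system directly as the marker attribute.
    if let Some(nu) = explicit_nu {
        if let Some(symbols) = store.get(&(locale, Some(nu))) {
            return Some(symbols);
        }
    }
    // Step 2: otherwise (or if step 1 found nothing), load with no marker attribute.
    let default_key: (&'static str, Option<&'static str>) = (locale, None);
    store.get(&default_key)
}
```

If the explicit request fails, the returned payload reports the locale default instead, so the resolved preference never has to be computed separately.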

First Day of the Week

https://github.com/unicode-org/icu4x/blob/450bb445e32149e2825a055b5ebd8490b4b6694b/components/calendar/src/week_of.rs#L53

Here, we don't even read -u-fw (bug!) and always load the payload. Example data:

https://github.com/unicode-org/icu4x/tree/icu%402.0.0-beta2/provider/source/data/debug/calendar/CalendarWeekV2

Note that this is a region-based data marker.
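For comparison, a sketch of what honoring `-u-fw` could look like: check the explicit keyword first, and only fall back to the region-based data when it is absent or unrecognized. All names are hypothetical, and the `match` in `region_first_day` is an illustrative stand-in for the real region-keyed payload, not actual data coverage.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Weekday {
    Monday,
    Tuesday,
    Wednesday,
    Thursday,
    Friday,
    Saturday,
    Sunday,
}

/// Parses a `-u-fw` keyword value such as "mon" or "sun".
pub fn parse_fw(value: &str) -> Option<Weekday> {
    match value {
        "mon" => Some(Weekday::Monday),
        "tue" => Some(Weekday::Tuesday),
        "wed" => Some(Weekday::Wednesday),
        "thu" => Some(Weekday::Thursday),
        "fri" => Some(Weekday::Friday),
        "sat" => Some(Weekday::Saturday),
        "sun" => Some(Weekday::Sunday),
        _ => None,
    }
}

/// Illustrative stand-in for the region-based payload (two rows only).
pub fn region_first_day(region: &str) -> Weekday {
    match region {
        "US" => Weekday::Sunday,
        _ => Weekday::Monday,
    }
}

/// The explicit preference wins; the region default is the fallback.
pub fn resolve_first_day(fw: Option<&str>, region: &str) -> Weekday {
    fw.and_then(parse_fw)
        .unwrap_or_else(|| region_first_day(region))
}
```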

Calendar

https://github.com/unicode-org/icu4x/blob/450bb445e32149e2825a055b5ebd8490b4b6694b/components/calendar/src/any_calendar.rs#L1062

The resolution logic is hard-coded, and incorrectly so (see https://github.com/unicode-org/icu4x/pull/6325). There is a comment saying that we should eventually make it data-driven.

Collator

https://github.com/unicode-org/icu4x/blob/450bb445e32149e2825a055b5ebd8490b4b6694b/components/collator/src/comparison.rs#L152

Similar to the numbering system, we first attempt to load with a marker attribute equal to the explicit collation type, and if that fails, we load with no marker attribute. Presumably we can then determine which flavor of collation we resolved to.

Example data:

https://github.com/unicode-org/icu4x/tree/icu%402.0.0-beta2/provider/source/data/debug/collator/CollationMetadataV1

Discussion

Numbering System, Hour Cycle, and Collator are all basically the same: load the payload that you were going to load anyway to do the thing you need the attribute for, and figure out the preference from that. If you need any other special logic to correctly resolve the preference, do it at datagen time and emit the correct set of locales.

This strategy doesn't really work for the calendar, because there isn't a single payload we always load. Each calendar needs a different payload, or no payload at all.

Another key difference is that the calendar system is region-based, whereas most of these preferences (except first day of week) are language-based. (There is an ongoing discussion involving @richgillam and others about moving more of these preferences to be region-based.)

What should we do moving forward?

I'm not here to express an opinion. I am here to present my research and solicit other people's opinions.

@Manishearth @zbraniecki @robertbastian

sffc avatar Mar 20 '25 02:03 sffc

The closest way to model the calendar like the other preferences would be to add a new data marker called something like CalendarPreferenceV1, which contains a single string equal to the preferred calendar. It would be a region-based string, and any likely subtags would be resolved at datagen time.
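Roughly, the data-driven option could look like the sketch below. `CalendarPreferences` and its methods are hypothetical names, the rows are illustrative rather than generated from CLDR, and the sketch assumes the region has already been resolved (likely subtags applied at datagen time, as described above).

```rust
use std::collections::BTreeMap;

/// Hypothetical payload for a `CalendarPreferenceV1`-style marker:
/// one preferred calendar string per region.
pub struct CalendarPreferences {
    by_region: BTreeMap<&'static str, &'static str>,
}

impl CalendarPreferences {
    /// Illustrative rows only; real data would be produced at datagen time.
    pub fn sample() -> Self {
        let by_region = BTreeMap::from([
            ("001", "gregorian"),
            ("TH", "buddhist"),
            ("IR", "persian"),
        ]);
        Self { by_region }
    }

    /// An explicit `-u-ca` wins; otherwise look up the region,
    /// then the world default "001".
    pub fn resolve(&self, explicit_ca: Option<&'static str>, region: &str) -> &'static str {
        if let Some(ca) = explicit_ca {
            return ca;
        }
        self.by_region
            .get(region)
            .or_else(|| self.by_region.get("001"))
            .copied()
            .unwrap_or("gregorian")
    }
}
```

The cost relative to hard-coding is one extra data load on the constructor path, which is the efficiency concern raised in the next paragraph.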

Or, we can keep it hard-coded, which is probably more efficient given that this is a fairly hot code path, and just test it better.

There are advantages and disadvantages to both approaches.

sffc avatar Mar 20 '25 02:03 sffc

I think hardcoding it is probably better, and we should test it. We can test it against CLDR data as well.
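As a sketch of what "hard-code it and test it" could mean: keep a plain `match`, and have a test walk a reference table and report every disagreement. The function names and the rows in the `match` are hypothetical and illustrative; a real test would feed in rows extracted from CLDR's calendar preference data.

```rust
/// Hypothetical hard-coded resolver (illustrative rows, not the real ICU4X table).
pub fn hardcoded_calendar(region: &str) -> &'static str {
    match region {
        "TH" => "buddhist",
        "AF" | "IR" => "persian",
        "SA" => "islamic-umalqura",
        _ => "gregorian",
    }
}

/// Compares the hard-coded table against reference rows of (region, expected calendar)
/// and returns every (region, expected, actual) triple that disagrees.
pub fn mismatches<'a>(
    reference: &[(&'a str, &'static str)],
) -> Vec<(&'a str, &'static str, &'static str)> {
    reference
        .iter()
        .filter_map(|&(region, expected)| {
            let actual = hardcoded_calendar(region);
            (actual != expected).then_some((region, expected, actual))
        })
        .collect()
}
```

A test would then assert that `mismatches(&cldr_rows)` is empty, so a CLDR update that changes a region's preferred calendar fails loudly instead of silently diverging.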

Manishearth avatar Mar 20 '25 21:03 Manishearth

Putting in 2.0-stretch in case we want to change anything. (A likely conclusion is that "status quo is fine".)

sffc avatar Apr 25 '25 00:04 sffc

  • @sffc Should we try to centralize this logic in icu_locale? It's what I originally had in mind early in the project.
  • @Manishearth I don't see much value in centralizing the logic. These preferences are defined in BCP-47, but that is the extent of their similarities. I think a use-case-specific solution, like we currently have, is fine.
  • @sffc I wish we had at least some principles here; it seems we don't really have any.
  • @Manishearth I think we have principles because we arrived at a principled thing, which is the data model for these things. It has to do with what is efficient data-wise. We put these into either attributes or markers, and we decide which based on how we want people to slice the data.
  • @sffc Okay, so if we put them into markers, then we default to code-based fallback, like we do for calendar? And otherwise we use provider-based fallback?
  • @Manishearth I was thinking more about the data layout.
  • @Manishearth If we want to add CalendarPreferenceV1, we can probably do it in 2.x. And if we can't, it's probably not that important.
  • @sffc I think the next time we introduce a new component that uses a new Unicode extension keyword, we should revisit this issue and perhaps write down the principles.

sffc avatar May 01 '25 17:05 sffc