icu4x icon indicating copy to clipboard operation
icu4x copied to clipboard

Add icu_preferences util

Open zbraniecki opened this issue 3 years ago • 37 comments

Fixes #419.

This is an early WIP that finally captures what I had in mind for #419. It also creates a foundation for #594 (the data will have to come from DataProvider).

zbraniecki avatar Apr 30 '22 02:04 zbraniecki

@Manishearth do you have any suggestion on how to approach handling of ca key from Locale for selection of calendar? With the current approach of typed calendars it seems to me that we can only handle ca key in AnyCalendar. Is that correct?

zbraniecki avatar May 07 '22 21:05 zbraniecki

@Manishearth do you have any suggestion on how to approach handling of ca key from Locale for selection of calendar? With the current approach of typed calendars it seems to me that we can only handle ca key in AnyCalendar. Is that correct?

Yes, this is correct, and the plan is that the "regular" DateTimeFormat type will handle AnyCalendars, and what we currently call DateTimeFormat will instead be called TypedDateTimeFormat or something

Manishearth avatar May 08 '22 00:05 Manishearth

Notice: the branch changed across the force-push!

  • Cargo.lock is different
  • Cargo.toml is different
  • utils/preferences/Cargo.toml is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

Notice: the branch changed across the force-push!

  • Cargo.lock is different
  • Cargo.toml is different
  • utils/preferences/Cargo.toml is different
  • utils/preferences/src/lib.rs is different
  • utils/preferences/src/macros.rs is different
  • utils/preferences/src/preferences.rs is now changed in the branch
  • utils/preferences/tests/dtf.rs is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

@sffc, @Manishearth - requesting early feedback on the discussed model.

I'm introducing generic Preferences that can be constructed from Locale, and each component can define its own XPreferences trait that implements additional getters and XPreferencesBag + ResolvedXPreferencesBag which are structs that implement it.

I focused only on architecture and DX so far and I think it starts looking quite nice. I really like how simple the constructor of the component looks like, I like that a lot of retrievals are lazy, I like the Resolved* for storing because it means that internally DTF will have all the fields accessible.

One hiccup I found so far architecturally is that if I pass Locale as an impl of DTFPreferences then hour_cycle should be fallible because in the value of the keyword may not be a valid enum variant. Not sure how to workaround it. I'd love to avoid making those getters fallible, I'd love to avoid having to eagerly resolve preferences etc. but maybe we'll have to? Not sure.

The other imperfection I see is that ResolvedXPreferencesBag stores lid which is unnecessary for defaults, so I assume that DataProvider will have DefaultXPreferencesBag that will contain just the non-lid fields for lid key. The DTF constructor will then ask DP for the defaults, and build Resolved from Default + Prefs.

I haven't looked at all at performance/memory of this approach. I'd appreciate your early feedback on that as well since I'm trying to avoid switching to that mode until DX is complete.

zbraniecki avatar May 15 '22 01:05 zbraniecki

I think this looks pretty promising!

Manishearth avatar May 16 '22 19:05 Manishearth

Notice: the branch changed across the force-push!

  • Cargo.lock is different
  • Cargo.toml is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

Notice: the branch changed across the force-push!

  • Cargo.lock is different
  • utils/preferences/Cargo.toml is different
  • utils/preferences/src/macros.rs is different
  • utils/preferences/src/preferences.rs is different
  • utils/preferences/tests/dtf.rs is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

@sffc - updated based on our HLD. Unfortunately, we didn't capture notes so I was working from memory. Please let me know how it looks to you now and if there's something I missed.

zbraniecki avatar Jun 02 '22 21:06 zbraniecki

I have some scattered thoughts. But before I get into detail on those, I want to step back and look at what the resulting DTF constructor looks like. I'd like to start there and work backwards.

I see several paths, all of which have pros and cons:

  1. Into<Locale> + provider + options bag (what we have now)
  2. Locale + provider + options bag
  3. struct DTFPreferenceBag + provider + options bag (what is being proposed)
  4. trait DTFPreferences + provider + options bag
  5. trait LocaleLike + provider + options and preferences bag (put prefs into the same bag with options)
  6. struct DataLocale + provider + options and preferences bag

Options 5 and 6 are more aligned with my mental model coming in. It is also what ECMA-402 is doing. Inside the constructor, we resolve the locale with the fields of the bag that are preferences using the same logic as ECMA-402 (favor options bag, then locale, then default fallback):

new Intl.DateTimeFormat("en").resolvedOptions().calendar
// 'gregory'
new Intl.DateTimeFormat("en-u-ca-hebrew").resolvedOptions().calendar
// 'hebrew'
new Intl.DateTimeFormat("en-u-ca-hebrew", { calendar: "japanese" }).resolvedOptions().calendar
// 'japanese'

My understanding is that @zbraniecki's main concern with using a locale to carry preferences is that we may want to scale to preferences that don't fit into BCP-47. My response to that is that trait LocaleLike or struct DataLocale can support such preferences; they need not be tied to BCP-47.

Note: struct DataLocale can be what is currently known as ResourceOptions; see #1995.

sffc avatar Jun 06 '22 21:06 sffc

struct DTFPreferenceBag + provider + options bag (what is being proposed)

This is not what I propose here. I propose just DTFPreferenceBag + provider.

it is intentional, as:

  • Locale is a subset of Preferences
  • Conversion of Locale to Preferences is fallible so I want it external to constructor

trait DTFPreferences + provider + options bag

Why would we do that? I did it before and the problem is that the trait is mainly useful to have an implicit From<Locale> which is fallible, so it's better to have it as explicit TryFrom.

trait LocaleLike + provider + options and preferences bag (put prefs into the same bag with options)

What would it give us? This makes Locale a superset of Preferences, which it is not.

struct DataLocale + provider + options and preferences bag

How is DataLocale different from DTFPreferencesBag? Is that about having a single struct or trait for all components? I have concerns about scaling of that solution.

zbraniecki avatar Jun 07 '22 02:06 zbraniecki

struct DTFPreferenceBag + provider + options bag (what is being proposed)

This is not what I propose here. I propose just DTFPreferenceBag + provider.

it is intentional, as:

  • Locale is a subset of Preferences
  • Conversion of Locale to Preferences is fallible so I want it external to constructor

Where do non-preference options fit in, like date style, components bag, etc? (This is what I meant by "options bag")

trait DTFPreferences + provider + options bag

Why would we do that? I did it before and the problem is that the trait is mainly useful to have an implicit From<Locale> which is fallible, so it's better to have it as explicit TryFrom.

We discussed this as an option in our meeting and the idea was so that foreign types could implement the trait and be compatible with ICU4X, but I think it's infeasible if every formatter has a separate trait, so I agree; we can dismiss this option.

trait LocaleLike + provider + options and preferences bag (put prefs into the same bag with options)

What would it give us? This makes Locale a superset of Preferences, which it is not.

There would be a single trait LocaleLike for all formatters that foreign types could implement, and which we could implement for Locale, LanguageIdentifier, LocaleStr, etc.

struct DataLocale + provider + options and preferences bag

How is DataLocale different from DTFPreferencesBag? Is that about having a single struct or trait for all components? I have concerns about scaling of that solution.

Yes, a single struct for all components.

I would like to understand more about your concern of scaling that solution. As far as I know, your only concern was the ability to support non-BCP-47 values. I think is a tractable problem that can be supported in a single struct: in the future, we add more fields (public or private) to hold more structured data.

Please note that the data provider framework wants everything in a single standard format (be it Locale or DataLocale or otherwise). It wants to work the same way across components.

sffc avatar Jun 07 '22 03:06 sffc

Note: my envisioned data model for DataLocale is a Language Identifier + map from Unicode Extension Key to Unicode Extension Value. (This is exactly what ResourceOptions is currently.)

sffc avatar Jun 07 '22 03:06 sffc

Where do non-preference options fit in, like date style, components bag, etc? (This is what I meant by "options bag")

They are part of DTFPreferencesBag. Locale can be either TryInto preferences or merged into it (both operations are fallible).

I would like to understand more about your concern of scaling that solution. As far as I know, your only concern was the ability to support non-BCP-47 values.

It is not my only one.

My concern is that operating on a union of all possible fields that any component may need is going to lead to awkward DX and limit DCE. It will also mean that such type has to keep getting extended for every option that any component needs and that requires revision of icu_preferences.

Note: my envisioned data model for DataLocale is a Language Identifier + map from Unicode Extension Key to Unicode Extension Value. (This is exactly what ResourceOptions is currently.)

We discussed that maybe in the future we'll want to add such "union" type, but I'd prefer not to and I'd prefer not to start with it.

The type needed by DP is suboptimal for internal use by components - in particular, an open ended key-value list can contain invalid values. One of the benefits of the design I'm providing here is that the validation of the values is enforced when working with DTFPreferencesBag and validated explicitly when converted/merged from Locale. It means that the type inside DTF is guaranteed to contain valid values.

I understand that for DP that is not the case, and I recognize that we will want to add such LID + KV map type for DP use.

zbraniecki avatar Jun 07 '22 03:06 zbraniecki

Where do non-preference options fit in, like date style, components bag, etc? (This is what I meant by "options bag")

They are part of DTFPreferencesBag. Locale can be either TryInto preferences or merged into it (both operations are fallible).

OK, so this model essentially removes the locale as an explicit argument?

Have we considered the implications of this on people knowing how and where to pass in the locale? I think it's a good design choice on the part of both ICU4C and ECMA-402 to basically always require a standalone locale argument.

My concern is that operating on a union of all possible fields that any component may need is going to lead to awkward DX and limit DCE.

I don't see how this affects either DX or DCE.

  1. For DX, the developer should put programmatic preferences in the options bag (third argument). The locale only carries the user's preferences.
  2. For DCE, irrelevant preferences in the locale are simply dropped/ignored. We only carry code to process the relevant ones, which is code we need to carry anyway.

It will also mean that such type has to keep getting extended for every option that any component needs and that requires revision of icu_preferences.

It's a map. More items can be added to the map without requiring code revision, so long as they conform to BCP-47.

The type needed by DP is suboptimal for internal use by components - in particular, an open ended key-value list can contain invalid values. One of the benefits of the design I'm providing here is that the validation of the values is enforced when working with DTFPreferencesBag and validated explicitly when converted/merged from Locale. It means that the type inside DTF is guaranteed to contain valid values.

That's a nice benefit, I agree. I also think that this case is fine to stick in an error variant if the values are invalid.

I understand that for DP that is not the case, and I recognize that we will want to add such LID + KV map type for DP use.

Understand that whatever type we add here needs to have a conversion into the DP type right off the bat. I see this as a primary motivator of the signature of formatter constructors.

sffc avatar Jun 07 '22 04:06 sffc

OK, so this model essentially removes the locale as an explicit argument?

Correct.

Have we considered the implications of this on people knowing how and where to pass in the locale? I think it's a good design choice on the part of both ICU4C and ECMA-402 to basically always require a standalone locale argument.

I'm not convinced anymore, hence this design.

My take is that Locale input from the perspective of formatters is fallible. en-us-hc-h98 is a correct Locale but the incorrect hour cycle must be handled in some way. If we accept Locale then the DTF has to decide how to handle that, either implicitly rejecting, implicitly correcting, or providing additional options to decide how to handle invalid values.

In the API I'm proposing this is moved outside of the constructor giving developers ability to decide how to handle failures and allowing us to provide more nuanced methods like .merge_leniently or .merge_strictly etc.

I'm growing belief that Locale is just a formation of preferences such as language, script, region, hour_cycle, calendar, date_style etc. I don't see reason we shouldn't allow, even in ECMA-402 like API for:

new Intl.DateTimeFormat("en-US", {
  region: "CA",
});

For DX, the developer should put programmatic preferences in the options bag (third argument). The locale only carries the user's preferences.

What is a programmatic preference in your mental model? Is hour_cycle a programmatic preference or user preference?

I argue that they are all user preferences, just provided in two ways (preferences bag vs locale) and merged.

It is also lossless one way - you can create preferences bag from locale which is a fallible operation, but the other way may be lossy, but infallible.

I agree that it is nice to allow for both - locale and preferences bag - to be provided, but I don't think they need to be provided to the constructor. It is sufficient that they can be merged prior to construction.

zbraniecki avatar Jun 07 '22 04:06 zbraniecki

Thanks for explaining your mental model more.

I think it is a valid model, and a perspective to consider, but I'm not yet convinced that it is superior to the proven traditional model, and I'm hesitant to use ICU4X as a guinea pig to validate it.

My take is that Locale input from the perspective of formatters is fallible. en-us-hc-h98 is a correct Locale but the incorrect hour cycle must be handled in some way. If we accept Locale then the DTF has to decide how to handle that, either implicitly rejecting, implicitly correcting, or providing additional options to decide how to handle invalid values.

To me, this seems like a solution without a real-world problem:

  1. Misspelled subtags are far down the list of problems that I encounter in the i18n space; BCP-47 strings are usually generated by either a machine or a power user.
  2. In cases where we do come across a misspelled subtag, the behavior is well-defined and understood: drop the subtag, just like you do when performing vertical fallback.
  3. The average developer won't know how to resolve this sort of error when it comes up; it seems like the kind of thing that is well suited to a "best effort" approach.
  4. For developers who really want this, it seems perfectly reasonable to have an external function for validating the semantic content of extension keywords; it could go in the locale_canonicalizer component and be driven by the CLDR validity data.

I'm growing belief that Locale is just a formation of preferences such as language, script, region, hour_cycle, calendar, date_style etc.

Yes, but I don't see how this particular belief translates into the structured all-in-one preferences bag proposal.

What is a programmatic preference in your mental model? Is hour_cycle a programmatic preference or user preference?

In my mental model, the bag of preferences comes from the operating system in the form of a BCP-47 string, a well-understood, extremely popular serialization format. It may contain -u-hc, indicating the user's preferred default hour cycle, which may differ from the one used in their region.

However, hour_cycle, as specified in the options bag, is the programmer specifying that they want a specific style of output. This setting overrides the locale preference.

So, -u-hc is the user override, and hour_cycle is the programmer override.

I agree that it is nice to allow for both - locale and preferences bag - to be provided, but I don't think they need to be provided to the constructor. It is sufficient that they can be merged prior to construction.

A major reason why I still think it's wise to keep the locale in the constructor signature is that it reminds the developer where they need to inject the locale. If they just provide a bag of options, they are at high risk of specifying date style and time style, for instance, and unintentionally leaving language, region, etc. at their default values. This problem could be partially mitigated by forcing DTFPreferenceBag to be built with a locale in some way, such as with a builder pattern, constructors, etc., at the expense of some ergonomics.

Making it clear where developers need to put their locale parameter is, in my mind, the most basic tenant of i18n library design.

sffc avatar Jun 07 '22 06:06 sffc

To me, this seems like a solution without a real-world problem:

Architecture of an API should be designed along the axis of fallible/infallible and lossy/lossless. Locale as is is fallible and lossy compared to Preferences.

I don't understand your argument of "solution without a problem" - we have a "problem" - we need to design API to encode information and provide DX. I provide a solution that I see as superior to what has been done, reflecting the ICU4X API composition and easily usable from ECMA-402 APIs.

In my mental model

I think your mental model is incomplete here.

"preferences bag" does not come from OS. It can come from developer choice, or from user preferences stored in the application.

Locale can come from OS, or from third-party system or can be stored as a user preference in the app as well. Both can have multiple sources and the locale comes from parent, preferences come from developer is but one of multiple models possible.

The API I'm proposing allows for composition of those models including one you evaluate, but also allowing for others including ones where developer has preferences, and user has preferences, and parent provides preferences and allows for chained multi-level combination of those in any order that the developer prefers allowing them to decide how to handle fallibility and lossiness.

The model used by ECMA-402 does not allow for any of that - it forces one architecture, forces particular failure resolution and doesn't allow for composition.

A major reason why I still think it's wise to keep the locale in the constructor signature is that it reminds the developer where they need to inject the locale.

That's a good point. I'll think about how to incorporate that into the DX.

zbraniecki avatar Jun 07 '22 19:06 zbraniecki

Architecture of an API should be designed along the axis of fallible/infallible and lossy/lossless. Locale as is is fallible and lossy compared to Preferences.

I disagree with this statement on two separate axes:

  1. Fallible APIs should be designed to give developers errors that are useful and actionable. I argued above that "misspelled BCP-47 subtags" is not a real-world problem, and errors involving it are therefore neither useful nor actionable. We should spend our DX budget wisely, and I think forcing developers to reason about malformed BCP-47 is not a good use of that budget.
  2. I further claim that Locale is not fallible and lossy compared to Preferences. I come from my background in Vertical Fallback, where we have very clear, well-defined behavior for dropping subtags that don't match what we know about the world. "en-u-hc-h25" looks invalid today, but it could mean something tomorrow, so we just ignore it when processing the extensions.

"preferences bag" does not come from OS. It can come from developer choice, or from user preferences stored in the application.

What I'm saying is that the user is represented by a thing that we typically call a locale, which today is typically a BCP-47 string, but which could contain additional structured data in the future. That locale should be computed once per user and then fed into every ICU4X API. However, at each individual call site, the developer could make overrides that affect display/formatting, and those go into the options bag.

The API I'm proposing allows for composition of those models including one you evaluate, but also allowing for others including ones where developer has preferences, and user has preferences, and parent provides preferences and allows for chained multi-level combination of those in any order that the developer prefers allowing them to decide how to handle fallibility and lossiness.

The model used by ECMA-402 does not allow for any of that - it forces one architecture, forces particular failure resolution and doesn't allow for composition.

I'm saying that in cases where the default behavior is undesirable, the developer can perform this processing externally, such as with a new function on the locale canonicalizer.

I would go one step further to say that this type of processing should be performed externally, once per user upon application startup, such that each call site needs to resolve between only two things: locale preferences and call-site-specific overrides.

sffc avatar Jun 07 '22 23:06 sffc

Notice: the branch changed across the force-push!

  • Cargo.lock is different
  • Cargo.toml is different
  • components/locid/src/extensions/unicode/keywords.rs is now changed in the branch
  • utils/preferences/src/macros.rs is different
  • utils/preferences/tests/dtf.rs is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

Notice: the branch changed across the force-push!

  • Cargo.lock is different
  • Cargo.toml is different
  • components/locid/src/extensions/unicode/keywords.rs is different
  • utils/preferences/src/macros.rs is different
  • utils/preferences/tests/dtf_tests.rs is now changed in the branch
  • utils/preferences/tests/dtf.rs is no longer changed in the branch
  • utils/preferences/tests/dtf/data_provider.rs is now changed in the branch
  • utils/preferences/tests/dtf/mod.rs is now changed in the branch
  • utils/preferences/tests/dtf/options.rs is now changed in the branch
  • utils/preferences/tests/dtf/preferences.rs is now changed in the branch

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

Notice: the branch changed across the force-push!

  • Cargo.lock is different
  • Cargo.toml is different
  • components/locid/src/extensions/unicode/keywords.rs is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

Latest revision updated:

  • Separated Prefs from Options
  • Added all ECMA-402 DTF options

I generally am not sure if we need a macro for options - I kept it in macro because I thought there may be some value in ability to merge between prefs and options but I am growing confident that those two can stay separate all the way.

@sffc - lmk if that fits your mental model.

zbraniecki avatar Jun 20 '22 17:06 zbraniecki

Discussion:

  • @echeran - The proposal creates a whole lot of static typing. Not sure it's worth it. Not a fan of loc.into() because it's very magic; From is more explicit but less concise.
  • @markusicu - It's a big change and I'm a bit hesitant. I do think it's valueable to be able to express multiple sources of preferences information. On the Locale, it's common in client-side programming to use the default locale from the environment; ICU defaults a null locale to the environment locale. But we've also regretted that for server-side programming. I'm not a fan of loc.into(); making the type more explicit seems better. It also looks like at every site you have a service-specific preferences bag; if you do that, then why not merge preferences into options?
  • @robertbastian - I think Zibi's proposal is great; I'm okay with loc.into() because it is Rusty. It seems useful to express what our library can do with a fully expanded struct.
  • @zbraniecki - I think it's okay for the API to nudge the developer in the right direction. I think default should definitely be outside the constructor. For separating the preferences and options, they come from two very different sources. Merging them mixes them into a mixed key/value list.
  • @markusicu - What happens if someone doesn't merge a locale into the preferences?
  • @zbraniecki - It becomes language und. There is some headroom for design here.
  • @markusicu - Where do you draw the line between preferences and options?
  • @zbraniecki - One comes from the user's cultural customs; the other is the developer. It is a separation of concerns.
  • @markusicu - It's not always a black and white decision. Sometimes you want to be colloquial, and sometimes you want to be explicit. If you're displaying a train schedule, you might want to force 24-hour time.

sffc avatar Jun 30 '22 18:06 sffc

Notice: the branch changed across the force-push!

  • Cargo.lock is different
  • Cargo.toml is different
  • components/locid/src/extensions/unicode/keywords.rs is different
  • components/locid/src/langid.rs is now changed in the branch
  • components/locid/src/subtags/variants.rs is now changed in the branch
  • utils/preferences/src/lib.rs is different
  • utils/preferences/src/macros.rs is different
  • utils/preferences/src/preferences.rs is no longer changed in the branch
  • utils/preferences/tests/dtf_tests.rs is different
  • utils/preferences/tests/dtf/data_provider.rs is different
  • utils/preferences/tests/dtf/mod.rs is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

Notice: the branch changed across the force-push!

  • components/datetime/src/options/preferences.rs is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

Notice: the branch changed across the force-push!

  • Cargo.lock is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

CLA assistant check
All committers have signed the CLA.

CLAassistant avatar Sep 01 '22 06:09 CLAassistant

Let's revisit this in 2023 and make a great preferences solution both in ICU4X and EMCAScript.

sffc avatar Nov 10 '22 18:11 sffc

I unbitrot this PR, and refreshed it. It seems to be encompassing my mental model. I also added tests to verify that in the future some external handling of OS preferences can be ergonomically added to this architecture.

The age of this PR and the accumulated comments make it very hard to understand what should be the next step here.

I'd like to suggest that I document this proposal and we organize a deep dive on it for upcoming ICU4X meeting, to understand what is left to do.

zbraniecki avatar Nov 22 '23 21:11 zbraniecki