
Inconsistent locale options for constructors

Open mihnita opened this issue 1 month ago • 23 comments

I would expect the code below to work:

use icu::locale::locale;

use icu::datetime::fieldsets::YMD;
use icu::datetime::DateTimeFormatter;

use icu::segmenter::WordSegmenter;

fn main() {
    let _formatter = DateTimeFormatter::try_new(
        locale!("en").into(),
        YMD::medium()
    );

    let _segmenter = WordSegmenter::try_new_auto(
        locale!("en").into()
    );
}

But it fails with

error[E0277]: the trait bound `WordBreakOptions<'_>: From<Locale>` is not satisfied
  --> src\main.rs:15:9
   |
15 |         locale!("en").into()
   |         ^^^^^^^^^^^^^ ---- required by a bound introduced by this call
   |         |
   |         the trait `From<Locale>` is not implemented for `WordBreakOptions<'_>`
   |
   = note: required for `Locale` to implement `Into<WordBreakOptions<'_>>`

mihnita avatar Dec 01 '25 03:12 mihnita

Following the example works:

use icu::locale::langid;
use icu::segmenter::options::WordBreakOptions;
...

  let mut options = WordBreakOptions::default();
  let langid = &langid!("en");
  options.content_locale = Some(langid);
  let segmenter = WordSegmenter::try_new_auto(options).unwrap();

But why the (gratuitous?) inconsistency?


WordSegmenter

https://docs.rs/icu/2.1.1/icu/segmenter/struct.WordSegmenter.html#method.try_new_auto

pub fn try_new_auto(
    options: WordBreakOptions<'_>,
) -> Result<WordSegmenter, DataError>

pub struct WordBreakOptions<'a> {
    pub content_locale: Option<&'a LanguageIdentifier>,
    pub invariant_options: WordBreakInvariantOptions,
}

DateTimeFormatter

https://docs.rs/icu/2.1.1/icu/datetime/struct.DateTimeFormatter.html#method.try_new

pub fn try_new(
    prefs: DateTimeFormatterPreferences,
    field_set_with_options: FSet,
) -> Result<DateTimeFormatter<FSet>, DateTimeFormatterLoadError>

pub struct DateTimeFormatterPreferences {
    pub locale_preferences: LocalePreferences,
    pub numbering_system: Option<NumberingSystem>,
    pub hour_cycle: Option<HourCycle>,
    pub calendar_algorithm: Option<CalendarAlgorithm>,
}

pub struct LocalePreferences { /* private fields */ }

What I would expect:

  • both DateTimeFormatter & WordSegmenter take some kind of *Options or *Preferences, not the current mixed names (DateTimeFormatterPreferences / WordBreakOptions)
  • the options bag should name the locale part the same: locale_preferences or content_locale
  • and the locale part should take the same type langid / locale / LocalePreferences / LanguageIdentifier

Because the ideal would be that once you know how to do one i18n operation, you know them all (more or less).

mihnita avatar Dec 01 '25 03:12 mihnita

"options" and "preferences" are inherently different:

  • preferences are derived from user preferences, which is why they all implement From<Locale>
  • options are the developer-chosen options for an API. As the identifier content_locale suggests, that option is the (optional) language of the content, not the locale of the user

Some APIs accept both options and preferences: https://docs.rs/icu/latest/icu/collator/struct.Collator.html#method.try_new
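
A minimal sketch of the distinction described above, using hypothetical stand-in types (`Locale`, `FormatterPreferences`, `BreakOptions`, and the two constructor functions are illustrative only, not the ICU4X definitions). The preferences type implements `From<Locale>`; the options type deliberately does not, which is the shape that produces the E0277 error quoted at the top of this issue.

```rust
/// Hypothetical stand-in for a user locale (not `icu::locale::Locale`).
#[derive(Debug, Clone, PartialEq)]
pub struct Locale {
    pub language: String,
}

/// Hypothetical preferences bag: derived from the user's locale.
#[derive(Debug, PartialEq)]
pub struct FormatterPreferences {
    pub language: String,
}

impl From<Locale> for FormatterPreferences {
    fn from(locale: Locale) -> Self {
        FormatterPreferences { language: locale.language }
    }
}

/// Hypothetical options bag: chosen by the developer.
/// There is intentionally no `impl From<Locale>` for this type.
#[derive(Debug, Default, PartialEq)]
pub struct BreakOptions<'a> {
    pub content_language: Option<&'a str>,
}

/// Accepts preferences, so a `Locale` can be passed with `.into()`.
pub fn new_formatter(prefs: FormatterPreferences) -> String {
    format!("formatter({})", prefs.language)
}

/// Accepts options only; the caller has to state the content language.
pub fn new_segmenter(options: BreakOptions<'_>) -> String {
    format!("segmenter({})", options.content_language.unwrap_or("und"))
}

/// Hypothetical helper standing in for "the locale of the current user".
pub fn user_locale() -> Locale {
    Locale { language: "en".to_string() }
}
```

With these definitions, `new_formatter(user_locale().into())` compiles, while `new_segmenter(user_locale().into())` is rejected by the compiler because `BreakOptions` has no conversion from `Locale`.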

robertbastian avatar Dec 01 '25 16:12 robertbastian

"options" and "preferences" are inherently different:

The names don't communicate anything about that, they are just generic.

that option is the (optional) language of the content

That's completely inconsistent with everything else.

And it is not really the language of the content, it is the language of the segmenter. I can use a French segmenter on Japanese text. I will get crap results, but it is still the language of the segmenter. That is why it is passed to the constructor of the WordSegmenter. If it were the language of the content, it would be passed to the segment* methods, together with the content.

Last, even if you want to look at it as the language of the content, it is clunky to use different types (Locale vs LanguageIdentifier). In ICU4C / 4J you can just use Locale (and in Java ULocale) everywhere.

I understand why ICU4X created 2 different types. But I should be able to use them everywhere (the way in ICU4J I can use ULocale / Locale (almost) everywhere). In ICU Locale is from JDK, ULocale is ICU4J type, so the two are "from outside", ICU itself is consistent (in this respect).


There might be technical / implementations justifications for all of this. But as a user it does not matter. It makes the APIs unfriendly and not easily discoverable. Consistency is good.

mihnita avatar Dec 01 '25 17:12 mihnita

This design is the result of an extensive discussion about what "locale" means for text-oriented components like segmenter. You can read the discussion here: #3284

I just recently merged some improved docs: #7136

The observed error is doing its job of informing the client that the user locale should not be used when configuring a segmenter. The locale should instead be a hint derived from the text content.

sffc avatar Dec 01 '25 18:12 sffc

@mihnita, can you verify if the improved docs in #7136 clarify this for you, in case you hadn't seen them yet (they weren't included in the 2.1 release)?

sffc avatar Dec 01 '25 18:12 sffc

can you verify if the improved docs in https://github.com/unicode-org/icu4x/pull/7136 clarify this for you

No, it does not, not much.

It is good as documentation, but I should not have to read it.

In my opinion the ideal collection of APIs allows me to move between various functional areas without carefully reading pages and pages of documentation. Once I know how to create a date formatter and a collator, I also know how to create a number formatter, list formatter, word segmenter, etc.

And the difference in APIs might make sense for implementation, but it is gratuitous for a user.

Ultimately in the current design WordSegmenter takes a LanguageIdentifier, and DateFormatter a Locale. And Locale = LanguageIdentifier + extensions.

But LDML defines extensions that affect segmentation:

key   key description
dx    Dictionary break script exclusions
lb    Line break style
lw    Line break word handling
ss    Sentence break suppressions

https://www.unicode.org/reports/tr35/#Key_And_Type_Definitions_

It is very-very convenient to pass a locale to segmenter and "magically" have all preferences honored

Let's look at LineBreakOptions

pub struct LineBreakOptions<'a> {
    pub strictness: Option<LineBreakStrictness>,
    pub word_option: Option<LineBreakWordOption>,
    pub content_locale: Option<&'a LanguageIdentifier>,
}

If I get a locale (that's usually what you get, think a request to a server) and I want to do line breaking I have to take that locale, split the LanguageIdentifier part of it to use in content_locale, and inspect the extensions to set the strictness and word_option.

That is clunky and is not forward compatible.

If (for example) LDML adds another extension that affects segmentation, and I start getting that in requests, my server using icu4x can't "magically" honor it. Because I didn't know I needed to inspect that extension, and map it to some kind of LineBreakWordOption. I have to update my code.

If there would be a way to create a LineBreakOptions from a Locale (for example) that would work.
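
A rough sketch of what such a conversion could look like. Everything here is hypothetical: `LineOptions`, `Strictness`, `WordOption`, and `line_options_from_tag` are illustrative names, not ICU4X API, and the parsing is deliberately simplified (it only looks at the `-u-` extension and assumes no other extensions follow it).

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Strictness { Loose, Normal, Strict }

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum WordOption { Normal, BreakAll, KeepAll }

/// Hypothetical options bag, loosely modeled on the struct quoted above.
#[derive(Debug, Default, PartialEq)]
pub struct LineOptions {
    pub strictness: Option<Strictness>,
    pub word_option: Option<WordOption>,
    pub content_language: Option<String>,
}

/// Builds options from a tag such as "ja-u-lb-strict-lw-keepall".
/// Keywords other than `lb` and `lw` are silently ignored.
pub fn line_options_from_tag(tag: &str) -> LineOptions {
    let lower = tag.to_ascii_lowercase();
    // Everything before "-u-" is treated as the content language part.
    let (base, ext) = match lower.split_once("-u-") {
        Some((base, ext)) => (base, ext),
        None => (lower.as_str(), ""),
    };
    let mut options = LineOptions {
        content_language: Some(base.to_string()),
        ..Default::default()
    };
    // Walk adjacent subtag pairs and pick out the key/value pairs we know.
    let subtags: Vec<&str> = ext.split('-').collect();
    for pair in subtags.windows(2) {
        match (pair[0], pair[1]) {
            ("lb", "loose") => options.strictness = Some(Strictness::Loose),
            ("lb", "normal") => options.strictness = Some(Strictness::Normal),
            ("lb", "strict") => options.strictness = Some(Strictness::Strict),
            ("lw", "normal") => options.word_option = Some(WordOption::Normal),
            ("lw", "breakall") => options.word_option = Some(WordOption::BreakAll),
            ("lw", "keepall") => options.word_option = Some(WordOption::KeepAll),
            _ => {}
        }
    }
    options
}
```

The point of the sketch is the forward-compatibility argument: if a new segmentation keyword were added, only this one function would need a new match arm, not every caller.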


But as it is the APIs are inconsistent, and seem to expose decisions taken for implementation reasons.

Why not allow segmenters to take a Locale, and the implementation can ignore the extension part, for example?

You can read the discussion here: https://github.com/unicode-org/icu4x/issues/3284

I did. And I see that both Mark and Markus made the same argument for Locale in segmenters that I did. And they've been overridden without much explanation other than "we can always add it later".

"Add it later" is an easy way out for a library that can't make decisions looking at the big picture.

It comes at a cost for the developers: they must change their code to get correct functionality (see above about "forward compatible").


Nitpick: in segmenter(s) the field is named content_locale, but it is not a Locale, it is a LanguageIdentifier.

mihnita avatar Dec 01 '25 21:12 mihnita

Thank you for the comments. Much appreciated. My reply below attempts to provide background on how ICU4X arrived at the current design, without taking a position on whether the design is "right" or "wrong".


You can read the discussion here: #3284

I did. And I see that both Mark and Markus made the same argument for Locale in segmenters that I did. And they've been overridden without much explanation other than "we can always add it later".

As chair of the TC, I take great pride in our open decision-making process by consensus. I reviewed the thread, and I don't see an example of Mark or Markus being "overridden without much explanation". There were 10 people who contributed to the discussion, from multiple Unicode groups. I encourage you to share feedback privately on how this decision-making process can be improved.

But LDML defines extensions that affect segmentation:

This is true. This is a point that was not raised in the thread.

My attempt at an explanation, which may or may not be satisfactory: ICU4X and ECMA-402 have both decided to support a subset of the locale extension keywords. They do not support those that are more about developer preference over user preference. The two line break options ICU4X supports were determined to be more about the developer preference, and therefore associating them with the user locale would have been the wrong design.

As for why content_locale is a LanguageIdentifier: because it identifies the language of the content, nothing more. It is not a user locale with user preferences. It simply tags text as being written in a particular language.

Nitpick: in segmenter(s) the field is named content_locale, but it is not a Locale, it is a LanguageIdentifier.

Sure, maybe content_language would have been a better name.


My personal opinion: I've seen enough evidence in ECMA-402 that the Intl.Segmenter locale option is often used the wrong way, passing a user locale instead of a text content locale. (Intl.Collator has the same problem.) I believe that content_locale helps solve it.

sffc avatar Dec 02 '25 23:12 sffc

I believe that content_locale helps solve it.

I strongly doubt that. It is just a name...

Even that's weird. content_locale has "locale" in name. But takes a langid, not a locale.

And it really depends how you look at it.

When I create a segmenter("th") and I try to segment Japanese I will get crap results. The "th" there is the locale of the segmenter, not of the content.

When I created it it loaded Thai segmentation rules. It IS a Thai segmenter. The same way I spell check French content with a French dictionary. If the locale of the dictionary and the locale of the content match, I get good results. And I would argue that this is how most people think. Changing that pattern might help some, and confuse others.

If this is the locale of the content then that would be passed with the content:

let segmenter = WordSegmenter::new();
let seg = segmenter.segment(text, locale);

The thing is, if decisions are taken piece-meal, one at the time, no matter how reasonable in isolation, the resulting API is a hot mess.

Let's take a look:

CaseMapper

let cm = CaseMapper::new();
cm.uppercase_to_string("hello world", &langid!("en"))

There is a langid (not a locale), and it is passed with the content, not with the CaseMapper


Collator

pub fn try_new(
    prefs: CollatorPreferences,
    options: CollatorOptions,
) -> Result<CollatorBorrowed<'static>, DataError>

So it takes Preferences and Options. These terms are so close that the separation is basically meaningless. I choose an option or another based on my preferences :-)

But let's go on:

pub struct CollatorPreferences {
    pub locale_preferences: LocalePreferences,
    pub collation_type: Option<CollationType>,
    pub case_first: Option<CollationCaseFirst>,
    pub numeric_ordering: Option<CollationNumericOrdering>,
}

So here we have a locale_preferences, not content_locale. Why? I'm sorting content, and this is the locale of that content. Same argument you offered for Segmenters. But does not apply to Collator, for some reason.


ListFormatter

pub fn try_new_and(
    prefs: ListFormatterPreferences,
    options: ListFormatterOptions,
) -> Result<ListFormatter, DataError>

pub struct ListFormatterPreferences {
    pub locale_preferences: LocalePreferences,
}

So this is consistent with Collator.


PluralRules

pub fn try_new(
    prefs: PluralRulesPreferences,
    options: PluralRulesOptions,
) -> Result<PluralRules, DataError>

pub struct PluralRulesPreferences {
    pub locale_preferences: LocalePreferences,
}

So this is consistent with Collator and ListFormatter.


DecimalFormatter

pub fn try_new(
    prefs: DecimalFormatterPreferences,
    options: DecimalFormatterOptions,
) -> Result<DecimalFormatter, DataError>


pub struct DecimalFormatterPreferences {
    pub locale_preferences: LocalePreferences,
    pub numbering_system: Option<NumberingSystem>,
}

So this is consistent with Collator, ListFormatter, and PluralRules.


WordSegmenter

pub fn try_new_auto(
    options: WordBreakOptions<'_>,
) -> Result<WordSegmenter, DataError>

pub struct WordBreakOptions<'a> {
    pub content_locale: Option<&'a LanguageIdentifier>,
    pub invariant_options: WordBreakInvariantOptions,
}

So this one has no *Preferences, only *Options. The locale part is in Options, not in Preferences as in all the other APIs. The locale setting is called content_locale, not locale_preferences, inconsistent with the other APIs. And despite the name (content_locale) the type is LanguageIdentifier. That is an internal inconsistency. I would expect that the type of something called foo_locale is a Locale.

What's the difference between segmenters and collator, to justify these big differences?


Like everything else in software (and not only) there is no perfect solution, it is all a matter of compromises. Speed, size, modularity, ease of use, consistency, etc.

Consistency also helps with ease of use.

The documentation for icu4x is really good, with lots of examples.

But I don't want to read it unless I don't understand how to use something. And "how we decided this" is something that I never want to read, as a user. I might read some "why do it this way" if I am really interested in i18n and in that library.

Ideally, if I am already familiar with a library, I should understand how to use a new piece of that library without reading docs or history.

mihnita avatar Dec 03 '25 16:12 mihnita

You can read the discussion here: https://github.com/unicode-org/icu4x/issues/3284

I reviewed (again) that thread.

Everybody said "locale", in all comments. I don't see anything there about langid vs locale, content_locale vs locale_preferences, Options vs Preferences, etc.

So in the end that discussion didn't help clarify the reasons for the API as it is.

mihnita avatar Dec 03 '25 17:12 mihnita

"Preferences" is a user locale with structured fields, introduced in ICU4X 2.0. "Options" are things that should be set by the developer based on application requirements, not user preferences. This naming is used consistently across all components.

The content locale is in the constructor of Segmenter because it impacts data loading, and in the terminal function of CaseMapper because it does not impact data loading. This did result in an unfortunate inconsistency. https://github.com/unicode-org/icu4x/issues/3234#issuecomment-1600648567
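
To illustrate the data-loading argument, here is a toy sketch with hypothetical types (`Segmenter` and `CaseMapper` below are simplified stand-ins, not the ICU4X implementations, and the "data" is a hard-coded word list). The segmenter selects language-specific data once, in its constructor; the case mapper holds nothing language-specific, so the language can be supplied per call.

```rust
// Toy "data files"; a real library would load these through a data provider.
const THAI_WORDS: &[&str] = &["สวัสดี", "ครับ"];
const NO_WORDS: &[&str] = &[];

/// Hypothetical segmenter: language data is chosen at construction time.
pub struct Segmenter {
    dictionary: &'static [&'static str],
}

impl Segmenter {
    pub fn try_new(content_language: &str) -> Result<Self, String> {
        let dictionary = match content_language {
            "th" => THAI_WORDS,
            "en" => NO_WORDS,
            other => return Err(format!("no data for {}", other)),
        };
        Ok(Segmenter { dictionary })
    }

    /// Uses only the data selected in the constructor.
    pub fn knows_word(&self, word: &str) -> bool {
        self.dictionary.contains(&word)
    }
}

/// Hypothetical case mapper: no per-language data to load up front.
pub struct CaseMapper;

impl CaseMapper {
    pub fn new() -> Self {
        CaseMapper
    }

    /// The language arrives with the text. Only ASCII letters and the
    /// Turkish dotted i are handled in this toy version.
    pub fn uppercase(&self, text: &str, language: &str) -> String {
        text.chars()
            .map(|c| match (language, c) {
                ("tr", 'i') => 'İ',
                _ => c.to_ascii_uppercase(),
            })
            .collect()
    }
}
```

One `CaseMapper` can serve any language without being recreated, whereas a `Segmenter` built for "th" would have to be rebuilt to pick up different data.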

In Collator, the user locale matters sometimes; for example, you might have a contact list with names from various different languages, but you sort them according to your language. https://github.com/unicode-org/icu4x/issues/6033

I put this issue into the 3.0 milestone when we can re-evaluate these questions.

sffc avatar Dec 03 '25 17:12 sffc

Segmenter and case mapper are two exceptions where we use content languages instead of user locales, because it makes sense in those contexts (they should not be sensitive to user preferences). Yes the field should have probably been called content_language. Minor.

the resulting API is a hot mess

We've spent a lot of time designing this, and the API we have come up with is principled and satisfies our constraints. You were not part of any of those discussions, so you don't get to say things like this.

robertbastian avatar Dec 03 '25 18:12 robertbastian

Yes the field should have probably been called content_language. Minor.

If anything, the name content_locale is fine; the type should have been Locale. As I explained before, the segmentation is affected by some extensions, which are missing in a langid, but are present in a locale.


Segmenter and case mapper are two exceptions where we use content languages instead of user locales, because it makes sense in those contexts (they should not be sensitive to user preferences).

You say that they are exceptions, and are kind-of the same. But the APIs are not the same.

This is the CaseMapper:

let cm = CaseMapper::new();
cm.uppercase_to_string("hello world", &langid!("en"))

That makes sense, the CaseMapper is locale neutral, and the content is passed on with a locale.

If the segmenter was the same, it would have looked like this:

let segmenter = WordSegmenter::new();
let seg = segmenter.segment(text, locale);

But it does not.

The locale (or langid) is not passed with the content. It is passed at construction. Same as for Collator.


Let's put it differently... The icu4x "thing" doing the work is "a box", the content (text) is "ball" and locale/language is "color"

The CaseMapping API:

create a box              // CaseMapper::new()
put a red ball in the box // cm.uppercase_to_string("hello world", &langid!("en"))

What is the color of the ball (the language of the content)? RED, we know, because we said so when we stored the ball. And it is clear and matches the intuition.

Now the Segmenter API:

create a red box      // let segmenter = WordSegmenter::new(...options with lang = "th"...)
put a ball in the box // segmenter.segment("hello world")

What is the color of the ball (the language of the content)? Well, we have no clue!

OK, let's twist it a bit more and say that the parameter we pass to the box creation is the color of the balls.

create a box for red balls
put a ball in the box

What is the color of the ball (the language of the content)? Well, we have no clue! Nothing stopped me from storing a blue ball in the box for red balls. Is it wrong? Yes, it is. Same as trying to segment or collate Japanese text with a Thai segmenter or collator.

Because "Thai" is a property of the collator / segmenter, not of the content.


Conceptually, the Collator is like the segmenter, which in turn is like the CaseMapper.

Would you agree with that statement?

But the 3 APIs are all different. This is the inconsistency I am complaining about in this ticket.

mihnita avatar Dec 04 '25 01:12 mihnita

I appreciate the feedback on the inconsistency. As I noted in https://github.com/unicode-org/icu4x/issues/7261#issuecomment-3608046864, the inconsistency is the result of technical constraints, due to differences in how Segmenter and CaseMapper load their data, as well as choices that were made regarding the locale extension keywords. We can explore ways to make the API design more intuitive in the next major release (3.0).

sffc avatar Dec 04 '25 02:12 sffc

the resulting API is a hot mess

Please don't take a small text fragment out of context to make it sound worse than it is.

The full text is "The thing is, if decisions are taken piece-meal, one at the time, no matter how reasonable in isolation, the resulting API is a hot mess."

This is 100% true, taken without context, not necessarily about this library or these APIs.


You were not part of any of those discussions, so you don't get to say things like this.

Not being part of the discussions makes me a regular user. Which means I can express my opinion, and say "things like this". I brought reasonable arguments for it, with examples, comparing APIs next to each other.

If the attitude is "you don't get to say anything because you were not part of the discussions", then that applies to pretty much all users except the handful that were part of the spec.

You will not hear anything from me from now on, about this or anything else.

mihnita avatar Dec 04 '25 03:12 mihnita

As I explained before, the segmentation is affected by some extensions, which are missing in a langid, but are present in a locale.

And in our design this doesn't matter. We do not subscribe to the philosophy of encoding every possible argument in a single magic string/locale. We differentiate between preferences, which are globally controlled by the user/OS, and are encoded as Locales, and options, which are not controlled by the user and should not be part of a Locale. I don't know a single operating system that lets users set their "line break style" or "dictionary break script exclusions", so these are not in the preferences space. Which makes sense, because segmenters don't even consider user preferences.

Segmentation options, when implemented, will be typed fields on WordBreakOptions (and the content language is one part of these options). We want to offer a strongly typed API; making the content language a Locale would add a second set of options, as untyped strings, that would conflict (and it would make it not a content language anymore but a full "segmenter identity").
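
The conflict mentioned here can be made concrete with a small sketch. All names (`Strictness`, `Resolved`, `resolve`) are hypothetical and only model the situation: one value comes from a typed option field, the other from an `lb` keyword in a locale string, and nothing guarantees that they agree.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Strictness { Loose, Normal, Strict }

/// Outcome of combining a typed option with a locale keyword.
#[derive(Debug, PartialEq)]
pub enum Resolved {
    Unset,
    Value(Strictness),
    Conflict { typed: Strictness, from_tag: Strictness },
}

/// Simplified lookup of the `lb` keyword in a lowercase tag.
fn strictness_in_tag(tag: &str) -> Option<Strictness> {
    let (_, ext) = tag.split_once("-u-")?;
    let subtags: Vec<&str> = ext.split('-').collect();
    subtags.windows(2).find_map(|pair| match (pair[0], pair[1]) {
        ("lb", "loose") => Some(Strictness::Loose),
        ("lb", "normal") => Some(Strictness::Normal),
        ("lb", "strict") => Some(Strictness::Strict),
        _ => None,
    })
}

/// Reports a conflict instead of silently picking a winner.
pub fn resolve(typed: Option<Strictness>, tag: &str) -> Resolved {
    match (typed, strictness_in_tag(tag)) {
        (None, None) => Resolved::Unset,
        (Some(s), None) | (None, Some(s)) => Resolved::Value(s),
        (Some(a), Some(b)) if a == b => Resolved::Value(a),
        (Some(typed), Some(from_tag)) => Resolved::Conflict { typed, from_tag },
    }
}
```

Any API that accepted both sources would need some rule for the `Conflict` case; which rule is right is exactly the design question being debated in this thread.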

The locale (or langid) is not passed with the content. It is passed at construction.

I think we already explained that this is due to data loading. We could have made it a constructor argument for the case mapper, but then people would have to unnecessarily recreate case mappers. It's a consistency trade-off that we took.

Please don't take a small text fragment out of context to make it sound worse than it is.

As a user you can share constructive criticism, but you can do this without name calling. I'm failing to see the constructive part here though. Getting rid of the distinction between preferences and options is not going to happen, this is a core part of ICU4X design. Making the content locale be Locale is also inconsistent with this. Do you have an actual API proposal?

robertbastian avatar Dec 04 '25 10:12 robertbastian

If the attitude is "you don't get to say anything because you were not part of the discussions", then that applies to pretty much all users except the handful that were part of the spec.

I clearly did not say "you don't get to say anything". The attitude is "you don't get to insult other people's work, only your own", specifically referring to the "hot mess" statement, which I very much read as applying to our API, even if you wanted to use it generally.

robertbastian avatar Dec 04 '25 10:12 robertbastian

(as a service to everyone involved, I will lock this thread if the conversation continues to be heated.)

sffc avatar Dec 04 '25 14:12 sffc

Thank you Shane! I wanted to propose a de-escalation myself.

mihnita avatar Dec 04 '25 15:12 mihnita

I would attribute some of this heated discussion to the fact that we all want to see icu4x be the best it can be. I understand Robert's passion, as one of the owners. And if I didn't care I would have left this thread long ago, instead of trying to get my point across.

Mix that with cultural differences. As an Eastern European I would have said "this API is a hot mess", if that is what I meant :-) As an American I would have made it a "praise sandwich", with a highly toned-down negative in the middle :-) But I see how my phrasing can be a problem. I'll try to do better.

Not being face to face (even on video or audio) does not help either (not hearing the tone, not seeing the face). Sure, face to face might also make it worse sometimes :-)

So maybe let's restart?

If it helps, these are some of the "clues" I try to use: when I sprinkle in "I think" or "in my opinion", it is not because I don't know, but to tone things down. I also try to add smileys and winks when I don't mean something seriously.


I will start with the bigger picture, before going back to the concrete APIs. And I will re-iterate on some of the points that I tried to make and are not addressed (I think).

As we all know, writing software is a juggle between many priorities, a give and take, of compromises.

In some of my previous projects we saved a lot of arguments and grief by writing them down, explicitly. For example: correctness comes first. Then our users (developers). Then size, or performance, then x, y, z. Whatever.

It also helps in these kinds of discussions. Instead of pointing to a github issue with many people chiming in, one can point to "the principles".

Does icu4x have such a thing?

For example, I don't remember ever seeing an API that takes options and preferences in the same method. To me, as a non-native English speaker, a function (constructor) would take options, which I, as the developer (sometimes channeling the user), set based on my preferences.

mihnita avatar Dec 04 '25 16:12 mihnita

High level / principles

We differentiate between preferences, which are globally controlled by the user/OS, and are encoded as Locales, and options, which are not controlled by the user and should not be part of a Locale.

Fair enough.

Since this is not a pattern I've seen before, it seems like a special icu4x innovation / improvement. Is it documented anywhere?

I think we already explained that this is due to data loading. We could have made it a constructor argument for the case mapper, but then people would have to unnecessarily recreate case mappers. It's a consistency trade-off that we took.

A "principles" document would also help with this. We sacrificed some of the API consistency for performance (if that's the case).

We want to offer a strongly typed API; making the content language a Locale would add a second set of options, as untyped strings, that would conflict (and it would make it not a content language anymore but a full "segmenter identity").

We do not subscribe to the philosophy of encoding every possible argument in a single magic string/locale

Yes, it would. And that's actually something that bothers me in the current ECMAScript Intl. There are things that can be specified both as a preference, and as an extension on the locale.

BUT! Is this documented anywhere? As a generic principle? If there is a generic rule to say "everything should be explicitly specified as an option" (or preference? I don't know). But "we never care about magic string/locale"

Then no API should take a locale. Why would some APIs care about "magic string/locale", and some don't? "No OS passes that info that way" is not an argument. That way we can end up with things like "u-nu-" is on the locale, and it's OK, but '-u-hc' is not OK. And since various OSes are inconsistent in this space, we end up with APIs that arbitrarily choose what locale extensions to honor and what not.

I am tempted to say "locale everywhere", with all the extensions honored.

This is a Unicode project, the locale is Unicode spec (LDML). Extensions and all. So a locale is not a magic string, it is very well defined.

And I can explain use cases in detail.

But the "magic string" in locale is very-very handy for communication between otherwise separated layers.

Example: client calling a server. Java / Dart / Python calling Rust. I can do (client / one layer) "getLocale" and pass that in the request to the server/other layer. I don't need to gather all the options: "hey, server, render this page for me! And, btw, I prefer Bengali digits, Buddhist calendar, first day of week Monday, 24h format, ..." and so on, 20 separate options that a dev must gather manually and explicitly add to the request.

It also helps with forward compatibility. If there is a new extension added to LDML, and icu4x adds support for it, then me (the dev) don't have to update all my code everywhere to pass it as an explicit option. It just "magically" moves between all kinds of layers as a locale, which I already move.

I don't know a single operating system that lets users set their "line break style" or "dictionary break script exclusions", so these are not in the preferences space.

There might be no such OS today, but there might be tomorrow. Or there might be today, and we don't know it.

For example Android already encodes some of the user preferences as locale extensions (nu, mu, ms, fw, proposed hc)

Which makes sense, because segmenters don't even consider user preferences.

OK, maybe. But maybe it's a chicken and egg problem? Can we know that? By limiting icu4x, are we also limiting future clients? For example, would you say that collators do consider user preferences?

mihnita avatar Dec 04 '25 16:12 mihnita

About these APIs

Once there are clear principles, one would expect that all the APIs can be explained by the principles. There are no exceptions, because if performance is above API consistency in the principles, then the inconsistency between case conversion / segmenter is explained by that.

So let's get to an example that I asked about more than once: would you consider that Collator and Segmenter are very very similar?

Both require a lot more data than case mapping. And both are about the language of the content to be segmented / compared.

Would there be any reason for the APIs to be different?

mihnita avatar Dec 04 '25 16:12 mihnita

Since this is not a pattern I've seen before, it seems like a special icu4x innovation / improvement. Is it documented anywhere?

https://docs.rs/icu_locale_core/latest/icu_locale_core/preferences/index.html

Then no API should take a locale.

And none do! All our APIs take typed preference objects, which can be constructed from locales. Each domain-specific preference type parses out the relevant parts of the locale, but the strings are not persisted.
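
As a rough illustration of that idea (hypothetical types and a simplified parser, not the actual `icu_locale_core` preferences machinery): each preference type reads only the keywords that belong to its domain, stores them as typed fields, and drops the rest of the string.

```rust
/// Simplified keyword lookup inside the `-u-` extension of a lowercase tag.
fn keyword<'a>(tag: &'a str, key: &str) -> Option<&'a str> {
    let (_, ext) = tag.split_once("-u-")?;
    let mut subtags = ext.split('-');
    while let Some(subtag) = subtags.next() {
        if subtag == key {
            return subtags.next();
        }
    }
    None
}

/// Language part of the tag: everything before the `-u-` extension.
fn language(tag: &str) -> String {
    tag.split_once("-u-").map_or(tag, |(base, _)| base).to_string()
}

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum HourCycle { H12, H23 }

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum NumberingSystem { Latn, Beng }

/// Hypothetical date preferences: only cares about `hc`.
#[derive(Debug, PartialEq)]
pub struct DatePrefs {
    pub language: String,
    pub hour_cycle: Option<HourCycle>,
}

/// Hypothetical number preferences: only cares about `nu`.
#[derive(Debug, PartialEq)]
pub struct NumberPrefs {
    pub language: String,
    pub numbering_system: Option<NumberingSystem>,
}

impl From<&str> for DatePrefs {
    fn from(tag: &str) -> Self {
        let hour_cycle = match keyword(tag, "hc") {
            Some("h12") => Some(HourCycle::H12),
            Some("h23") => Some(HourCycle::H23),
            _ => None,
        };
        DatePrefs { language: language(tag), hour_cycle }
    }
}

impl From<&str> for NumberPrefs {
    fn from(tag: &str) -> Self {
        let numbering_system = match keyword(tag, "nu") {
            Some("latn") => Some(NumberingSystem::Latn),
            Some("beng") => Some(NumberingSystem::Beng),
            _ => None,
        };
        NumberPrefs { language: language(tag), numbering_system }
    }
}
```

Given the tag "bn-u-nu-beng-hc-h23", `DatePrefs` keeps only the hour cycle and `NumberPrefs` keeps only the numbering system; a keyword that no preference type knows about (such as `dx` in this sketch) is simply not carried forward.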

Example: client calling a server. Java / Dart / Python calling Rust. I can do (client / one layer) "getLocale" and pass that in the request to the server/other layer. I don't need to gather all the options: "hey, server, render this page for me! And, btw, I prefer Bengali digits, Buddhist calendar, first day of week Monday, 24h format, ..." and so on, 20 separate options that a dev must gather manually and explicitly add to the request.

-u-nu, -u-ca, -u-fw, -u-hc are all preferences and are honoured as part of a string locale (see for example DateTimeFormatterPreferences). -u-dx is just not.

If there is a new extension added to LDML, and icu4x adds support for it, then me (the dev) don't have to update all my code everywhere to pass it as an explicit option.

That's why preferences convert from locales. But you as a dev don't want your segmentation to change behaviour because a locale outside of your control changed and the underlying library started reading some flag from that. The behaviour of a segmenter should be fully specified by the developer, because they have built a complex text pipeline on top of it.

For example Android already encodes some of the user preferences as locale extensions (nu, mu, ms, fw, proposed hc)

And all of those are preferences in ICU4X.

would you consider that Collator and Segmenter are very very similar?

No. I expect my Android phone book's sorting to respect my system settings, such as locale and collation settings (if available). I expect my Android text rendering to not change based on system locale.

robertbastian avatar Dec 04 '25 18:12 robertbastian

https://docs.rs/icu_locale_core/latest/icu_locale_core/preferences/index.html

Thank you very much, that is a good read. I've read it twice. And I have to say that overall the icu4x is very well documented.

It does not mean I agree with the take :-), but it is a fair position to take.

Even if I don't agree, I am also not saying it is wrong :-)

It is a decision informed by a "philosophical" position on what icu4x is. If we look at it as "icu4x is a library used to implement ECMAScript 402", then it is OK. But I see icu4x as a generic i18n library. Can be used to implement JS i18n functionality, but can be embedded in an application, or be part of an OS. Can be on a watch, or on a phone, or on a desktop.

So any kind of differentiation between OS / developer / user preference is lost. A library has no way to make that distinction. If I'm an OS developer, and wrap icu4x under my own API (the way macOS does with icu4c), then certain things are OS preferences. But if the OS exposes some of that in some kind of OS Settings app, then it becomes a user preference. If I'm a dev embedding icu4x in my app, and expose nothing to the user, it is a dev preference. But if I expose it in an app settings dialog, it is a user preference.

So a generic library has no way to make these kinds of distinctions.


-u-nu, -u-ca, -u-fw, -u-hc are all preferences and are honoured as part of a string locale (see for example DateTimeFormatterPreferences). -u-dx is just not.

OK, why not? I argue that it should be.
And if it is not, maybe it is not now, but maybe it will be in the future.

The "knob" I use to pass that info should be a locale, so that the info about dx is not lost.

That's why preferences convert from locales. But you as a dev don't want your segmentation to change behaviour because a locale outside of your control changed and the underlying library started reading some flag from that.

Actually, that is exactly what many developers want!

When I use "ar-DZ" I expect that the proper digits are used for that country, and I expect that to change if the country decides to change them. I expect the DST to change, the case mapping, I expect many, many things to change. That's why I use an i18n library: to not worry about these things.

And one can make the same argument about other u extensions. Why change based on nu or hc, but not on dx? That's inconsistent.

For example Android already encodes some of the user preferences as locale extensions (nu, mu, ms, fw, proposed hc)

And all of those are preferences in ICU4X

EXACTLY!

Why those, and not others? To a user of the library that seems inconsistent and random.

No. I expect my Android phone book's sorting to respect my system settings, such as locale and collation settings (if available). I expect my Android text rendering to not change based on system locale.

As I explained above, the distinction between "system setting" and other such setting is not relevant at library level.

And in fact both examples are incorrect.

The collation does not usually change based on some kind of system setting, it is determined by use case. There is (was) a German phone book sort because it is used for (surprise) phone books. Japanese has pronunciation, and radical-stroke count, and iroha sort, but which one to use is not something that can be configured system wide, it depends on use case.

And text rendering changes based on the content locale, not the system locale. If there is no info about the content locale, then yes, the system locale is used. Even more interesting, it depends on the complete list of system locales.

One of the changes we made in Android N was to change how the font fallback was done.

Until N, if my system was set to English, any Traditional Chinese or Japanese text (without locale info) was rendered using a Simplified Chinese font. (Because statistically Kanji => Simplified Chinese)

With N one can specify more than one locale. If I configure my system locale preferences to [en-TW, zh-TW] the Chinese text is rendered with a Traditional Chinese font. Similar for Japanese. So the second locale in the list affects rendering.
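A toy sketch of that idea, in plain Rust. This is not Android's actual fallback code: `HanFont`, `han_font_for` and `pick_han_font` are hypothetical names, and the rule (first locale in the list that implies a CJK variant wins, Simplified Chinese otherwise) is an assumption made only for illustration.

```rust
/// Hypothetical font variants for Han text (illustration only).
#[derive(Debug, PartialEq, Clone, Copy)]
enum HanFont {
    SimplifiedChinese,
    TraditionalChinese,
    Japanese,
    Korean,
}

/// Returns the Han font variant a single locale implies, if any.
/// Simplified: looks only at the language plus a few script/region subtags.
fn han_font_for(tag: &str) -> Option<HanFont> {
    let mut parts = tag.split('-');
    let lang = parts.next()?.to_ascii_lowercase();
    let rest: Vec<String> = parts.map(|s| s.to_ascii_lowercase()).collect();
    match lang.as_str() {
        "ja" => Some(HanFont::Japanese),
        "ko" => Some(HanFont::Korean),
        "zh" => {
            let traditional = rest
                .iter()
                .any(|s| matches!(s.as_str(), "hant" | "tw" | "hk" | "mo"));
            Some(if traditional {
                HanFont::TraditionalChinese
            } else {
                HanFont::SimplifiedChinese
            })
        }
        _ => None, // e.g. "en-TW" says nothing about which Han font to use
    }
}

/// Walks the ordered locale list; the first locale with an opinion wins.
/// Falls back to Simplified Chinese, mirroring the old single-locale default.
fn pick_han_font(locales: &[&str]) -> HanFont {
    locales
        .iter()
        .find_map(|l| han_font_for(l))
        .unwrap_or(HanFont::SimplifiedChinese)
}
```

Under these assumptions, `pick_han_font(&["en-TW", "zh-TW"])` yields `HanFont::TraditionalChinese`, while `pick_han_font(&["en"])` falls back to `HanFont::SimplifiedChinese`.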

And this change was most welcomed by the millions of users that were forced to see Simplified Chinese for their language, even if not appropriate.

Are there situations where one does not want this behavior?

Yes. But it is not the role of a low-level library to decide that. If anything, being consistent helps: as a dev I know what to expect, because everything works the same.

If a dx override is present on a locale then it should be honored. The library has absolutely no way to decide if it is right or wrong. As a developer I expect that if I put it there, the library will respect it.

mihnita avatar Dec 09 '25 09:12 mihnita