icu4x Add support for Unicode BCP 47 locale identifiers

Certain ECMA-402 tests are failing because ICU4X considers any Unicode CLDR locale identifier as a valid Locale, whereas ECMA-402 is pretty explicit on isStructurallyValidLanguageTag that:

locale does not use any of the backwards compatibility syntax described in Unicode Technical Standard #35 LDML § 3.3 BCP 47 Conformance;

Which is equivalent to an Unicode BCP 47 locale identifier. This causes problems on tests like:

https://github.com/tc39/test262/blob/main/test/intl402/Intl/getCanonicalLocales/invalid-tags.js

The only differences between the two syntaxes are (taken from https://unicode.org/reports/tr35/#BCP_47_Conformance):

The "_" character for field separator characters, as well as the "-" used in [BCP47] (however, the canonical form is with "-")

The subtag "root" to indicate the generic locale used as the parent of all languages in the CLDR data model ("und" can be used instead)

The language tag may begin with a script subtag rather than a language subtag. This is specialized use only, and not required for CLDR conformance.

It would be great to have either a feature flag to enable the stricter syntax or an additional constructor on Locale that can be used when you need to parse Unicode BCP 47.

Apr 16 '23 04:04 jedel1043

The only difference in our handling is that we're accepting _. You can enforce more strict model by doing:

    fn parse_bcp47(input: &str) -> Result {
        if input.contains('_') {
            return Err(ParserError::InvalidSubtag);
        }
        input.parse::<LanguageIdentifier>()
    }

I tested it against the test262 corpus and it passed.

I'm not sure if I want to support separate constructor, or option. I feel like ECMA-402 approach is breaking the "be lenient in what you accept, be strict in what you produce" model and I don't see a significant value in it.

Apr 16 '23 05:04 zbraniecki

Oh, thought you fully implemented the Unicode CLDR locale identifier syntax, not only underscore handling. That should be a very easy fix from our side then. Though, wouldn't that mean that Locale doesn't comply with CLDR locale identifiers nor with Unicode BCP 47 locale identifiers?

Apr 16 '23 05:04 jedel1043

Technically, you are correct. We support Unicode BCP 47 locale identifier plus _ on input.

Apr 16 '23 05:04 zbraniecki

At the very list we should add doc comment about Unicode CLDR compatibility (and ECMA-402 compatibility). PR welcome!

Apr 16 '23 06:04 zbraniecki

We discussed this in https://github.com/tc39/ecma402/issues/777 and decided that ECMA-402 will continue rejecting underscores for the time being until we have more compelling evidence to make the change.

It looks like the only action needed here then is documentation.

Nov 16 '23 22:11 sffc

I think there are a few directions we could take on this issue:

Document that the only difference between ECMA-402 acceptable locale identifiers and ICU4X acceptable locale identifiers is the _
Add an API for parsing locale identifiers according to ECMA-402
In 2.0, change the default ICU4X behavior to reject underscores

Thoughts @zbraniecki @jedel1043 ?

Nov 21 '23 19:11 sffc

sgtm

Nov 21 '23 19:11 zbraniecki

sgtm

Which option? 0, 1, or 2?

Nov 21 '23 19:11 sffc

I'd say 1 > 2 > 0, just because handling _ separately makes engines do double work, since they have to first traverse the string to check for _, then traverse the string to parse the locale.

Nov 21 '23 20:11 jedel1043

Ah, I assumed you're proposing all three. My bad.

I think we can do 0 and 1 now, and then consider 2 in 2.0.

How does it sound?

Nov 22 '23 17:11 zbraniecki

I don't necessarily want to spend time doing (1) if we intend to follow with (2). There isn't that much time until 2.0

Nov 22 '23 18:11 sffc

But (2) is not definitive, right?

I remember someone mentioned in the last meeting that some people were using _ when trying to create locales, and rejecting that separator is just an ECMA-402 requirement. In my opinion, it would be preferable to have a more flexible parsing as the default, and a stricter parsing for Intl implementors.

Nov 22 '23 18:11 jedel1043

Well we also want to grow into supporting other locale syntaxes such as AR_EG@NUMBERS=LATN which is a legacy locale syntax from ICU4C, so we probably want a BCP-47 parse function plus an all-encompassing legacy parse function. Not sure if we want or need the middle parse function that supports _.

Nov 22 '23 18:11 sffc

I would like to ask for explicit consensus on the following proposal:

Do not take action on this issue until 2.0, but we would accept a PR adding documentation for the existing behavior.
In 2.0, change the parse function to be stricter and reject underscores.
Open a follow-up issue to add a more flexible locale parsing function that accepts both underscores and legacy ICU syntax like AR_EG@NUMBERS=LATN. Put this issue on the Priority Backlog.

Needs approval from:

[ ] @zbraniecki
[ ] @jedel1043
[x] @Manishearth

Nov 30 '23 19:11 sffc

Sounds good to me!

Nov 30 '23 19:11 jedel1043

handling _ separately makes engines do double work, since they have to first traverse the string to check for _, then traverse the string to parse the locale.

I would not make a decision based on this. Locales are tiny, I don't think iterating twice will be noticeable.

Dec 07 '23 15:12 robertbastian

@zbraniecki - We could add underscore support as a constructor flag. That's fine. I would be reluctant to shove the legacy format... I would like that to be a separate path. That could go into icu_locale_fallback, or icu_locale_transform, ...
@robertbastian - I like the concept of being lenient in what we accept and strict in what we produce. If ECMA wants to reject underscores, they can check first
@zbraniecki - Locale parsing is a perf-critical path. I think a flag doesn't introduce performance penalty. But I would consider double-parsing to be a non-starter on a perf-critical code path.
@robertbastian - I am not in favor of boolean arguments in constructors. We could use an enum but it adds a lot of surface.
@Manishearth - In general I agree with "liberal in what we accept", but I think we shouldn't carry around the @ syntax.
@sffc - I think we can simply add the strict parsing function, like Locale::from_str_strict.
@zbraniecki - This would also require adding from_utf8_strict, from_utf16_strict, ...

Proposal:

impl FromStr for Locale continues accepting underscores
New function Locale::from_[bytes/ascii/utf8]_strict that has ECMA-402 behavior
Parsing of @ syntax does not go into the core crate

LGTM: @robertbastian @sffc @zbraniecki @Manishearth

Mar 14 '24 18:03 Manishearth

icu4x icu4x copied to clipboard

Add support for Unicode BCP 47 locale identifiers

icu4x
icu4x copied to clipboard