icu4x
icu4x copied to clipboard
Add support for Unicode BCP 47 locale identifiers
Certain ECMA-402 tests are failing because ICU4X considers any Unicode CLDR locale identifier as a valid Locale, whereas ECMA-402 is pretty explicit on isStructurallyValidLanguageTag that:
locale does not use any of the backwards compatibility syntax described in Unicode Technical Standard #35 LDML § 3.3 BCP 47 Conformance;
Which is equivalent to an Unicode BCP 47 locale identifier. This causes problems on tests like:
https://github.com/tc39/test262/blob/main/test/intl402/Intl/getCanonicalLocales/invalid-tags.js
The only differences between the two syntaxes are (taken from https://unicode.org/reports/tr35/#BCP_47_Conformance):
- The "_" character for field separator characters, as well as the "-" used in [BCP47] (however, the canonical form is with "-")
- The subtag "root" to indicate the generic locale used as the parent of all languages in the CLDR data model ("und" can be used instead)
- The language tag may begin with a script subtag rather than a language subtag. This is specialized use only, and not required for CLDR conformance.
It would be great to have either a feature flag to enable the stricter syntax or an additional constructor on Locale that can be used when you need to parse Unicode BCP 47.
The only difference in our handling is that we're accepting _. You can enforce more strict model by doing:
fn parse_bcp47(input: &str) -> Result {
if input.contains('_') {
return Err(ParserError::InvalidSubtag);
}
input.parse::<LanguageIdentifier>()
}
I tested it against the test262 corpus and it passed.
I'm not sure if I want to support separate constructor, or option. I feel like ECMA-402 approach is breaking the "be lenient in what you accept, be strict in what you produce" model and I don't see a significant value in it.
Oh, thought you fully implemented the Unicode CLDR locale identifier syntax, not only underscore handling. That should be a very easy fix from our side then. Though, wouldn't that mean that Locale doesn't comply with CLDR locale identifiers nor with Unicode BCP 47 locale identifiers?
Technically, you are correct. We support Unicode BCP 47 locale identifier plus _ on input.
At the very list we should add doc comment about Unicode CLDR compatibility (and ECMA-402 compatibility). PR welcome!
We discussed this in https://github.com/tc39/ecma402/issues/777 and decided that ECMA-402 will continue rejecting underscores for the time being until we have more compelling evidence to make the change.
It looks like the only action needed here then is documentation.
I think there are a few directions we could take on this issue:
- Document that the only difference between ECMA-402 acceptable locale identifiers and ICU4X acceptable locale identifiers is the
_ - Add an API for parsing locale identifiers according to ECMA-402
- In 2.0, change the default ICU4X behavior to reject underscores
Thoughts @zbraniecki @jedel1043 ?
sgtm
sgtm
Which option? 0, 1, or 2?
I'd say 1 > 2 > 0, just because handling _ separately makes engines do double work, since they have to first traverse the string to check for _, then traverse the string to parse the locale.
Ah, I assumed you're proposing all three. My bad.
I think we can do 0 and 1 now, and then consider 2 in 2.0.
How does it sound?
I don't necessarily want to spend time doing (1) if we intend to follow with (2). There isn't that much time until 2.0
But (2) is not definitive, right?
I remember someone mentioned in the last meeting that some people were using _ when trying to create locales, and rejecting that separator is just an ECMA-402 requirement. In my opinion, it would be preferable to have a more flexible parsing as the default, and a stricter parsing for Intl implementors.
Well we also want to grow into supporting other locale syntaxes such as AR_EG@NUMBERS=LATN which is a legacy locale syntax from ICU4C, so we probably want a BCP-47 parse function plus an all-encompassing legacy parse function. Not sure if we want or need the middle parse function that supports _.
I would like to ask for explicit consensus on the following proposal:
- Do not take action on this issue until 2.0, but we would accept a PR adding documentation for the existing behavior.
- In 2.0, change the parse function to be stricter and reject underscores.
- Open a follow-up issue to add a more flexible locale parsing function that accepts both underscores and legacy ICU syntax like
AR_EG@NUMBERS=LATN. Put this issue on the Priority Backlog.
Needs approval from:
- [ ] @zbraniecki
- [ ] @jedel1043
- [x] @Manishearth
Sounds good to me!
handling _ separately makes engines do double work, since they have to first traverse the string to check for _, then traverse the string to parse the locale.
I would not make a decision based on this. Locales are tiny, I don't think iterating twice will be noticeable.
- @zbraniecki - We could add underscore support as a constructor flag. That's fine. I would be reluctant to shove the legacy format... I would like that to be a separate path. That could go into
icu_locale_fallback, oricu_locale_transform, ... - @robertbastian - I like the concept of being lenient in what we accept and strict in what we produce. If ECMA wants to reject underscores, they can check first
- @zbraniecki - Locale parsing is a perf-critical path. I think a flag doesn't introduce performance penalty. But I would consider double-parsing to be a non-starter on a perf-critical code path.
- @robertbastian - I am not in favor of boolean arguments in constructors. We could use an enum but it adds a lot of surface.
- @Manishearth - In general I agree with "liberal in what we accept", but I think we shouldn't carry around the
@syntax. - @sffc - I think we can simply add the strict parsing function, like
Locale::from_str_strict. - @zbraniecki - This would also require adding
from_utf8_strict,from_utf16_strict, ...
Proposal:
impl FromStr for Localecontinues accepting underscores- New function
Locale::from_[bytes/ascii/utf8]_strictthat has ECMA-402 behavior - Parsing of
@syntax does not go into the core crate
LGTM: @robertbastian @sffc @zbraniecki @Manishearth