icu4x
Consider support of old locale extension syntax in locale parsing
We encounter clients that report locales using the "Old Locale Extension Syntax", e.g. ar@numbers=latn. I verified that locid::Locale parsing rejects such strings with the error "The given language subtag is invalid".
Should we consider extending parsing support towards this syntax?
We've previously discussed the degree to which we want to allow lenient parsing of locales. There are several toggles that could go into leniency:
- Whether to accept incorrect case: en-us
- Whether to accept underscores: en_US
- Whether to accept UTS 35 Old Locale Extension Syntax: ar@numbers=latn
I think what we want is a more full-featured locale parsing API. It could be data-driven and live alongside (or within) the LocaleCanonicalizer type.
It's worth noting that ECMA-402 permits the first of these (incorrect case) but not the second (underscores) or the third (old extension syntax).
Some previous related discussion in https://github.com/unicode-org/icu4x/issues/3336, https://github.com/unicode-org/icu4x/issues/1709
CC @zbraniecki
I'm against supporting that as part of icu_locid.
This can be introduced as a standalone conversion library outside of ICU4X.
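For illustration, a minimal sketch of what such a standalone conversion step might look like, using only the Rust standard library. The function name `normalize_legacy_locale` is hypothetical and not part of ICU4X. The key table is a small hard-coded subset of long key names and would presumably be data-driven in a real implementation. Long value names (for example `gregorian`) are not mapped, case in the language/region part is left untouched, and input that already contains a `-u-` extension is not merged.

```rust
/// Hypothetical helper (not an ICU4X API): rewrites a locale string that may use
/// underscores or the old `@key=value;key=value` extension syntax into a
/// BCP 47 style string. Returns `None` for input this sketch cannot handle.
fn normalize_legacy_locale(input: &str) -> Option<String> {
    // Split "ar@numbers=latn" into the base "ar" and the keyword list "numbers=latn".
    let (base, keywords) = match input.split_once('@') {
        Some((b, k)) => (b, Some(k)),
        None => (input, None),
    };
    if base.is_empty() {
        return None;
    }

    // Accept underscores as subtag separators: "en_US" becomes "en-US".
    let mut out = base.replace('_', "-");

    // Collect (short key, value) pairs from the old extension syntax.
    let mut exts: Vec<(&'static str, String)> = Vec::new();
    if let Some(keywords) = keywords {
        for pair in keywords.split(';').filter(|p| !p.is_empty()) {
            let (key, value) = pair.split_once('=')?;
            // Assumed subset of long key names; anything else is rejected.
            let short = match key.to_ascii_lowercase().as_str() {
                "calendar" => "ca",
                "collation" => "co",
                "currency" => "cu",
                "numbers" => "nu",
                _ => return None,
            };
            if value.is_empty() {
                return None;
            }
            exts.push((short, value.to_ascii_lowercase()));
        }
    }

    if !exts.is_empty() {
        // Emit keys in alphabetical order under a single "-u-" extension.
        exts.sort();
        out.push_str("-u");
        for (key, value) in exts {
            out.push('-');
            out.push_str(key);
            out.push('-');
            out.push_str(&value);
        }
    }
    Some(out)
}
```

With this sketch, `normalize_legacy_locale("ar@numbers=latn")` yields `Some("ar-u-nu-latn")`, which could then be handed to the regular strict parser. Unknown keys return `None` rather than being guessed at.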
Discuss with:
- @zbraniecki
- @sffc
Optional:
- @echeran