ICU4X Locale capabilities should enable implementation of CSS lang()
Currently ICU4X Locale supports UTS 35, which excludes parts of BCP 47, specifically "non-canonical", "irregular", or "privateuse".
In https://bugzilla.mozilla.org/show_bug.cgi?id=1857742 , Gecko restored support for these non-UTS 35 deprecated tags in CSS lang(). Unfortunately, a GitHub search shows that lang="en-GB-oed" has made it to some copypasteable template, so chances are that it would actually be a Web compat issue to reject some of the inputs that icu_locale_core::Locale::try_from_utf8 rejects.
We should add a legacy parsing entry point that, compared to icu_locale_core::Locale::try_from_utf8:
- Controllably by a parameter allows the underscore where try_from_utf8 only allows a hyphen.
- Parses inputs that start with x- as if they started with und-x-.
- Performs the substitutions documented at https://github.com/unicode-org/cldr/blob/fe0c625ac37e946aad3d787eeb549944d3f3bf0c/common/supplemental/supplementalMetadata.xml#L33 .
For the third point, it should be fine to hard-code the substitutions into ICU4X without involving data loading, since the mapping can be expected to be very stable.
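To make the three points concrete, here is a rough sketch of the pre-processing as a plain string rewrite that would run before handing the result to the existing strict parser. Everything named here (LegacyParseOptions, LEGACY_ALIASES, normalize_legacy_tag) is hypothetical and not part of ICU4X, and the table is only a small illustrative subset of the aliases; the CLDR file linked above remains the authoritative list.

```rust
/// Hypothetical options for the legacy entry point; not an existing ICU4X type.
#[derive(Clone, Copy, Default)]
pub struct LegacyParseOptions {
    /// When true, `_` is accepted wherever `-` is accepted.
    pub allow_underscore: bool,
}

/// Illustrative subset of deprecated whole-tag aliases (keys in lowercase).
/// Assumption: the real table would be taken from CLDR supplementalMetadata.xml.
const LEGACY_ALIASES: &[(&str, &str)] = &[
    ("art-lojban", "jbo"),
    ("en-gb-oed", "en-GB-oxendict"),
    ("i-klingon", "tlh"),
    ("i-lux", "lb"),
    ("i-navajo", "nv"),
    ("zh-min-nan", "nan"),
    ("zh-xiang", "hsn"),
];

/// Rewrites a legacy tag into a form that a strict UTS 35 parser could accept.
/// Inputs that need no rewriting are returned unchanged (apart from the
/// optional underscore replacement).
pub fn normalize_legacy_tag(input: &str, options: LegacyParseOptions) -> String {
    // Point 1: optionally treat `_` as `-`.
    let tag = if options.allow_underscore {
        input.replace('_', "-")
    } else {
        input.to_owned()
    };

    // Point 3: whole-tag substitution, compared ASCII case-insensitively.
    if let Some((_, replacement)) = LEGACY_ALIASES
        .iter()
        .find(|(legacy, _)| legacy.eq_ignore_ascii_case(&tag))
    {
        return String::from(*replacement);
    }

    // Point 2: a tag consisting only of private use gets an `und-` prefix.
    let bytes = tag.as_bytes();
    if bytes.len() >= 2 && bytes[0].eq_ignore_ascii_case(&b'x') && bytes[1] == b'-' {
        return format!("und-{tag}");
    }

    tag
}
```

With this sketch, "en-GB-oed" would come out as "en-GB-oxendict" and "x-foo" as "und-x-foo", and the output could then be passed to icu_locale_core::Locale::try_from_utf8 as bytes. Whether the real entry point should work on strings like this or integrate into the parser directly is an open design question.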
This would allow implementing https://drafts.csswg.org/selectors-4/#the-lang-pseudo with ICU4X. AFAICT, actually performing the substitutions per point 3 above is the correct way to go according to the CSS normative references, and this wouldn't change the value space of icu_locale_core::Locale objects.
(I'll file a separate issue about https://www.rfc-editor.org/rfc/rfc4647#section-3.3.2 )
I'm reluctant to accept it into ICU4X. It seems out of scope for a JS ECMA-402, and a candidate for a separate parser that parses to an ICU4X Locale. This position is weakly held and I'm open to being convinced away from it.
I've seen similar issues come in multiple times throughout ICU4X's lifetime. It definitely seems that there is demand for some API that takes some garbagey locale-like string and makes it into a nice ICU4X Locale.
Another use case was supporting the old ICU locale syntax like ar@numbers=latin (#4566).
We should consider putting the API on LocaleCanonicalizer, which already loads a bunch of data (and could load a bit more). Something like:
impl LocaleCanonicalizer[Borrowed] {
    pub fn parse_locale(&self, locale_str: &str, options: ParseOptions) -> Result<Locale, ParseError>
}
I think the "lenient on the input" parser is different from "strict according to spec Y" parser. I think browsers want the latter here, and I think there's value in such a parser for each spec. I'm reluctant to bring the set of such parsers into ICU4X.
Triage: Generally we'd like something like this (people have asked for it multiple times). Priority backlog milestone as nobody has currently committed to having it happen in a particular release stream.