ICU4X Locale capabilities should enable implementation of CSS lang()

Open hsivonen opened this issue 7 months ago • 3 comments

Currently ICU4X Locale supports UTS 35, which excludes parts of BCP 47, specifically "non-canonical", "irregular", or "privateuse".

In https://bugzilla.mozilla.org/show_bug.cgi?id=1857742 , Gecko restored support for these non-UTS 35 deprecated tags in CSS lang(). Unfortunately, a GitHub search shows that lang="en-GB-oed" has made it to some copypasteable template, so chances are that it would actually be a Web compat issue to reject some of the inputs that icu_locale_core::Locale::try_from_utf8 rejects.

We should add a legacy parsing entry point that, compared to icu_locale_core::Locale::try_from_utf8:

  1. Optionally, as controlled by a parameter, allows an underscore where try_from_utf8 allows only a hyphen.
  2. Parses inputs that start with x- as if they started with und-x-.
  3. Performs the substitutions documented at https://github.com/unicode-org/cldr/blob/fe0c625ac37e946aad3d787eeb549944d3f3bf0c/common/supplemental/supplementalMetadata.xml#L33 .

For the third point, it should be fine to hard-code the substitutions into ICU4X without involving data loading, since the mapping can be expected to be very stable.
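As a rough illustration of the three points above, here is a std-only Rust sketch that normalizes the input string before it would be handed to a strict parser such as icu_locale_core::Locale::try_from_utf8. The names Separators, LEGACY_TAGS, and normalize_legacy_tag are hypothetical and not part of ICU4X, and the table is only a small illustrative subset; the linked CLDR file remains the authoritative list.

```rust
/// Hypothetical option controlling point 1 (not an ICU4X type).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Separators {
    HyphenOnly,
    HyphenOrUnderscore,
}

/// Illustrative subset of whole-tag substitutions (point 3).
/// Keys are lowercase; the full list lives in CLDR supplementalMetadata.xml.
const LEGACY_TAGS: &[(&str, &str)] = &[
    ("art-lojban", "jbo"),
    ("en-gb-oed", "en-GB-oxendict"),
    ("i-klingon", "tlh"),
    ("i-navajo", "nv"),
    ("zh-min-nan", "nan"),
];

/// Rewrites a legacy tag into a form a strict UTS 35 parser could accept.
/// This only rewrites the string; validation is left to the strict parser.
pub fn normalize_legacy_tag(input: &str, separators: Separators) -> String {
    // Point 1: optionally treat '_' as a subtag separator.
    let tag = if separators == Separators::HyphenOrUnderscore {
        input.replace('_', "-")
    } else {
        input.to_string()
    };

    // Point 3: hard-coded, case-insensitive, whole-tag substitutions.
    let lower = tag.to_ascii_lowercase();
    if let Some((_, replacement)) = LEGACY_TAGS.iter().find(|(legacy, _)| *legacy == lower) {
        return replacement.to_string();
    }

    // Point 2: a privateuse-only tag is parsed as if prefixed with "und-".
    if lower.starts_with("x-") {
        return format!("und-{}", tag);
    }

    tag
}
```

With Separators::HyphenOrUnderscore, en_GB_oed becomes en-GB-oxendict, and x-private becomes und-x-private in either mode. With Separators::HyphenOnly, en_US is passed through unchanged, so the strict parser would still reject it.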

This would allow implementing https://drafts.csswg.org/selectors-4/#the-lang-pseudo with ICU4X. AFAICT, actually performing the substitutions per point 3 above is the correct way to go according to the CSS normative references, and this wouldn't change the value space of icu_locale_core::Locale objects.

(I'll file a separate issue about https://www.rfc-editor.org/rfc/rfc4647#section-3.3.2 )
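For context, the extended filtering algorithm in that RFC section can be sketched in std-only Rust as below. The function name extended_filter_matches is hypothetical, the inputs are assumed to be well-formed hyphen-separated strings, and this is not an ICU4X API.

```rust
/// Sketch of RFC 4647 section 3.3.2 extended filtering.
/// `range` is the language range (may contain "*"), `tag` is the language tag.
/// Comparison is ASCII case-insensitive.
pub fn extended_filter_matches(range: &str, tag: &str) -> bool {
    let mut range_subtags = range.split('-');
    let mut tag_subtags = tag.split('-');

    // The first subtags must match, unless the range starts with "*".
    let first_range = range_subtags.next().unwrap_or("");
    let first_tag = tag_subtags.next().unwrap_or("");
    if first_range != "*" && !first_range.eq_ignore_ascii_case(first_tag) {
        return false;
    }

    for wanted in range_subtags {
        // A wildcard after the first position matches zero or more subtags.
        if wanted == "*" {
            continue;
        }
        loop {
            match tag_subtags.next() {
                // The tag ran out before the range did.
                None => return false,
                // Found the wanted subtag; move on to the next range subtag.
                Some(have) if have.eq_ignore_ascii_case(wanted) => break,
                // Singletons (such as "x" or "u") may not be skipped over.
                Some(have) if have.len() == 1 => return false,
                // Any other subtag may be skipped.
                Some(_) => {}
            }
        }
    }
    true
}
```

Under this sketch, the range de-DE matches de-Latn-DE-1996 but not de-x-DE, which agrees with the examples given in the RFC.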

hsivonen, Jun 12 '25 14:06

I'm reluctant to accept it into ICU4X. It seems out of scope for JS ECMA-402, and a candidate for a separate parser that parses to an ICU4X Locale. This position is weakly held and I'm open to being convinced away from it.

zbraniecki, Jun 12 '25 15:06

I've seen similar issues come in multiple times throughout ICU4X's lifetime. It definitely seems that there is demand for some API that takes some garbagey locale-like string and makes it into a nice ICU4X Locale.

Another use case was supporting the old ICU locale syntax like ar@numbers=latin (#4566).
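To make that use case concrete, here is a minimal std-only Rust sketch that rewrites the old keyword syntax into a BCP 47 -u- extension. The names LEGACY_KEYS and icu_legacy_to_bcp47 are hypothetical, the key table is an illustrative subset, and keyword values are passed through unchanged (only lowercased); a real implementation would likely also need value aliases, which this sketch does not attempt.

```rust
/// Illustrative subset of legacy ICU keyword names and their `-u-` keys.
const LEGACY_KEYS: &[(&str, &str)] = &[
    ("calendar", "ca"),
    ("collation", "co"),
    ("currency", "cu"),
    ("numbers", "nu"),
];

/// Rewrites `base@key=value;key=value` into `base-u-kk-value-kk-value`.
/// Returns None for a malformed keyword pair or a key missing from the table.
pub fn icu_legacy_to_bcp47(input: &str) -> Option<String> {
    // Split the base locale from the optional keyword list.
    let (base, keywords) = match input.split_once('@') {
        Some((base, keywords)) => (base, keywords),
        None => (input, ""),
    };

    let mut extension = String::new();
    for pair in keywords.split(';').filter(|p| !p.is_empty()) {
        let (key, value) = pair.split_once('=')?;
        let (_, short) = LEGACY_KEYS.iter().find(|(long, _)| *long == key)?;
        extension.push('-');
        extension.push_str(short);
        extension.push('-');
        // Values are not aliased here, only lowercased.
        extension.push_str(&value.to_ascii_lowercase());
    }

    // The old syntax uses '_' between subtags of the base locale.
    let mut tag = base.replace('_', "-");
    if !extension.is_empty() {
        tag.push_str("-u");
        tag.push_str(&extension);
    }
    Some(tag)
}
```

The resulting string would still need to go through a strict parser; this step only reshapes the syntax.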

We should consider putting the API on LocaleCanonicalizer, which already loads a bunch of data (and could load a bit more). Something like:

impl LocaleCanonicalizer[Borrowed] {
    pub fn parse_locale(&self, locale_str: &str, options: ParseOptions) -> Result<Locale, ParseError>
}

sffc, Jun 13 '25 21:06

I think a "lenient on the input" parser is different from a "strict according to spec Y" parser. I think browsers want the latter here, and I think there's value in such a parser for each spec. I'm reluctant to bring the set of such parsers into ICU4X.

zbraniecki, Jun 13 '25 22:06

Triage: Generally we'd like something like this (people have asked for it multiple times). Priority backlog milestone as nobody has currently committed to having it happen in a particular release stream.

Manishearth, Aug 22 '25 21:08