ecma402 `BestAvailableLocale` operation requires additional clarifications/changes to handle extensions

Hello! I'm part of the development team of boa, where we are developing an ECMAScript engine with Intl support, and we have a bit of a problem with how the BestAvailableLocale operation is described.

For context, I had a discussion with some of the icu4x folks (https://github.com/boa-dev/boa/pull/2072#issuecomment-1141625219) and we discovered that the BestAvailableLocale operation, as currently described in the latest revision, tries to match invalid Unicode BCP 47 locale identifiers when executing the algorithm described.

https://github.com/boa-dev/boa/pull/2072#issuecomment-1141836914

I don't think Language Identifiers are the problem per se; I can easily manipulate LanguageIdentifiers using the provided API, and a canonicalized Language Identifier cannot become invalid by following the spec algorithm. The problem is extensions, which can become invalid if the algorithm is followed as stated. For example, with `hi-t-en-h0-hybrid` it would try to match `hi-t-en-h0`, which has an invalid `t` extension because the `h0` key requires a value.
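To make the failure concrete, here is a minimal sketch of the truncation loop as I read it in the spec revision under discussion, operating on plain strings. The names `naive_candidates` and `has_dangling_tkey` are hypothetical, and the `tkey` check is a rough heuristic (a `t` singleton somewhere, followed by a final two-character letter+digit subtag), not a full Unicode BCP 47 validator.

```rust
/// Lists every candidate the string-based truncation loop would try, in order.
fn naive_candidates(locale: &str) -> Vec<String> {
    let mut candidates = vec![locale.to_string()];
    let mut candidate = locale;
    while let Some(mut pos) = candidate.rfind('-') {
        // If a singleton subtag sits right before the cut point
        // (e.g. the "t" in "hi-t-en"), drop it together with its separator.
        if pos >= 2 && candidate.as_bytes()[pos - 2] == b'-' {
            pos -= 2;
        }
        candidate = &candidate[..pos];
        candidates.push(candidate.to_string());
    }
    candidates
}

/// Heuristic: true when the candidate contains a `t` singleton and ends
/// in a tkey-shaped subtag (letter + digit) that has no value after it.
fn has_dangling_tkey(candidate: &str) -> bool {
    let has_t = candidate.split('-').any(|s| s.eq_ignore_ascii_case("t"));
    let last = candidate.rsplit('-').next().unwrap_or("").as_bytes();
    has_t
        && last.len() == 2
        && last[0].is_ascii_alphabetic()
        && last[1].is_ascii_digit()
}
```

For `hi-t-en-h0-hybrid` the loop visits `hi-t-en-h0-hybrid`, `hi-t-en-h0`, `hi-t-en`, and `hi`; the second candidate is the one flagged by `has_dangling_tkey`.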

The team kindly recommended that we treat the input locale as a simple Language Identifier instead, which should work but doesn't match the algorithm as described. Even so, the question about extensions remains, so I'm opening this issue to request additional clarifications on BestAvailableLocale. Some of our questions are:

Should we treat locale as a plain Language Identifier?
Should we extract the Language Identifier part from locale and try to match with that, then append the removed extensions to the matched locale?
Is there another way to describe this algorithm that doesn't generate invalid locales and doesn't treat them as simple string literals?
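To illustrate the second question, here is a rough sketch of "match on the Language Identifier, then re-append the extensions". It works on strings only to stay self-contained; `split_extensions` and `best_available_with_extensions` are hypothetical names, and whether re-appending is the desired behavior is exactly what is being asked.

```rust
/// Splits a locale at its first singleton subtag.
/// Returns (language identifier, extensions including the leading '-').
fn split_extensions(locale: &str) -> (&str, &str) {
    let mut offset = 0; // byte index where the current subtag starts
    for (i, subtag) in locale.split('-').enumerate() {
        if i > 0 && subtag.len() == 1 {
            // Cut just before the '-' that precedes the singleton.
            return (&locale[..offset - 1], &locale[offset - 1..]);
        }
        offset += subtag.len() + 1;
    }
    (locale, "")
}

/// Matches only the language identifier part, trimming trailing subtags,
/// then re-appends the untouched extensions to whatever matched.
fn best_available_with_extensions(available: &[&str], locale: &str) -> Option<String> {
    let (mut candidate, extensions) = split_extensions(locale);
    loop {
        if available.contains(&candidate) {
            return Some(format!("{}{}", candidate, extensions));
        }
        // No '-' left means nothing matched: return None.
        candidate = &candidate[..candidate.rfind('-')?];
    }
}
```

With `["hi", "en"]` as the available locales, `hi-Deva-IN-t-en-h0-hybrid` would resolve to `hi-t-en-h0-hybrid`, and no intermediate candidate ever carries a truncated extension.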

Thanks!

cc @zbraniecki @sffc

May 31 '22 22:05 jedel1043

> The team kindly recommended us to treat the input locale as a simple Language Identifier instead, which is a bit of a "hack" but it should work.

FWIW, I don't think it's a hack. It is what the operation was meant to be: take a LanguageIdentifier and cut from the right. It's a dummy approach (we need data to do better), but okayish. What we didn't anticipate when designing it is that we'd have extensions in play. No CLDR data has extensions to match on, so we can safely just remove them.

Jun 01 '22 02:06 zbraniecki

A very basic algorithm is to first clear all extension keywords, which you can do by pulling the LanguageIdentifier out of the Locale, and then remove trailing subtags one by one.
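A minimal string-based sketch of that two-step approach, for illustration only. A real implementation would pull the LanguageIdentifier out of a parsed Locale rather than scan for singletons, and the function names here are hypothetical.

```rust
/// Step 1: drop everything from the first singleton subtag onward,
/// leaving only the language identifier part.
fn strip_extensions(locale: &str) -> &str {
    let mut end = 0;
    for (i, subtag) in locale.split('-').enumerate() {
        if i > 0 && subtag.len() == 1 {
            break; // singleton: start of the extensions
        }
        // Count the subtag plus its leading '-' (none before the first one).
        end += subtag.len() + usize::from(i > 0);
    }
    &locale[..end]
}

/// Step 2: list the language identifier and each shorter prefix obtained
/// by removing one trailing subtag at a time.
fn fallback_chain(locale: &str) -> Vec<&str> {
    let mut candidate = strip_extensions(locale);
    let mut chain = vec![candidate];
    while let Some(pos) = candidate.rfind('-') {
        candidate = &candidate[..pos];
        chain.push(candidate);
    }
    chain
}

/// Returns the first entry of the chain that is in the available set.
fn best_available_locale<'a>(available: &[&str], locale: &'a str) -> Option<&'a str> {
    fallback_chain(locale)
        .into_iter()
        .find(|candidate| available.contains(candidate))
}
```

For `zh-Hant-TW-u-nu-hanidec` the chain is `zh-Hant-TW`, `zh-Hant`, `zh`, so every candidate is a valid language identifier by construction.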

Jun 01 '22 03:06 sffc

> > The team kindly recommended us to treat the input locale as a simple Language Identifier instead, which is a bit of a "hack" but it should work.
>
> FWIW, I don't think it's a hack. It is what the operation was meant to be: take a LanguageIdentifier and cut from the right. It's a dummy approach (we need data to do better), but okayish. What we didn't anticipate when designing it is that we'd have extensions in play. No CLDR data has extensions to match on, so we can safely just remove them.

Completely agree. I edited the issue to better express this.

Jun 01 '22 03:06 jedel1043

@sffc should we clarify it in the spec? The current algo makes it seem like you should keep cutting from the tail end, one subtag after another, including extensions, which, on top of being useless, leads to invalid locale states.

Jun 01 '22 06:06 zbraniecki

Yep; this issue is available for anyone who wants to work on it.

Jun 01 '22 06:06 sffc

Possibly related to #213

Jun 01 '22 06:06 sffc
