`BestAvailableLocale` operation requires additional clarifications/changes to handle extensions
Hello! I'm part of the development team of boa, where we are developing an ECMAScript engine with Intl support, and we have a bit of a problem with how the BestAvailableLocale operation is described.
For context, I had a discussion with some of the icu4x folks (https://github.com/boa-dev/boa/pull/2072#issuecomment-1141625219) and we discovered that the BestAvailableLocale operation, as currently described in the latest revision, tries to match invalid Unicode BCP 47 locale identifiers when executing the algorithm described.
https://github.com/boa-dev/boa/pull/2072#issuecomment-1141836914
I don't think Language Identifiers are the problem per se; I can easily manipulate LanguageIdentifiers using the provided API, and a canonicalized Language Identifier cannot become invalid by following the spec algorithm. The problem is extensions, which can become invalid if the algorithm is followed as stated. For example, with `hi-t-en-h0-hybrid` it would try to match `hi-t-en-h0`, which has an invalid `t` extension since the `h0` key ought to have a value.
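To make the problem concrete, here is a minimal sketch of the string truncation that the spec text describes, written in plain Rust with no ICU4X types. The function name `truncation_candidates` is hypothetical, and the sketch assumes an ASCII, well-formed input tag. It only lists the candidates that would be tested against the available locales. It does not check validity or perform the lookup.

```rust
/// Lists every candidate that the string-based truncation would test,
/// in order, for the given locale tag.
/// Assumption: `locale` is an ASCII, well-formed tag.
fn truncation_candidates(locale: &str) -> Vec<String> {
    let mut candidates = vec![locale.to_string()];
    let mut candidate = locale.to_string();
    // Cut at the last '-' until no '-' is left.
    while let Some(mut pos) = candidate.rfind('-') {
        // If the cut would leave a dangling singleton such as "-t",
        // drop the singleton as well.
        if pos >= 2 && candidate.as_bytes()[pos - 2] == b'-' {
            pos -= 2;
        }
        candidate.truncate(pos);
        candidates.push(candidate.clone());
    }
    candidates
}
```

For `hi-t-en-h0-hybrid` this yields `hi-t-en-h0-hybrid`, `hi-t-en-h0`, `hi-t-en`, `hi`. The second candidate is the one with the valueless `h0` key.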
The team kindly recommended that we treat the input locale as a simple Language Identifier instead, which should work but doesn't match the algorithm described. Knowing that, the question about extensions still remains, and I would like to open this issue as a request to add additional clarifications on BestAvailableLocale. Some of our questions are:
- Should we treat `locale` as a plain Language Identifier?
- Should we extract the Language Identifier part from `locale` and try to match with that, then append the removed extensions to the matched locale?
- Is there another way to describe this algorithm that doesn't generate invalid locales and doesn't treat them as simple string literals?
Thanks!
cc @zbraniecki @sffc
> The team kindly recommended us to treat the input locale as a simple Language Identifier instead, which is a bit of a "hack" but it should work.
FWIW, I don't think it's a hack. It is what the operation was meant to be: take a LanguageIdentifier and cut from the right. It's dumb (we need data to do better), but okayish. What we didn't anticipate when designing it is that we'd have extensions in play. No CLDR data has extensions to match on, so we can safely just remove them.
A very basic algorithm is to first clear all extension keywords, which you can do by pulling the LanguageIdentifier out of the Locale, and then removing trailing subtags one by one.
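The basic approach described here can be sketched as follows, again in plain Rust on strings. The function name `best_available_locale` and its signature are illustrative only, not the spec's or ICU4X's API. The sketch assumes a well-formed, canonicalized Unicode BCP 47 locale identifier, so that the first single-character subtag marks the start of the extensions and everything before it is the Language Identifier.

```rust
/// Returns the longest prefix of the locale's Language Identifier that is
/// present in `available`, ignoring all extensions.
/// Assumption: `locale` is well-formed and canonicalized, so the first
/// singleton subtag (length 1) starts the extension section.
fn best_available_locale(available: &[&str], locale: &str) -> Option<String> {
    // Keep only the Language Identifier: language, script, region, variants.
    let mut subtags: Vec<&str> = locale
        .split('-')
        .take_while(|subtag| subtag.len() > 1)
        .collect();

    // Remove trailing subtags one by one until a match is found.
    while !subtags.is_empty() {
        let candidate = subtags.join("-");
        if available.iter().any(|a| *a == candidate) {
            return Some(candidate);
        }
        subtags.pop();
    }
    None
}
```

With `available = ["hi", "en"]`, the input `hi-t-en-h0-hybrid` matches `hi` without ever building an invalid intermediate tag. In an engine that already uses ICU4X, the extension-stripping step would more naturally be done by taking the Language Identifier from the parsed locale rather than by splitting strings.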
> > The team kindly recommended us to treat the input locale as a simple Language Identifier instead, which is a bit of a "hack" but it should work.
>
> FWIW, I don't think it's a hack. It is what the operation was meant to be: take a LanguageIdentifier and cut from the right. It's dumb (we need data to do better), but okayish. What we didn't anticipate when designing it is that we'd have extensions in play. No CLDR data has extensions to match on, so we can safely just remove them.
Completely agree. I edited the issue to better express this.
@sffc should we clarify it in the spec? The current algo makes it seem like you should keep cutting out from the tail end one subtag after another including extensions. Which, on top of being useless, leads to invalid states of locale.
Yep; this issue is available for anyone who wants to work on it.
Possibly related to #213