
Wrong locale evaluation on Unix-like systems without codeset but with modifier

Open pasabanov opened this issue 1 year ago • 6 comments

According to this specification, the POSIX locale is defined as:

language[_territory][.codeset][@modifier]

For example, the locale De_DE@dict is valid.
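To make the format concrete, here is a minimal sketch of splitting a POSIX locale string into its four parts. `PosixLocale` and `split_posix_locale` are hypothetical names used only for illustration; they are not part of sys-locale. The sketch assumes the modifier starts at the first `@` and the codeset at the first `.` before it, and it does no validation.

```rust
/// Components of `language[_territory][.codeset][@modifier]`.
/// Hypothetical type for illustration, not part of sys-locale.
#[derive(Debug, PartialEq)]
struct PosixLocale<'a> {
    language: &'a str,
    territory: Option<&'a str>,
    codeset: Option<&'a str>,
    modifier: Option<&'a str>,
}

/// Splits a POSIX locale string into its parts without validating them.
fn split_posix_locale(input: &str) -> PosixLocale<'_> {
    // Everything after the first '@' is the modifier.
    let (rest, modifier) = match input.split_once('@') {
        Some((rest, m)) => (rest, Some(m)),
        None => (input, None),
    };
    // In what remains, everything after the first '.' is the codeset.
    let (rest, codeset) = match rest.split_once('.') {
        Some((rest, c)) => (rest, Some(c)),
        None => (rest, None),
    };
    // Finally, '_' separates the language from the territory.
    let (language, territory) = match rest.split_once('_') {
        Some((l, t)) => (l, Some(t)),
        None => (rest, None),
    };
    PosixLocale { language, territory, codeset, modifier }
}
```

With this sketch, `split_posix_locale("De_DE@dict")` yields the language `De`, the territory `DE`, no codeset, and the modifier `dict`.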

However, the current implementation of the library does not check for the @ character, so it produces an invalid locale when the modifier is present but the codeset is not.

Example:

let fr_FR_euro = "fr_FR@euro".to_owned();
let mut env = MockEnv::new();
// Clone for each insert so the String is not used after being moved.
env.insert("LANGUAGE".into(), fr_FR_euro.clone());
env.insert("LC_ALL".into(), fr_FR_euro.clone());
env.insert("LC_MESSAGES".into(), fr_FR_euro.clone());
env.insert("LANG".into(), fr_FR_euro.clone());
env.insert("LC_CTYPE".into(), fr_FR_euro);
let locale = _get(&env);
// locale is equal to "fr-FR@euro" which is not a valid BCP 47 locale

The simplest possible solution would be:

  • [x] Crop the POSIX locale at the minimum index of . and @ characters. Resolved in #33.
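The cropping approach can be sketched roughly as follows. This is not the code from #33; `crop_posix_locale` and `to_bcp47_like` are hypothetical helpers, and the sketch assumes that replacing `_` with `-` is the only other conversion step.

```rust
/// Returns the `language[_territory]` part of a POSIX locale by cutting the
/// string at the first '.' or '@', whichever comes first.
/// Hypothetical helper for illustration.
fn crop_posix_locale(posix: &str) -> &str {
    // `find` with a closure returns the byte index of the first matching char.
    match posix.find(|c: char| c == '.' || c == '@') {
        Some(idx) => &posix[..idx],
        None => posix,
    }
}

/// Naive conversion: crop the codeset and modifier, then turn '_' into '-'.
fn to_bcp47_like(posix: &str) -> String {
    crop_posix_locale(posix).replace('_', "-")
}
```

Under these assumptions, both `fr_FR@euro` and `fr_FR.UTF-8@euro` come out as `fr-FR`, and the modifier is simply discarded.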

However, since some POSIX modifiers might be convertible to BCP 47, a more complex solution would be:

  • [ ] Implement full support for POSIX modifiers, meaning the library would use the modifier information to form the resulting BCP 47 locale.
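As a rough sketch of what modifier support could look like, the code below maps a few modifiers to BCP 47 subtags. The mapping table is an assumption chosen for illustration (script modifiers such as `@latin` and `@cyrillic`, and `@valencia` as a variant); it is not a complete or authoritative list, and `ModifierTag`, `map_modifier`, and `posix_to_bcp47` are hypothetical names. Unknown modifiers, including `@euro`, are dropped here.

```rust
/// BCP 47 subtag kinds a POSIX modifier might map to.
/// Hypothetical type for illustration.
enum ModifierTag {
    Script(&'static str),
    Variant(&'static str),
}

/// Illustrative, incomplete mapping table (an assumption, not a standard list).
fn map_modifier(modifier: &str) -> Option<ModifierTag> {
    match modifier.to_ascii_lowercase().as_str() {
        "latin" => Some(ModifierTag::Script("Latn")),
        "cyrillic" => Some(ModifierTag::Script("Cyrl")),
        "valencia" => Some(ModifierTag::Variant("valencia")),
        // Anything else (for example "euro") is dropped in this sketch.
        _ => None,
    }
}

/// Builds a tag in BCP 47 order: language, script, region, variant.
fn posix_to_bcp47(posix: &str) -> String {
    let (base, modifier) = match posix.split_once('@') {
        Some((b, m)) => (b, Some(m)),
        None => (posix, None),
    };
    // Drop the codeset, if any.
    let base = base.split('.').next().unwrap_or(base);
    let (language, territory) = match base.split_once('_') {
        Some((l, t)) => (l, Some(t)),
        None => (base, None),
    };
    let mapped = modifier.and_then(map_modifier);

    let mut tag = String::from(language);
    if let Some(ModifierTag::Script(script)) = &mapped {
        tag.push('-');
        tag.push_str(script);
    }
    if let Some(territory) = territory {
        tag.push('-');
        tag.push_str(territory);
    }
    if let Some(ModifierTag::Variant(variant)) = &mapped {
        tag.push('-');
        tag.push_str(variant);
    }
    tag
}
```

With this table, `sr_RS@latin` becomes `sr-Latn-RS` and `ca_ES@valencia` becomes `ca-ES-valencia`, while `fr_FR@euro` still falls back to `fr-FR`. A small hand-written table like this would avoid pulling in any locale data crates, at the cost of covering only the modifiers someone has listed.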

pasabanov avatar Sep 22 '24 14:09 pasabanov

> Implement full support for POSIX modifier

Do you have a rough idea what this would entail? This is an area I'm not familiar with. The level 2 canonicalization described here looks close to what you are mentioning, but I'm not 100% sure!

Generally I'd be fine making this handling more robust at the cost of complexity as long as we don't need to drag in any large ICU crates to handle bundling various chunks of locale/region data used for mapping correctly.

complexspaces avatar Sep 26 '24 05:09 complexspaces

> Do you have a rough idea what this would entail? This is an area I'm not familiar with. The level 2 canonicalization described here looks close to what you are mentioning, but I'm not 100% sure!

I'm not an expert in this field either. The link that you attached is about ICU locales. As I understand it, this is a third locale type alongside POSIX and BCP 47. Your library works with BCP 47 locales, as far as I know, so the conversion algorithm should be different.

For now, I'm unsure what exactly the algorithm should be.

Some useful links for further investigation:

  • ~~The Open Group Base Specifications Issue 7, 2018 edition 7. Locale: https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap07.html~~ (old version)
  • ~~The Open Group Base Specifications Issue 7, 2018 edition 8. Environment Variables: https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap08.html~~ (old version)
  • The Open Group Base Specifications Issue 8 7. Locale: https://pubs.opengroup.org/onlinepubs/9799919799/basedefs/V1_chap07.html (new version)
  • The Open Group Base Specifications Issue 8 8. Environment Variables: https://pubs.opengroup.org/onlinepubs/9799919799/basedefs/V1_chap08.html (new version)
  • BCP 47 specification: https://www.ietf.org/rfc/bcp/bcp47.html
  • BCP 47 extensions: https://github.com/unicode-org/cldr/tree/main/common/bcp47

pasabanov avatar Sep 26 '24 20:09 pasabanov

> Your library is working with BCP 47 locales, as far as I know, so the conversion algorithm should be different.

I mentioned it because the ICU format appears incredibly similar to the BCP 47 format, and places like MSDN say "This format is used by Windows and many other environments, including ... ICU, ...". The ICU formalization seems to have, at minimum, cribbed the region code variants and handling.

If the above holds up (I could try doing a more detailed comparison), then this statement is valid for sys-locale's considerations because we are parsing POSIX locales:

> Level 2 canonicalization is designed to translate POSIX and .NET IDs, as well as nonstandard ICU locale IDs.

complexspaces avatar Sep 27 '24 06:09 complexspaces

This should be reopened because the conversation about converting POSIX modifiers to BCP 47 is still unresolved.

pasabanov avatar Sep 29 '24 18:09 pasabanov

Whoops, you're right. The GitHub autoclose syntax grabbed it by mistake in your PR.

complexspaces avatar Sep 29 '24 18:09 complexspaces

That's because I wrote "partially resolves ..." there. GitHub didn't recognize the word "partially".

pasabanov avatar Sep 29 '24 18:09 pasabanov