
`Cldr.AcceptLanguage.best_match` not returning nearest locale

Open hugomorg opened this issue 2 months ago • 40 comments

Hi @kipcole9. First of all, thanks for the great lib!

We are using cldr for matching "accept-language" headers to locales which we support, so that we can translate content properly (via gettext), specifically, the best match function.

The issue we are facing is that some locales which we do not support, are not falling back to nearest locales. For example, given that we support "es_ES", "en_US", "zh_CN", "zh_HK":

gettext_locale_name = fn locale ->
  {:ok, tag} = MyApp.Cldr.AcceptLanguage.best_match(locale)
  tag.gettext_locale_name
end

iex> gettext_locale_name.("es-ES")
"es_ES"

iex> gettext_locale_name.("es-US")
nil # should be "es_ES"

iex> gettext_locale_name.("en-AU")
"en" # should be "en_US"

iex> gettext_locale_name.("zh-Hans")
"zh_CN"

iex> gettext_locale_name.("zh-Hant")
nil # should be "zh_HK"

I couldn't find a function in cldr which improves on this.

A workaround is stripping off the variant suffix and going through the function again with the top level language, e.g. "es". Alternatively for Chinese variants we can look at the script. But wondering if there is a less manual way to do this.
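As a rough illustration of the manual workaround described above, the sketch below tries the exact name first, then the bare language, then any supported locale that shares the language. The module `LocaleFallback` and its `resolve/2` function are hypothetical names invented for this example; they are not part of ex_cldr or Gettext.

```elixir
defmodule LocaleFallback do
  # Hypothetical helper for illustration only; not an ex_cldr or Gettext API.

  @doc """
  Returns a supported Gettext locale for a BCP 47 tag, or nil.
  Tries the exact name, then the bare language, then the first
  supported locale that shares the language.
  """
  def resolve(tag, supported) do
    # "es-US" -> "es_US", the separator used in the Gettext locale names
    exact = String.replace(tag, "-", "_")
    # "es-US" -> "es"
    [language | _rest] = String.split(tag, ["-", "_"])

    cond do
      exact in supported -> exact
      language in supported -> language
      true -> Enum.find(supported, &String.starts_with?(&1, language <> "_"))
    end
  end
end

supported = ["es_ES", "en_US", "zh_CN", "zh_HK"]
IO.inspect(LocaleFallback.resolve("es-US", supported))
IO.inspect(LocaleFallback.resolve("fr-FR", supported))
```

With the supported list from this report, the first call prints "es_ES" and the second prints nil. Note that this naive version ignores the script subtag, so "zh-Hant" would land on "zh_CN"; that is exactly the case that needs the extra script handling mentioned above.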

In case we misconfigured something, our config is:

defmodule MyApp.Cldr do
  use Cldr,
    otp_app: :my_app,
    gettext: MyApp.Gettext,
    providers: [],
    locales:
      MyApp.Gettext
      |> Gettext.known_locales()
      |> Enum.map(&String.replace(&1, "_", "-"))
end

Tagging my colleague @andreyuhai

hugomorg avatar Oct 07 '25 09:10 hugomorg

Thanks for the kind words, it's much appreciated.

I regret that it's unlikely that MyApp.Cldr.AcceptLanguage.best_match/1 is going to satisfy your requirements - at least in its current form. CLDR has no way to fall back to locale names it doesn't know about, and it doesn't know about en-US, es-ES or zh-HK.

How CLDR names locales

This is because in CLDR there is no such locale as en-US, es-ES or zh-Hans. CLDR's naming convention is that, for each language, the territory with the most native speakers gets the plain, undecorated locale ID. For example:

iex(1)> Cldr.validate_locale!("en-US").cldr_locale_name
:en
iex(2)> Cldr.validate_locale!("es-ES").cldr_locale_name
:es

es-US will fall back to es, not es-ES

Note that the resolved locale will, in ex_cldr, still have the territory set to US because that is valid. But the underlying CLDR locale name is es, not es-ES.

iex(1)> {:ok, locale} = MyApp.Cldr.AcceptLanguage.best_match("es-US")
{:ok, MyApp.Cldr.Locale.new!("es-US")}
iex(2)> locale.cldr_locale_name
:es

That means that the best match - which is the best match to a CLDR Locale ID - can never match en-US, es-ES. Or even pt-BR.

In your other example:

  • en-AU will fall back to en by the same rules. en is the CLDR locale name for en-US.
  • zh-Hant exists as a CLDR locale name but it's not configured in your CLDR backend. If it were, it would resolve to that name (which, by the same rules, would effectively resolve to zh-Hant-TW). In CLDR there is both zh-Hant-HK and zh-Hans-HK.

And the one that surprises me:

Your example shows:

iex> gettext_locale_name.("zh-Hans")
"zh_CN"

But I see:

iex(8)> {:ok, locale} = MyApp.Cldr.AcceptLanguage.best_match("zh-Hans")
{:ok, MyApp.Cldr.Locale.new!("zh-Hans-CN")}
iex(9)> locale.cldr_locale_name
:zh

Which is what I expected. Again, by the rules, zh has what CLDR calls "likely subtags" and they include Hans for the script and CN for the territory. You can see what the likely subtags are by:

iex(10)> Cldr.Locale.likely_subtags("zh")
#Cldr.LanguageTag<zh-Hans-CN [parsed]>

Possible path forward

The most immediate path forward I see, if you want to use MyApp.Cldr.AcceptLanguage.best_match/1, is to rename your Gettext locales to be consistent with the CLDR locale naming conventions.

Next steps

I am curious about how a best match for zh-Hans came to be zh-CN, since there is no such locale ID, so I need to investigate that further - unless perhaps that was a copy/paste error on your side?

kipcole9 avatar Oct 07 '25 11:10 kipcole9

Are you able to share what the output of:

  1. MyApp.Cldr.known_locale_names()
  2. MyApp.Cldr.__cldr__(:config)
  3. Gettext.known_locales(MyApp.Gettext)

Look like?

kipcole9 avatar Oct 07 '25 11:10 kipcole9

This is because in CLDR there is no such locale as en-US, es-ES or zh-Hans. CLDR's naming convention is that, for each language, the territory with the most native speakers gets the plain, undecorated locale ID. For example:

iex(1)> Cldr.validate_locale!("en-US").cldr_locale_name
:en
iex(2)> Cldr.validate_locale!("es-ES").cldr_locale_name
:es

I can see from the linked repo, we do indeed have es_ES, en_US but not zh_Hans. But those files have basically no content, which I suppose is what you also mean by no locale.

However, I was under the impression that CLDR used BCP47 locale formats (which is the standard we follow), or at least that a mapping between the two here would happen given that the "accept-language" header uses this format.

es-US will fall back to es, not es-ES

Note that the resolved locale will, in ex_cldr, still have the territory set to US because that is valid. But the underlying CLDR locale name is es, not es-ES.

iex(1)> {:ok, locale} = MyApp.Cldr.AcceptLanguage.best_match("es-US")
{:ok, MyApp.Cldr.Locale.new!("es-US")}
iex(2)> locale.cldr_locale_name
:es

That means that the best match - which is the best match to a CLDR Locale ID - can never match en-US, es-ES. Or even pt-BR.

Aha, maybe that's where we misunderstood. Because we were assuming that, "given this list of locales we want to support, we want the closest match to be returned from this list".

In your other example:

* `en-AU` will fall back to `en` by the same rules. `en` _is_ the CLDR locale name for `en-US`.

* `zh-Hant` exists as a CLDR locale name but it's not configured in your CLDR backend. If it were, it would resolve to that name (which, by the same rules, would effectively resolve to `zh-Hant-TW`). In CLDR there is both `zh-Hant-HK` and `zh-Hans-HK`.

Yep, but under BCP47 I think that zh-Hant is still valid as traditional Chinese script, whose nearest match should be zh-HK.

And the one that surprises me:

Your example shows:

iex> gettext_locale_name.("zh-Hans")
"zh_CN"

But I see:

iex(8)> {:ok, locale} = MyApp.Cldr.AcceptLanguage.best_match("zh-Hans")
{:ok, MyApp.Cldr.Locale.new!("zh-Hans-CN")}
iex(9)> locale.cldr_locale_name
:zh

Which is what I expected. Again, by the rules, zh has what CLDR calls "likely subtags" and they include Hans for the script and CN for the territory. You can see what the likely subtags are by:

iex(10)> Cldr.Locale.likely_subtags("zh")
#Cldr.LanguageTag<zh-Hans-CN [parsed]>

Possible path forward

The most immediate path forward I see, if you want to use MyApp.Cldr.AcceptLanguage.best_match/1, is to rename your Gettext locales to be consistent with the CLDR locale naming conventions.

Next steps

I am curious about how a best match for zh-Hans came to be zh-CN, since there is no such locale ID, so I need to investigate that further - unless perhaps that was a copy/paste error on your side?

Sure, let me paste full output:

iex> show_locale = fn locale -> {:ok, locale} = MarketplaceSearch.Cldr.AcceptLanguage.best_match(locale); Map.from_struct(locale) end

iex> show_locale.("zh-Hans")
%{
  script: :Hans,
  extensions: %{},
  transform: %{},
  language: "zh",
  locale: %{},
  backend: MarketplaceSearch.Cldr,
  cldr_locale_name: :zh,
  gettext_locale_name: "zh_CN",
  territory: :CN,
  requested_locale_name: "zh-Hans",
  canonical_locale_name: "zh-Hans",
  language_variants: [],
  rbnf_locale_name: :zh,
  language_subtags: [],
  private_use: []
}

hugomorg avatar Oct 07 '25 16:10 hugomorg

Are you able to share what the output of:

1. `MyApp.Cldr.known_locale_names()`
iex(5)> MarketplaceSearch.Cldr.known_locale_names
[:ar, :bg, :cs, :da, :de, :el, :en, :"en-CA", :"en-GB", :es, :fi, :fr, :"fr-CA",
 :hr, :hu, :id, :it, :ja, :lt, :ms, :nb, :nl, :pl, :pt, :ro, :ru, :sl, :sv, :th,
 :uk, :vi, :zh]
2. `MyApp.Cldr.__cldr__(:config)`
iex(6)> MarketplaceSearch.Cldr.__cldr__(:config)
%Cldr.Config{
  default_locale: :en,
  locales: [:ar, :bg, :cs, :da, :de, :el, :en, :"en-CA", :"en-GB", :es, :fi,
   :fr, :"fr-CA", :hr, :hu, :id, :it, :ja, :lt, :ms, :nb, :nl, :pl, :pt, :ro,
   :ru, :sl, :sv, :th, :uk, :und, :vi, :zh],
  add_fallback_locales: false,
  backend: MarketplaceSearch.Cldr,
  gettext: MarketplaceSearch.Gettext,
  data_dir: "/app/_build/prod/lib/marketplace_search/priv/cldr",
  providers: [],
  precompile_number_formats: [],
  precompile_transliterations: [],
  precompile_date_time_formats: [],
  precompile_interval_formats: [],
  default_currency_format: nil,
  otp_app: :marketplace_search,
  generate_docs: true,
  suppress_warnings: false,
  message_formats: %{},
  force_locale_download: false,
  https_proxy: nil
}
3. `Gettext.known_locales(MyApp.Gettext)`
iex(7)> Gettext.known_locales(MarketplaceSearch.Gettext)
["ar", "bg_BG", "cs_CZ", "da_DK", "de_DE", "el_GR", "en_CA", "en_GB", "en_US",
 "es_ES", "fi_FI", "fr_CA", "fr_FR", "hr_HR", "hu_HU", "id_ID", "it_IT",
 "ja_JP", "lt_LT", "ms_MY", "nb_NO", "nl_NL", "pl_PL", "pt", "pt_BR", "ro_RO",
 "ru_RU", "sl_SI", "sv_SE", "th_TH", "uk_UA", "vi_VN", "zh_CN", "zh_HK"]

Look like?

Thanks for the prompt response.

hugomorg avatar Oct 07 '25 16:10 hugomorg

Yep, but under BCP47 I think that zh-Hant is still valid as traditional Chinese script, whose nearest match should be zh-HK.

zh-Hant is definitely a valid locale name and there is a CLDR locale called zh-Hant. If you configure zh-Hant-HK in your ex_cldr backend you will then see:

iex(1)> {:ok, locale} = MyApp.Cldr.AcceptLanguage.best_match("zh-HK")
{:ok, MyApp.Cldr.Locale.new!("zh-Hant-HK")}

# but ......
iex(2)> {:ok, locale} = MyApp.Cldr.AcceptLanguage.best_match("zh-Hant")
{:ok, MyApp.Cldr.Locale.new!("zh-Hant-TW")}

Which I will definitely look at. The first example is what I expected. The second example is not. I think the second example should also best match to zh-Hant-HK, which I understand is your expectation too. And that should then match your gettext locale of zh_HK.

A lot of this complexity comes from having to do two kinds of matches:

  1. Find the best fit configured CLDR locale
  2. Then find its best fitting Gettext locale

Next steps

  1. I think your kind efforts have identified that at least best_match/1 should resolve zh-Hant to zh-Hant-HK when zh-Hant-HK is configured in ex_cldr. And then that should resolve to zh_HK in your gettext backend.
  2. I will also revisit the matching I'm using to find a gettext locale. I think it should be possible to match CLDR's en to your gettext en_US given that the derived territory for en is US.

It's a horrible hour in my zone now, so give me a few hours' sleep and I'll dig into this in my morning and resolve it one way or another as quickly as I can.

kipcole9 avatar Oct 07 '25 17:10 kipcole9

TL;DR (but do please read): I can do as you'd like and best match zh-Hant to zh-Hant-HK, but I'm not sure it's a good idea or that it's sustainable, and it definitely won't resolve that way in the upcoming localize library in 2026.

I've been thinking about this all morning, especially what the right best match is for zh-Hant. In CLDR, that is unambiguously zh-Hant-TW because without any hinting, the territory with the largest number of native zh-Hant speakers is TW.

This becomes more important next year when I launch localize, which is basically "ex_cldr version 3.0". It will have no concept of locale configuration or backends. All CLDR locales will always be available and they'll be dynamically loaded on demand.

Therefore, if I apply hinting now so that zh-Hant will best match zh-Hant-HK if it's configured, that hinting won't be useful in the future. Perhaps you'll reasonably say you don't care about that for now - you just need a solution now!

I do have enough data to be able to apply a hint based upon configuration. There is data (not very reliable, but good enough for this) to know which languages have which primary scripts and which territories those apply to. So I can take zh-Hant and, through the data returned by Cldr.Config.language_data/0, check whether any of the territories listed as :primary is configured and then use that for the best match.

However, as I mentioned before, the data behind Cldr.Config.language_data/0 is brittle, and actually the key territory data is removed in CLDR 48 (coming end of this month). I can still derive enough for your requirement but it's brittle. And when localize comes out, this configuration hinting won't work since all locales are always available and zh-Hant will always best match to zh-Hant-TW.

kipcole9 avatar Oct 08 '25 02:10 kipcole9

Yep, but under BCP47 I think that zh-Hant is still valid as traditional Chinese script, whose nearest match should be zh-HK.

zh-Hant is definitely a valid locale name and there is a CLDR locale called zh-Hant. If you configure zh-Hant-HK in your ex_cldr backend you will then see:

iex(1)> {:ok, locale} = MyApp.Cldr.AcceptLanguage.best_match("zh-HK")
{:ok, MyApp.Cldr.Locale.new!("zh-Hant-HK")}

but ......

iex(2)> {:ok, locale} = MyApp.Cldr.AcceptLanguage.best_match("zh-Hant")
{:ok, MyApp.Cldr.Locale.new!("zh-Hant-TW")}

Which I will definitely look at. The first example is what I expected. The second example is not. I think the second example should also best match to zh-Hant-HK, which I understand is your expectation too. And that should then match your gettext locale of zh_HK.

Yep 👍 that's what we would expect

A lot of this complexity comes from having to do two kinds of matches:

1. Find the best fit configured CLDR locale

2. Then find its best fitting Gettext locale

It is a tricky challenge indeed.

Next steps

1. I think your kind efforts have identified that at least `best_match/1` should resolve `zh-Hant` to `zh-Hant-HK` when `zh-Hant-HK` is configured in `ex_cldr`. And then that should resolve to `zh_HK` in your gettext backend.

2. I will also revisit the matching I'm using to find a gettext locale. I think it should be possible to match CLDR's `en` to your gettext `en_US` given that the derived territory for `en` is `US`.

It's a horrible hour in my zone now, so give me a few hours' sleep and I'll dig into this in my morning and resolve it one way or another as quickly as I can.

hugomorg avatar Oct 09 '25 09:10 hugomorg

TL;DR (but do please read): I can do as you'd like and best match zh-Hant to zh-Hant-HK, but I'm not sure it's a good idea or that it's sustainable, and it definitely won't resolve that way in the upcoming localize library in 2026.

I've been thinking about this all morning, especially what the right best match is for zh-Hant. In CLDR, that is unambiguously zh-Hant-TW because without any hinting, the territory with the largest number of native zh-Hant speakers is TW.

This becomes more important next year when I launch localize, which is basically "ex_cldr version 3.0". It will have no concept of locale configuration or backends. All CLDR locales will always be available and they'll be dynamically loaded on demand.

Oh nice, looking forward to it 🙌 !

Therefore, if I apply hinting now so that zh-Hant will best match zh-Hant-HK if it's configured, that hinting won't be useful in the future. Perhaps you'll reasonably say you don't care about that for now - you just need a solution now!

I do have enough data to be able to apply a hint based upon configuration. There is data (not very reliable, but good enough for this) to know which languages have which primary scripts and which territories those apply to. So I can take zh-Hant and, through the data returned by Cldr.Config.language_data/0, check whether any of the territories listed as :primary is configured and then use that for the best match.

However, as I mentioned before, the data behind Cldr.Config.language_data/0 is brittle, and actually the key territory data is removed in CLDR 48 (coming end of this month). I can still derive enough for your requirement but it's brittle. And when localize comes out, this configuration hinting won't work since all locales are always available and zh-Hant will always best match to zh-Hant-TW.

So if I understand correctly, even with localize we are unlikely to find the mapping we need here via cldr between BCP47 locales and gettext keys? We would need to either modify our translation file names or adjust the incoming locales?

hugomorg avatar Oct 09 '25 10:10 hugomorg

@hugomorg I've been reflecting on this for a while and I think there is a reasonable compromise I can implement.

  1. When resolving CLDR locales, and doing a best match, ex_cldr should continue to use CLDR best match rules. That means a best match for zh-Hant will resolve to zh-Hant-TW for the CLDR locale.
  2. However, when matching to a gettext locale, I can be more liberal and zh-Hant-TW can aim to best match with a gettext locale of zh_HK. The only tricky thing is to work out how the data can make that work. I think the mapping data can be resolved in CLDR but I need to work on that.

For me, the good news is that this approach remains compliant with CLDR's spec while at the same time still being able to match more liberally with Gettext locales.

You've mentioned a couple of times that a best match for zh-Hant should be possible to zh-HK (really, zh-Hant-HK I'd say, but Hant is the primary script for HK in CLDR, as you'd expect).

Do you have a reference for that? I haven't come across that path in the CLDR spec. And it may just be that my brain is fried from decoding TR35 for the last 8 years!

If this approach is ok with you I should be able to get something testable done by the weekend if not before.

kipcole9 avatar Oct 09 '25 10:10 kipcole9

@hugomorg I've been reflecting on this for a while and I think there is a reasonable compromise I can implement.

1. When resolving CLDR locales, and doing a best match, `ex_cldr` should continue to use CLDR best match rules. That means a best match for `zh-Hant` will resolve to `zh-Hant-TW` for the **CLDR** locale.

2. However, when matching to a gettext locale, I can be more liberal and `zh-Hant-TW` can aim to best match with a gettext locale of `zh_HK`. The only tricky thing is to work out how the data can make that work. I think the mapping data can be resolved in CLDR but I need to work on that.

For me, the good news is that this approach remains compliant with CLDR's spec while at the same time still being able to match more liberally with Gettext locales.

I think I may be misunderstanding then.

I thought that the BCP47 standard was more compatible with CLDR than seems to be the case based on what you are suggesting (and no doubt you know more about these standards than I)!

You've mentioned a couple of times that a best match for zh-Hant should be possible to zh-HK (really, zh-Hant-HK I'd say, but Hant is the primary script for HK in CLDR, as you'd expect).

Do you have a reference for that? I haven't come across that path in the CLDR spec. And it may just be that my brain is fried from decoding TR35 for the last 8 years!

If this approach is ok with you I should be able to get something testable done by the weekend if not before.

Thanks for considering a change here. In our case, I thought that zh-HK and es-US would be regarded as valid locale ids (at least according to BCP47) due to the {language_code}-{region_code} syntax. And I also assumed zh-Hant should essentially be equivalent to zh-TW, but the nearest neighbour after that should be zh-HK. I'm curious, is there any particular algorithm you are using to match these under the hood?

hugomorg avatar Oct 09 '25 13:10 hugomorg

I thought that the BCP47 standard was more compatible with CLDR than seems to be the case based on what you are suggesting (and no doubt you know more about these standards than I)!

Definitely compatible. CLDR is mostly compliant with BCP 47 locale IDs. And your locale names are definitely BCP 47 compliant and compatible with CLDR. No issue there. The full description of conformance is here. BTW, my understanding is that BCP 47 Locale IDs use - not _ but ex_cldr doesn't care which one you use.

Cldr.AcceptLanguage.best_match/1 is primarily focused on returning a language tag that most closely matches a CLDR locale configured in the system. To do that, the process is roughly:

  1. Canonicalise each of the potential locale IDs from the Accept-Language header using Cldr.Locale.canonical_language_tag/2. The implementation mostly follows this. The overall process involves parsing, resolving aliases and applying likely subtags. I'm pretty confident this code is correct (per the spec) given it passes the conformance tests - nearly 2_000 of them.
  2. If the resulting language tag has the field :cldr_locale_name filled in, then it becomes a candidate to be chosen.
  3. Sort the candidates by their q score and pick the one with the highest score.
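The q-score ordering in step 3 can be sketched in plain Elixir. This is only an illustration of the header format and the sort; `AcceptLanguageSketch` is a hypothetical module, and it skips the canonicalisation and :cldr_locale_name filtering that ex_cldr performs in steps 1 and 2.

```elixir
defmodule AcceptLanguageSketch do
  # Hypothetical module for illustration; not how ex_cldr parses the header.

  @doc "Parses an Accept-Language header into {tag, q} pairs, highest q first."
  def parse(header) do
    header
    |> String.split(",", trim: true)
    |> Enum.map(&parse_entry/1)
    |> Enum.sort_by(fn {_tag, q} -> q end, :desc)
  end

  defp parse_entry(entry) do
    [tag | params] = entry |> String.split(";") |> Enum.map(&String.trim/1)
    # A missing q parameter means a quality of 1.0
    {tag, Enum.find_value(params, 1.0, &parse_q/1)}
  end

  defp parse_q("q=" <> value) do
    case Float.parse(value) do
      {q, _rest} -> q
      :error -> nil
    end
  end

  defp parse_q(_other), do: nil
end

IO.inspect(AcceptLanguageSketch.parse("en;q=0.5, zh-Hant, es-US;q=0.8"))
```

The call prints [{"zh-Hant", 1.0}, {"es-US", 0.8}, {"en", 0.5}]. In the real flow each tag would be canonicalised and dropped if it has no :cldr_locale_name before the highest-scoring candidate is picked.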

The next question is how the :cldr_locale_name gets filled in. The primary process follows Language Matching in TR35. But the implementation does take a few shortcuts and this conversation will definitely prompt me to revisit them. Nevertheless, from a CLDR perspective, the language matching is quite robust in finding the most appropriate :cldr_locale_name for a given locale ID.

Then lastly, the question is how we link to a Gettext locale ID. Not surprisingly this isn't a CLDR concern and the implementation is logical but ad hoc. Basically it does a reductive check of the combinations of language, script and region and checks those against the Gettext locale names. That's why I believe there is room to improve and be able to match a :cldr_locale_name of zh-Hant to a gettext locale of zh_HK, and en to en_US. I've worked on a PoC for that tonight and I think I can have something for you to test on Saturday (I won't have much time on Friday to work on this).
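One way to picture a reductive check of language, script and region combinations is the sketch below. The module `GettextCandidates` is hypothetical, and the order in which combinations are tried is an assumption made for illustration; the actual ex_cldr implementation may differ.

```elixir
defmodule GettextCandidates do
  # Hypothetical sketch; the candidate order below is an assumption,
  # not taken from the ex_cldr source.

  @doc "Builds candidate Gettext names from most to least specific."
  def candidates(language, script, territory) do
    [
      [language, script, territory],
      [language, territory],
      [language, script],
      [language]
    ]
    |> Enum.map(fn parts -> parts |> Enum.reject(&is_nil/1) |> Enum.join("_") end)
    |> Enum.uniq()
  end

  @doc "Returns the first candidate that is a known Gettext locale, or nil."
  def match(language, script, territory, gettext_locales) do
    Enum.find(candidates(language, script, territory), &(&1 in gettext_locales))
  end
end

gettext_locales = ["es_ES", "en_US", "zh_CN", "zh_HK"]
IO.inspect(GettextCandidates.match("zh", "Hans", "CN", gettext_locales))
IO.inspect(GettextCandidates.match("zh", "Hant", "TW", gettext_locales))
```

Under this assumed order, zh/Hans/CN finds "zh_CN" while zh/Hant/TW finds nothing in that list and returns nil. A purely reductive check never moves sideways to a different territory such as HK, which is where a distance-based match can do better.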

kipcole9 avatar Oct 09 '25 15:10 kipcole9

I spent the weekend mapping out how to do this and I've got a good plan. And I found some good test data trolling the CLDR repo. I just need a few days to implement - I'm confident it will resolve this issue in a way that works well for your use case.

kipcole9 avatar Oct 13 '25 11:10 kipcole9

I spent the weekend mapping out how to do this and I've got a good plan. And I found some good test data trolling the CLDR repo. I just need a few days to implement - I'm confident it will resolve this issue in a way that works well for your use case.

Hi @kipcole9 apologies for the delay, I was off for a few days. Thank you for continuing to look into this.

Would this also help with those "non-conventional" locales, e.g. mapping es-US to es-ES (if only the latter has been registered as a locale)?

hugomorg avatar Oct 15 '25 14:10 hugomorg

Apologies for the delay - I was stuck finishing up some significant work on ex_cldr_dates_times which is now done.

I have pushed two commits (e1d175c0e038a81de158d11b936292eacae5e391 and e655ef6296bcb391e06327cb8dd9138ee54c1fb5) that implement the CLDR language matching algorithm and it is returning (as expected) much better results. For example:

iex> Cldr.Locale.Match.best_match("zh-HK",
...>   supported: ["zh", "zh-Hans", "zh-Hant", "en", "fr", "en-Hant"])
[{"zh-Hant", 10}, {"zh", 59}, {"zh-Hans", 59}, {"en-Hant", 89}]

It also improves locale matching for your es-US versus es-ES question:

iex> Cldr.Locale.Match.best_match "es-US", supported: ["es-ES", "es-MX", "es-AR"]
[{"es-MX", 9}, {"es-AR", 9}, {"es-ES", 10}]

Note that in this example, es-MX matches more closely with es-US than es-ES does. This is because the varieties of Spanish spoken in the Americas have greater affinity with each other than with the Spanish spoken in Europe.

You'll see similar affinity in the English variations too:

iex> Cldr.Locale.Match.best_match "en-AU", supported: ["en-CA", "en", "en-GB"]
[{"en-GB", 8}, {"en-CA", 10}, {"en", 10}]

Here, en-GB is considered a closer match than en (meaning en-US).
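Reading these results, a lower number means a closer match. Assuming the list-of-pairs shape shown in the examples above (which may change before release), picking the closest supported locale could look like the sketch below. `ClosestMatch` and its `max_distance` threshold of 50 are hypothetical and chosen only for illustration.

```elixir
defmodule ClosestMatch do
  # Hypothetical helper; the default threshold is an illustrative value,
  # not one taken from ex_cldr or the CLDR specification.

  @doc "Returns the {locale, distance} pair with the smallest distance, or nil."
  def pick(matches, max_distance \\ 50) do
    case Enum.filter(matches, fn {_locale, distance} -> distance <= max_distance end) do
      [] -> nil
      within -> Enum.min_by(within, fn {_locale, distance} -> distance end)
    end
  end
end

# Pairs copied from the es-US and en-AU examples above
IO.inspect(ClosestMatch.pick([{"es-MX", 9}, {"es-AR", 9}, {"es-ES", 10}]))
IO.inspect(ClosestMatch.pick([{"en-GB", 8}, {"en-CA", 10}, {"en", 10}]))
```

This prints {"es-MX", 9} and {"en-GB", 8}. When two locales tie, as es-MX and es-AR do, the first one in the list wins.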

I have a small amount of work still to go on this but it's close now. I will post here when I have a version you can try.

kipcole9 avatar Oct 26 '25 04:10 kipcole9

Primarily I need to apply this matching to the gettext_locale_name field in a language tag and finalise some work on what CLDR calls "paradigm locales" which I don't fully understand yet.

kipcole9 avatar Oct 26 '25 04:10 kipcole9

If you're up for an early test, I've pushed a release candidate to GitHub. You can configure it by:

{:ex_cldr, github: "elixir-cldr/cldr48"}

You should find that the :gettext_locale_name is populated as you expect. If not - it's a bug on my side.

Note that other ex_cldr_* libraries need updating to work with this version of ex_cldr. In particular current hex ex_cldr_dates_times most likely won't work. There are a lot of updates coming to various libs to support CLDR48. All of which should be done in the next week.

kipcole9 avatar Oct 27 '25 20:10 kipcole9

I should have noted there are still 12 test cases on locale matching that are failing (out of 125 or so) so I still have some work to do on the implementation but I believe the current code covers the configuration you described for your use case.

kipcole9 avatar Oct 27 '25 21:10 kipcole9

Hey @kipcole9, thank you. This already looks like a big difference for us! I will do more testing but results below look solid.

I noticed that the underscore has turned into a hyphen but this is minor.

Setup:

gettext_locale_name = fn locale ->
  {:ok, tag} = MarketplaceSearch.Cldr.AcceptLanguage.best_match(locale)
  tag.gettext_locale_name
end

Enum.each(["es-ES", "es-US", "en-AU", "zh-Hans", "zh-Hant"], fn locale ->
  IO.puts("#{locale} -> #{gettext_locale_name.(locale) || "nil"}")
end)

Before:

es-ES -> es_ES
es-US -> nil
en-AU -> en
zh-Hans -> zh_CN
zh-Hant -> nil

After:

es-ES -> es-ES
es-US -> es-ES
en-AU -> en-GB
zh-Hans -> zh-CN
zh-Hant -> zh-HK

hugomorg avatar Oct 28 '25 10:10 hugomorg

Thanks for the feedback, glad it's shaping up. I made an effort to return a locale name that is in the same format as the requested one, so I'll look back into that - :gettext_locale_name needs to be the same name as gettext knows it, for obvious reasons. Thanks for pointing that out. I'll have another spin ready in my early morning (UTC+11).

kipcole9 avatar Oct 28 '25 11:10 kipcole9

I've pushed an update to https://github.com/elixir-cldr/cldr48 that is, I hope, the final release candidate of ex_cldr version 2.44.0. If you have a chance, would you mind running mix deps.update ex_cldr and testing one more time?

  • :gettext_locale_names are now exactly as they are named on disk
  • Cldr.Locale.Match.best_match/2 is passing 125 tests and failing on 4. These need further investigation but they are very much edge cases that shouldn't prevent publishing the new version.
  • A bug in locale resolution has been fixed for locales with the language code und. This is not a language code that would occur in normal use.

I aim to get new versions of the several ex_cldr_* libs that are updated for CLDR 48 published by Monday 3rd.

kipcole9 avatar Oct 30 '25 12:10 kipcole9

Hey @kipcole9, with {:ex_cldr, github: "elixir-cldr/cldr48"} I'm now getting an error. I haven't changed any code or translations.

** (Cldr.UnknownLocaleError) Failed to install the locale named "bg-BG". The locale name is not known.
    (ex_cldr 2.44.0-rc.6) lib/cldr/install.ex:93: Cldr.Install.do_install_locale_name/3
    (elixir 1.16.2) lib/enum.ex:987: Enum."-each/2-lists^foreach/1-0-"/2
    (ex_cldr 2.44.0-rc.6) lib/cldr/install.ex:29: Cldr.Install.install_known_locale_names/1
    (ex_cldr 2.44.0-rc.6) lib/cldr.ex:102: Cldr.install_locales/1
    (ex_cldr 2.44.0-rc.6) expanding macro: Cldr.Backend.Compiler.__before_compile__/1
    lib/marketplace_search/cldr.ex:1: MarketplaceSearch.Cldr (module)

hugomorg avatar Oct 31 '25 17:10 hugomorg

@hugomorg apologies, that's not great - and not expected. I'm not even sure where bg-BG comes from because your gettext locale is bg_BG, I assume? Working on this now.

kipcole9 avatar Oct 31 '25 17:10 kipcole9

@kipcole9 thanks for the quick reply.

The hyphen was appearing so I could transform the gettext names:

defmodule MarketplaceSearch.Cldr do
  @moduledoc """
  CLDR configuration module for MarketplaceSearch.
  """

  use Cldr,
    otp_app: :marketplace_search,
    gettext: MarketplaceSearch.Gettext,
    providers: [],
    # Using locales with a hyphen works with CLDR, but when CLDR maps them back to a gettext locale,
    # it uses the underscore as a separator. Meanwhile, underscored locales passed straight to CLDR
    # don't work. So we need to do the transform here.
    locales:
      MarketplaceSearch.Gettext
      |> Gettext.known_locales()
      |> Enum.map(&String.replace(&1, "_", "-"))
end

But when I comment out the replace, I get a similar error:

== Compilation error in file lib/marketplace_search/cldr.ex ==
** (Cldr.UnknownLocaleError) Failed to install the locale named :bg_BG. The locale name is not known.
    (ex_cldr 2.44.0-rc.6) lib/cldr/install.ex:93: Cldr.Install.do_install_locale_name/3
    (elixir 1.16.2) lib/enum.ex:987: Enum."-each/2-lists^foreach/1-0-"/2
    (ex_cldr 2.44.0-rc.6) lib/cldr/install.ex:29: Cldr.Install.install_known_locale_names/1
    (ex_cldr 2.44.0-rc.6) lib/cldr.ex:102: Cldr.install_locales/1
    (ex_cldr 2.44.0-rc.6) expanding macro: Cldr.Backend.Compiler.__before_compile__/1
    lib/marketplace_search/cldr.ex:1: MarketplaceSearch.Cldr (module)

hugomorg avatar Oct 31 '25 17:10 hugomorg

Thanks for the clarity, that helps.

I think the actual issue is that I have overlooked matching gettext locale names to CLDR locale names. Meaning there is no 'bg-BG' in CLDR, just bg. And I haven't added the code to more flexibly match to a CLDR locale name.

And now I'm wondering how it ever worked with a bg_BG locale. Did you ever get a message similar to:

The locale bg_BG is configured in the gettext backend but is unknown to CLDR. ......

kipcole9 avatar Oct 31 '25 17:10 kipcole9

Yep, saw plenty of those warnings :)

Compiling lib/marketplace_search/gettext.ex (it's taking more than 10s)
note: The locales ["bg_BG", "cs_CZ", "da_DK", "de_DE", "el_GR", "en_US", "es_ES", "fi_FI", "fr_FR", "hr_HR", "hu_HU", "id_ID", "it_IT", "ja_JP", "lt_LT", "ms_MY", "nb_NO", "nl_NL", "pl_PL", "pt_BR", "ro_RO", "ru_RU", "sl_SI", "sv_SE", "th_TH", "uk_UA", "vi_VN", "zh_CN", "zh_HK"] are configured in the MarketplaceSearch.Gettext gettext backend but are unknown to CLDR. They will not be used to configure CLDR but they will still be used to match CLDR locales to Gettext locales at runtime

hugomorg avatar Oct 31 '25 17:10 hugomorg

I've pushed a commit and updated the elixir-cldr/cldr48 repo to include code that now better matches gettext locale names to CLDR locale names. It uses a simple method (for now) of just suffix stripping to find a match. I think that works for your use case (and others). It means bg_BG will configure :bg in CLDR and en_US will configure :en. Etc etc. I added additional testing for this as well.

You've been very patient with this which is greatly appreciated. This release will definitely be more solid in several areas as a result of your collaboration.

Would you mind updating one more time and confirming whether it's OK (or not!)?

kipcole9 avatar Oct 31 '25 18:10 kipcole9

With this last commit the contract now is:

  1. :gettext_locale_name is the same as the name on disk - whatever that was.
  2. Gettext locale names will be matched to a CLDR locale name by repeated suffix stripping to try and find a match.
  3. CLDR locale names will be matched to a Gettext locale name by using the new Cldr.Locale.Match.best_match/2 function.
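Item 2 of the contract above, repeated suffix stripping, can be sketched as follows. `SuffixStrip` is a hypothetical module written for this illustration and is not the ex_cldr implementation; it uses strings where ex_cldr uses atoms for locale names.

```elixir
defmodule SuffixStrip do
  # Hypothetical sketch of repeated suffix stripping; not ex_cldr's code.

  @doc "Maps a Gettext locale name to a known CLDR locale name, or nil."
  def to_cldr_name(gettext_name, cldr_names) do
    gettext_name
    |> String.replace("_", "-")
    |> String.split("-")
    |> strip(cldr_names)
  end

  # Nothing left to strip: no CLDR locale matches
  defp strip([], _cldr_names), do: nil

  defp strip(parts, cldr_names) do
    name = Enum.join(parts, "-")

    if name in cldr_names do
      name
    else
      # Drop the last subtag and try again
      parts |> Enum.drop(-1) |> strip(cldr_names)
    end
  end
end

cldr_names = ["bg", "en", "en-CA", "en-GB", "zh"]
IO.inspect(SuffixStrip.to_cldr_name("bg_BG", cldr_names))
IO.inspect(SuffixStrip.to_cldr_name("en_CA", cldr_names))
```

This prints "bg" and "en-CA": bg_BG loses its territory before it matches, while en_CA matches as soon as the separator is normalised.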

In the future, Cldr.Locale.Match.best_match/2 will be used for all locale matching. Probably not until localize 0.1.0 in the new year.

kipcole9 avatar Oct 31 '25 18:10 kipcole9

Hey @kipcole9 sorry for the delay - I was away for a while. Will test this early next week.

hugomorg avatar Nov 14 '25 16:11 hugomorg

Hey @kipcole9, repeated my test above and it's looking good:

es-ES -> es_ES
es-US -> es_ES
en-AU -> en_GB
zh-Hans -> zh_CN
zh-Hant -> zh_HK

One more question: is there a reason why "en-US" resolves to "en" instead of "en-US"? This doesn't seem to be the case for other locales.

iex(18)> gettext_locale_name.("en-US")
"en"

iex(20)> gettext_locale_name.("en-CA")
"en_CA"

iex(23)> Enum.all?(["en_US", "en_CA"], & &1 in Gettext.known_locales(MarketplaceSearch.Gettext))
true

hugomorg avatar Nov 17 '25 10:11 hugomorg

In CLDR, en and en-US are synonymous (en is expanded to en-US). But due to some special handling of paradigm locales, en will be preferred over en-US.

However, I think you would only see this as an issue if you have both en and en_US gettext locales? You can experiment using Cldr.Locale.Match.best_match/2 like this:

# Does en-US match with the gettext locale `en_US`? Yes.
iex> Cldr.Locale.Match.best_match "en-US", supported: ["en_US"]
{:ok, "en_US", 0}

# What if we support both `en` and `en_US`? `en` will win, even though they
# both match.
iex> Cldr.Locale.Match.best_match "en-US", supported: ["en", "en_US"]
{:ok, "en", 0}

Is there some chance you have both en and en_US gettext locales?

kipcole9 avatar Nov 18 '25 08:11 kipcole9