icu4x icon indicating copy to clipboard operation
icu4x copied to clipboard

Datagen options around missing locales

Open sffc opened this issue 2 years ago • 5 comments

tl;dr, what should we do when a user tries to export a locale from datagen that isn't in CLDR?

At first thought, it seems that we should inform the client by failing datagen. However, this is more nuanced. Some caveats:

  1. Not all keys support the same set of locales.
  2. Not all keys have data at the root locale (example: collator/reord@1).
  3. Some keys require either an extension (example: datetime/skeletons@1) or soon an auxiliary key.

Because of caveats 2 and 3, we cannot always simply run fallback and fill in the data based on fallback.

@Manishearth suggested tagging data keys that aren't expected to fall back to root with some extra metadata. This fixes caveats 2 and 3 but not 1.

Related questions:

  1. Should the behavior depend on whether there was an explicit locale or whether the set of locales came from a CLDR set?
  2. Should the behavior depend on the fallback mode (i.e., should Precomputed be stricter than Hybrid)?

CC @robertbastian

sffc avatar Jul 27 '23 05:07 sffc

  • @zbraniecki - My gut feeling is that we should fail loudly only when we don't have a fallback. If there's a fallback, we could print a warning. We could build a tool that generates a list of what is there and what needs fallback and what is missing.
  • @sffc - Maybe we should make a flag that changes this behavior. For the time being, retain 1.2 behavior of failing silently.

Conclusion: retain 1.2 behavior for the time being; print a log statement for the error cases; revisit in 2.0

sffc avatar Jul 27 '23 18:07 sffc

Proposal:

  1. Continue printing a warning in datagen when a request language falls back to und@ro in an unexpected way
  2. Do NOT retain the base language if the language is not in CLDR

LGTM: @robertbastian @sffc

Discussion:

  • @zbraniecki - I think the datagen API should return me a list of locales it failed to generate. OK for the CLI to print warnings.

Conclusion: @sffc/@robertbastian to design an API for this.

Manishearth avatar Mar 15 '24 09:03 Manishearth

First thing we need is a clear definition of what it means when we say "failed to generate". The main thing is that all data can fall back to root, and this is the expected behavior in many cases.

I might propose the following definition: "the requested langid has no ancestors that are in the list in availableLocales.json". Unfortunately this definition only works with DatagenProvider as a data source.

A cleaner definition might be to just require that source providers in DatagenDriver return non-und data for all languages they support (i.e. RetainBaseLanguages-like behavior), and we can make sure DatagenProvider does this.

Once we decide on the definition, for the API, I think DatagenDriver::export should return a struct such as

#[non_exhaustive]
pub struct ExportResult {
    pub missing_locales: Vec<LanguageIdentifier>
}

On the CLI, we just send the list through log::info!.

Feedback? @robertbastian @zbraniecki

sffc avatar May 30 '24 19:05 sffc

unrelated to missing locales, but I have another use case for ExportResult, returning the crates that a baked exporter needs. This is currently only logged.

robertbastian avatar Jun 25 '24 15:06 robertbastian

Proposal:

  • In 2.0, add non_exhaustive ExportResult and return it from datagen driver
  • Punt the rest of this issue until we have clearer requirements on the use case

LGTM: @sffc @robertbastian

sffc avatar Jun 25 '24 15:06 sffc