icu4x
Datagen options around missing locales
tl;dr, what should we do when a user tries to export a locale from datagen that isn't in CLDR?
At first thought, it seems that we should inform the client by failing datagen. However, this is more nuanced. Some caveats:
1. Not all keys support the same set of locales.
2. Not all keys have data at the root locale (example: collator/reord@1).
3. Some keys require either an extension (example: datetime/skeletons@1) or soon an auxiliary key.
Because of caveats 2 and 3, we cannot always simply run fallback and fill in the data based on fallback.
@Manishearth suggested tagging data keys that aren't expected to fall back to root with some extra metadata. This fixes caveats 2 and 3 but not 1.
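To make the tagging idea concrete, here is a minimal sketch. All names in it (`RootFallback`, `KeyInfo`, `may_fill_from_root`, and the `example/key@1` path in the usage note) are hypothetical and are not the existing ICU4X key metadata API; it only illustrates how a per-key flag could tell datagen whether filling from root is acceptable.

```rust
/// Hypothetical per-key tag; NOT the actual ICU4X data key metadata.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RootFallback {
    /// Falling back to the root locale is normal for this key.
    Expected,
    /// The key has no meaningful root data (caveat 2) or needs an
    /// extension / auxiliary key (caveat 3).
    NotExpected,
}

/// Hypothetical description of a data key, for illustration only.
#[derive(Debug, Clone, Copy)]
pub struct KeyInfo {
    pub path: &'static str,
    pub root_fallback: RootFallback,
}

/// Returns true if datagen could fill a missing locale from root for this
/// key without treating it as an error.
pub fn may_fill_from_root(key: &KeyInfo) -> bool {
    key.root_fallback == RootFallback::Expected
}
```

With this, a key such as collator/reord@1 would be tagged `NotExpected`, while a made-up key like `example/key@1` tagged `Expected` could be filled from root. As noted, this does nothing for caveat 1.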
Related questions:
- Should the behavior depend on whether there was an explicit locale or whether the set of locales came from a CLDR set?
- Should the behavior depend on the fallback mode (i.e., should Precomputed be stricter than Hybrid)?
CC @robertbastian
- @zbraniecki - My gut feeling is that we should fail loudly only when we don't have a fallback. If there's a fallback, we could print a warning. We could build a tool that generates a list of what is there and what needs fallback and what is missing.
- @sffc - Maybe we should make a flag that changes this behavior. For the time being, retain 1.2 behavior of failing silently.
Conclusion: retain 1.2 behavior for the time being; print a log statement for the error cases; revisit in 2.0
Proposal:
- Continue printing a warning in datagen when a requested language falls back to und@ro in an unexpected way
- Do NOT retain the base language if the language is not in CLDR
LGTM: @robertbastian @sffc
Discussion:
- @zbraniecki - I think the datagen API should return me a list of locales it failed to generate. OK for the CLI to print warnings.
Conclusion: @sffc/@robertbastian to design an API for this.
First thing we need is a clear definition of what it means when we say "failed to generate". The main thing is that all data can fall back to root, and this is the expected behavior in many cases.
I might propose the following definition: "the requested langid has no ancestors that are in the list in availableLocales.json". Unfortunately this definition only works with DatagenProvider as a data source.
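A rough sketch of that definition, under stated assumptions: langids are plain strings, the "ancestors" are found by naively dropping the last subtag (real CLDR fallback also has parent-locale rules, which this ignores), the langid itself counts as part of its own chain, and `available` stands in for the contents of availableLocales.json. The function names are made up for illustration.

```rust
use std::collections::BTreeSet;

/// Returns the langid followed by each shorter prefix obtained by dropping
/// the last `-`-separated subtag. The root locale is never included.
/// Naive on purpose: CLDR parent-locale overrides are not modeled.
pub fn naive_chain(langid: &str) -> Vec<&str> {
    let mut chain = vec![langid];
    let mut current = langid;
    while let Some(idx) = current.rfind('-') {
        current = &current[..idx];
        chain.push(current);
    }
    chain
}

/// "Failed to generate" under the proposed definition: neither the langid
/// nor any of its (naive) ancestors appears in the available set.
pub fn failed_to_generate(langid: &str, available: &BTreeSet<&str>) -> bool {
    !naive_chain(langid)
        .iter()
        .any(|candidate| available.contains(candidate))
}
```

Because root is never part of the chain, a langid that can only reach root is reported as failed, which matches the intent that falling back to root alone should not count as having data.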
A cleaner definition might be to just require that source providers in DatagenDriver return non-und data for all languages they support (i.e. RetainBaseLanguages-like behavior), and we can make sure DatagenProvider does this.
Once we decide on the definition, for the API, I think DatagenDriver::export should return a struct such as
#[non_exhaustive]
pub struct ExportResult {
    pub missing_locales: Vec<LanguageIdentifier>,
}
On the CLI, we just send the list through log::info!.
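To illustrate the shape of this, a dependency-free sketch follows. Assumptions: `LanguageIdentifier` is aliased to `String` as a stand-in for the real type, `export` is a free function standing in for `DatagenDriver::export` with an invented signature, and `cli_report` is a hypothetical helper that builds the message the CLI would pass to `log::info!`.

```rust
/// Stand-in for the real `LanguageIdentifier` type so the sketch compiles
/// without any ICU4X crates.
pub type LanguageIdentifier = String;

#[non_exhaustive]
#[derive(Debug, Default)]
pub struct ExportResult {
    pub missing_locales: Vec<LanguageIdentifier>,
}

/// Hypothetical stand-in for `DatagenDriver::export`: reports every
/// requested locale that the source does not support.
pub fn export(requested: &[&str], supported: &[&str]) -> ExportResult {
    let missing_locales = requested
        .iter()
        .filter(|locale| !supported.contains(*locale))
        .map(|locale| (*locale).to_owned())
        .collect();
    ExportResult { missing_locales }
}

/// Builds the line the CLI would log, or `None` if nothing is missing.
pub fn cli_report(result: &ExportResult) -> Option<String> {
    if result.missing_locales.is_empty() {
        None
    } else {
        Some(format!(
            "missing locales: {}",
            result.missing_locales.join(", ")
        ))
    }
}
```

Library callers would inspect `missing_locales` directly and decide for themselves whether to fail, while the CLI only logs. Marking the struct `#[non_exhaustive]` leaves room to add fields later without breaking downstream code.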
Feedback? @robertbastian @zbraniecki
Unrelated to missing locales, but I have another use case for ExportResult: returning the crates that a baked exporter needs. This is currently only logged.
Proposal:
- In 2.0, add non_exhaustive ExportResult and return it from datagen driver
- Punt the rest of this issue until we have clearer requirements on the use case
LGTM: @sffc @robertbastian