Transliterator datagen should allow for slicing individual baked data transliterators
Related: https://github.com/unicode-org/icu4x/issues/3966
Currently with Transliterator, all transliterators are under the same data key, as different und-t-blah locales. This is hard to slice; it basically requires users to manually run datagen to get any slicing.
For blob data I'm not too worried about that: it would be nice to still have ways to slice that (https://github.com/unicode-org/icu4x/issues/3966), but I'm okay with people performing some manual slicing here, because automatic slicing would potentially have to parse the transliterators themselves[^1].
But for baked data, this is not great.
I think we can structure transliterator baked data somewhat differently: datagen can produce the following:
    // One constant per transliterator, so each can be referenced directly.
    const DATA_TRANSLITERATOR_LATIN_HAN = ...;
    const DATA_TRANSLITERATOR_LATIN_GREEK = ...;
    // The single data marker stays, with its values pointing at the constants above.
    const DATA_TRANSLITERATOR_RULES_V1: icu_provider_baked::zerotrie::Data<icu::experimental::transliterate::provider::TransliteratorRulesV1> = {
        const TRIE: _ = ...;
        const VALUES: _ = [DATA_TRANSLITERATOR_LATIN_GREEK, DATA_TRANSLITERATOR_LATIN_HAN, ...];
        ...
    };
    pub mod ctors {
        pub fn new_transliterator_latin_han() -> Transliterator {
            Transliterator::new_internal(DATA_TRANSLITERATOR_LATIN_HAN, ...)
        }
    }
Ideally, ::new_internal() would incorporate a solution to https://github.com/unicode-org/icu4x/issues/3966, so that you can pass in something like Transliterator::new_internal(DATA_TRANSLITERATOR_LATIN_HAN, TransliteratorDeps { casemapper: Some(CaseMapper::new()), normalizer: ..., ... })
The calling crate can then re-export the constructors with pub use ctors::* somewhere.
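A minimal, self-contained sketch of how the pieces above could fit together. Everything in it is a stand-in rather than ICU4X API: `RuleData`, `CaseMapper`, `Normalizer`, `TransliteratorDeps`, and `load_by_attribute` are hypothetical, a sorted slice replaces the zerotrie, and the `"latin-han"` / `"latin-greek"` keys and rule strings are placeholders, not real locale IDs or CLDR rules.

```rust
// Hypothetical stand-ins throughout; none of these are real ICU4X types or APIs.

/// Placeholder for one transliterator's baked rule data.
#[derive(Debug, PartialEq)]
pub struct RuleData {
    pub id: &'static str,
    pub rules: &'static str,
}

/// Placeholders for optional dependencies a transliterator might need.
pub struct CaseMapper;
pub struct Normalizer;

/// Dependencies are optional so a constructor only pulls in what it uses.
#[derive(Default)]
pub struct TransliteratorDeps {
    pub casemapper: Option<CaseMapper>,
    pub normalizer: Option<Normalizer>,
}

pub struct Transliterator {
    data: &'static RuleData,
    deps: TransliteratorDeps,
}

impl Transliterator {
    // Private: only the generated constructors below are meant to call this.
    fn new_internal(data: &'static RuleData, deps: TransliteratorDeps) -> Self {
        Self { data, deps }
    }
    pub fn id(&self) -> &'static str {
        self.data.id
    }
    pub fn has_casemapper(&self) -> bool {
        self.deps.casemapper.is_some()
    }
    pub fn has_normalizer(&self) -> bool {
        self.deps.normalizer.is_some()
    }
}

// One constant per transliterator, directly referenceable.
pub const DATA_TRANSLITERATOR_LATIN_GREEK: &RuleData = &RuleData {
    id: "latin-greek",
    rules: "<placeholder latin-greek rules>",
};
pub const DATA_TRANSLITERATOR_LATIN_HAN: &RuleData = &RuleData {
    id: "latin-han",
    rules: "<placeholder latin-han rules>",
};

// Stand-in for the single-marker lookup: a slice sorted by key instead of a zerotrie.
const DATA_TRANSLITERATOR_RULES_V1: &[(&str, &RuleData)] = &[
    ("latin-greek", DATA_TRANSLITERATOR_LATIN_GREEK),
    ("latin-han", DATA_TRANSLITERATOR_LATIN_HAN),
];

/// Attribute-based lookup over all baked transliterators.
pub fn load_by_attribute(attr: &str) -> Option<&'static RuleData> {
    DATA_TRANSLITERATOR_RULES_V1
        .binary_search_by(|(key, _)| key.cmp(&attr))
        .ok()
        .map(|i| DATA_TRANSLITERATOR_RULES_V1[i].1)
}

pub mod ctors {
    use super::*;

    /// Touches only the Latin-Han constant, not the full lookup table.
    pub fn new_transliterator_latin_han() -> Transliterator {
        Transliterator::new_internal(
            DATA_TRANSLITERATOR_LATIN_HAN,
            TransliteratorDeps {
                casemapper: Some(CaseMapper),
                ..Default::default()
            },
        )
    }
}

// The calling crate re-exports the generated constructors.
pub use ctors::*;
```

In this shape, code that only calls `new_transliterator_latin_han()` references a single constant. Whether the other constants are actually stripped from the binary would depend on `load_by_attribute` (or its real equivalent) not being reachable as well.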
[^1]: Maybe we can have a transliterator!() macro that embeds the transliterator string into the binary so that keyextract can pick it up and read it.
Another way to do this would be to have one key per transliterator, and also have an omnibus key that internally loads the correct data. We can also teach datagen to do that.
cc @sffc @robertbastian
I think I like the approach of pulling out all the singletons so that they can be directly referenced, but also keeping the single data marker with attributes. I don't think one-marker-per-transliterator will work for reasons we've discussed before, but most importantly because we should be able to load an arbitrary CLDR JSON with an arbitrary set of transliterators inside.
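To illustrate the last point, here is a rough sketch of why a single marker with attributes accommodates an arbitrary set of transliterators while still allowing pulled-out singletons. The trait `RulesProvider`, both provider types, the attribute strings, and the rule strings are all hypothetical and much simpler than the real provider traits.

```rust
use std::collections::HashMap;

/// Hypothetical single "marker": every transliterator is loaded through
/// the same interface and distinguished only by an attribute string.
pub trait RulesProvider {
    fn load(&self, attribute: &str) -> Option<&str>;
}

// Pulled-out singletons that baked constructors could reference directly.
pub const LATIN_GREEK_RULES: &str = "<placeholder latin-greek rules>";
pub const LATIN_HAN_RULES: &str = "<placeholder latin-han rules>";

/// Baked provider: the set of attributes is fixed at datagen time.
pub struct BakedProvider;

impl RulesProvider for BakedProvider {
    fn load(&self, attribute: &str) -> Option<&str> {
        match attribute {
            "latin-greek" => Some(LATIN_GREEK_RULES),
            "latin-han" => Some(LATIN_HAN_RULES),
            _ => None,
        }
    }
}

/// Runtime provider: the set of attributes is only known once the data
/// is loaded, e.g. from a user-supplied file. A marker-per-transliterator
/// design would have no marker to offer for these entries.
pub struct RuntimeProvider {
    rules: HashMap<String, String>,
}

impl RuntimeProvider {
    pub fn from_pairs<I>(pairs: I) -> Self
    where
        I: IntoIterator<Item = (String, String)>,
    {
        Self {
            rules: pairs.into_iter().collect(),
        }
    }
}

impl RulesProvider for RuntimeProvider {
    fn load(&self, attribute: &str) -> Option<&str> {
        self.rules.get(attribute).map(String::as_str)
    }
}

/// Consumer code is identical for both providers.
pub fn is_available(provider: &dyn RulesProvider, attribute: &str) -> bool {
    provider.load(attribute).is_some()
}
```

A transliterator that did not exist at compile time (the `"my-custom"` attribute used when exercising this sketch is made up) can be served by `RuntimeProvider` through the same `load` call, which is the property a one-marker-per-transliterator layout would lose.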