Transliterator datagen should allow for slicing individual baked data transliterators

Open Manishearth opened this issue 9 months ago • 3 comments

Related: https://github.com/unicode-org/icu4x/issues/3966

Currently with Transliterator, all transliterators are under the same data key, as different und-t-blah locales. This is hard to slice; it basically requires users to manually run datagen to get any slicing.

For blob data I'm not too worried about that: it would be nice to still have ways to slice that (https://github.com/unicode-org/icu4x/issues/3966), but I'm okay with people performing some manual slicing here, because automatic slicing would potentially have to parse the transliterators themselves[^1].
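To make the cost of automatic slicing concrete: if a transliterator can invoke other transliterators by ID, a slicer has to recover those references from the rules and keep the transitive closure of whatever the user asked for. A minimal sketch of the closure step, assuming the dependency table has already been extracted (the `closure` function and the table shape are hypothetical, not ICU4X API):

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Returns every transliterator ID that must be kept when slicing to `roots`.
/// `deps` maps an ID to the IDs it invokes; building this table is the part
/// that would require parsing the transliterators themselves.
pub fn closure<'a>(
    deps: &BTreeMap<&'a str, Vec<&'a str>>,
    roots: &[&'a str],
) -> BTreeSet<String> {
    let mut keep = BTreeSet::new();
    let mut stack: Vec<&'a str> = roots.to_vec();
    while let Some(id) = stack.pop() {
        // `insert` returns false for IDs already visited, so cycles terminate.
        if keep.insert(id.to_string()) {
            if let Some(children) = deps.get(id) {
                stack.extend(children.iter().copied());
            }
        }
    }
    keep
}
```

With a table where `first` invokes `second` and `second` invokes `shared`, slicing to `first` keeps all three and drops anything unrelated.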

But for baked data, this is not great.

I think we can structure transliterator baked data somewhat differently: datagen can produce the following:

const DATA_TRANSLITERATOR_LATIN_HAN = ...;
const DATA_TRANSLITERATOR_LATIN_GREEK = ...;

const DATA_TRANSLITERATOR_RULES_V1: icu_provider_baked::zerotrie::Data<icu::experimental::transliterate::provider::TransliteratorRulesV1> = {
    const TRIE: _ = ...;
    const VALUES: _ = [DATA_TRANSLITERATOR_LATIN_GREEK, DATA_TRANSLITERATOR_LATIN_HAN, ...];
    ...
};

pub mod ctors {
    pub fn new_transliterator_latin_han() -> Transliterator {
        Transliterator::new_internal(DATA_TRANSLITERATOR_LATIN_HAN, ...)
    }
}

Ideally, ::new_internal() incorporates a solution to https://github.com/unicode-org/icu4x/issues/3966, so that the dependencies are passed in explicitly, with something like Transliterator::new_internal(DATA_TRANSLITERATOR_LATIN_HAN, TransliteratorDeps { casemapper: Some(CaseMapper::new()), normalizer: ..., ... }).

The calling crate can then write pub use ctors::*; somewhere to re-export these constructors.
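A self-contained mock-up of that layout may help show how the pieces fit together. Everything here is a hypothetical stand-in: `Rules`, `TransliteratorDeps`, `lookup`, and the placeholder IDs are not the real ICU4X types, and a plain array replaces the zerotrie. The point is only that each transliterator is its own constant, the attribute-keyed table references those constants, and a per-transliterator constructor touches exactly one of them.

```rust
// Hypothetical mock-up: these types stand in for the real ICU4X ones.

/// Stand-in for one transliterator's compiled rules.
pub struct Rules {
    pub id: &'static str,
}

/// Stand-ins for optional components a transliterator may need.
pub struct CaseMapper;
pub struct Normalizer;

/// Dependencies are passed in explicitly; unused ones stay `None`.
pub struct TransliteratorDeps {
    pub casemapper: Option<CaseMapper>,
    pub normalizer: Option<Normalizer>,
}

pub struct Transliterator {
    rules: &'static Rules,
    deps: TransliteratorDeps,
}

impl Transliterator {
    pub fn new_internal(rules: &'static Rules, deps: TransliteratorDeps) -> Self {
        Transliterator { rules, deps }
    }
    pub fn id(&self) -> &'static str {
        self.rules.id
    }
    pub fn has_casemapper(&self) -> bool {
        self.deps.casemapper.is_some()
    }
}

// One constant per transliterator, so each can be referenced on its own.
pub const DATA_TRANSLITERATOR_LATIN_GREEK: Rules = Rules { id: "latin-greek" };
pub const DATA_TRANSLITERATOR_LATIN_HAN: Rules = Rules { id: "latin-han" };

// The single attribute-keyed table (an array here, a zerotrie in the proposal).
static VALUES: [(&str, &Rules); 2] = [
    ("latin-greek", &DATA_TRANSLITERATOR_LATIN_GREEK),
    ("latin-han", &DATA_TRANSLITERATOR_LATIN_HAN),
];

/// Dynamic path: look a transliterator up by attribute at runtime.
pub fn lookup(id: &str) -> Option<&'static Rules> {
    VALUES.iter().find(|(key, _)| *key == id).map(|(_, rules)| *rules)
}

pub mod ctors {
    use super::{CaseMapper, Transliterator, TransliteratorDeps, DATA_TRANSLITERATOR_LATIN_HAN};

    /// Static path: references only the Latin-Han constant.
    /// The choice of dependencies here is arbitrary, for illustration.
    pub fn new_transliterator_latin_han() -> Transliterator {
        Transliterator::new_internal(
            &DATA_TRANSLITERATOR_LATIN_HAN,
            TransliteratorDeps { casemapper: Some(CaseMapper), normalizer: None },
        )
    }
}

// What the calling crate would write to surface the constructors.
pub use ctors::*;
```

A binary that only calls `new_transliterator_latin_han()` never names `VALUES` or the Latin-Greek constant, which is what would let the unused data be dropped at link time.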

[^1]: Maybe we can have a transliterator!() macro that embeds the transliterator string into the binary so that keyextract can pick it up and read it.

Manishearth • Mar 06 '25 23:03

Another way to do this would be to have one key per transliterator, and also have an omnibus key that internally loads the correct data. We can also teach datagen to do that.
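Roughly, that alternative could look like the sketch below. The `TransliteratorKey` trait, the marker types, and the placeholder payload strings are all hypothetical and only illustrate the shape: each transliterator has its own key that can be loaded directly, and the omnibus entry point dispatches to the right one by ID.

```rust
// Hypothetical: `TransliteratorKey` and the marker types are illustrative only.
pub trait TransliteratorKey {
    const ID: &'static str;
    /// Placeholder payload standing in for the compiled rules.
    const RULES: &'static str;
}

pub struct LatinGreekV1;
pub struct LatinHanV1;

impl TransliteratorKey for LatinGreekV1 {
    const ID: &'static str = "latin-greek";
    const RULES: &'static str = "<latin-greek rules>";
}

impl TransliteratorKey for LatinHanV1 {
    const ID: &'static str = "latin-han";
    const RULES: &'static str = "<latin-han rules>";
}

/// Per-transliterator path: only `K`'s data is referenced.
pub fn load<K: TransliteratorKey>() -> &'static str {
    K::RULES
}

/// Omnibus path: resolves an ID at runtime by delegating to the
/// per-transliterator keys. Datagen would generate this dispatch.
pub fn load_omnibus(id: &str) -> Option<&'static str> {
    if id == LatinGreekV1::ID {
        Some(load::<LatinGreekV1>())
    } else if id == LatinHanV1::ID {
        Some(load::<LatinHanV1>())
    } else {
        None
    }
}
```

The trade-off is that the set of keys is fixed at datagen time, so anything loaded through the omnibus path is limited to the transliterators datagen knew about.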

Manishearth • Mar 06 '25 23:03

cc @sffc @robertbastian

Manishearth • Mar 06 '25 23:03

I think I like the approach of pulling out all the singletons so that they can be directly referenced, but also keeping the single data marker with attributes. I don't think one-marker-per-transliterator will work for reasons we've discussed before, but most importantly because we should be able to load an arbitrary CLDR JSON with an arbitrary set of transliterators inside.

sffc • Mar 07 '25 01:03