icu4x
Transliteration/Segmentation bindings for external implementations
Multiple clients want to use the public ICU API for transliteration and segmentation because of its prevalence in the industry: users would not have to migrate code, only adjust build rules and dependencies. These clients would like to provide their own implementations for some language pairs.
Our implementation should meet these requirements:
- ICU4X must not depend on external implementations.
- ICU4X should expose an API for implementers to bind to (an abstract class comes to mind in Java/C++).
- It should be possible for an implementer to point the ICU4X build at the new dependency (or to expect external linkage to happen at some point).
For example, some teams specialize in ML models for language-pair transliteration, especially Indic languages <-> Romanization, and those models do a much better job than rule-based or dictionary-based solutions. Such teams would override some pairs with their solution but fall back to our general approach for the others. Similar problems exist for segmentation, and maybe for other APIs.
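To make the override-with-fallback idea concrete, here is a minimal Rust sketch. Everything in it is hypothetical and for illustration only: the `Transliterate` trait, the `Registry` type, and the two stand-in engines are assumptions, not existing ICU4X APIs. It only shows how an external engine could be registered for one language pair while every other pair falls through to a built-in engine.

```rust
use std::collections::HashMap;

/// Hypothetical trait that an external implementer would bind to.
pub trait Transliterate {
    fn transliterate(&self, input: &str) -> String;
}

/// Stand-in for the general rule-based engine (placeholder behaviour).
pub struct BuiltInEngine;

impl Transliterate for BuiltInEngine {
    fn transliterate(&self, input: &str) -> String {
        format!("[built-in] {input}")
    }
}

/// Stand-in for an externally provided engine, e.g. an ML model.
pub struct ExternalModel;

impl Transliterate for ExternalModel {
    fn transliterate(&self, input: &str) -> String {
        format!("[external] {input}")
    }
}

/// Hypothetical registry: per-pair overrides plus a general fallback.
pub struct Registry {
    overrides: HashMap<(String, String), Box<dyn Transliterate>>,
    fallback: Box<dyn Transliterate>,
}

impl Registry {
    pub fn new() -> Self {
        Self {
            overrides: HashMap::new(),
            fallback: Box::new(BuiltInEngine),
        }
    }

    /// Registers an external engine for a single (source, target) pair.
    pub fn register(&mut self, source: &str, target: &str, engine: Box<dyn Transliterate>) {
        self.overrides
            .insert((source.to_string(), target.to_string()), engine);
    }

    /// Uses the override for the pair if one exists, otherwise the fallback.
    pub fn transliterate(&self, source: &str, target: &str, input: &str) -> String {
        let key = (source.to_string(), target.to_string());
        let engine: &dyn Transliterate = match self.overrides.get(&key) {
            Some(custom) => custom.as_ref(),
            None => self.fallback.as_ref(),
        };
        engine.transliterate(input)
    }
}
```

In this sketch the core type depends only on the trait, so the external engine can live in a separate crate that is linked in by the client's build.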
This is similar to the Budou/X issue #1803: how do we include it in ICU4X without depending on it and without making it a special case?
Discussion:
- @sffc - We should focus on well-tested ICU4X engines, perhaps with features.
- @nciric - This is for specific clients to override so that they can use their code with our API.
- @echeran - In both Unicode Properties and MessageFormat, we're talking about making generic interfaces. Should we put interfaces everywhere? This theme keeps coming up.
- @Manishearth - We often discuss, do we make this thing pluggable? We need to look at who is doing the plugging and what are they plugging into. Let's say we provide a trait for external impls. Are there places in our code where they are plugging in their objects? Because Segmenter is the highest-level API we would have. Are we allowing people to override something inside Segmenter, or override the whole Segmenter?
- @sffc - If they're overriding the engine just for CJK, then that would be pluggable into ICU4X. But if we're overriding all of Segmenter, a trait would be better. Also, we have the data provider as a way to do overrides; we should avoid adding dozens of different ways to do overrides.
Action for 1.0 is to make sure we're not boxing ourselves into a corner with the currently proposed APIs.
I think we are mostly future-proof here. WordBreakSegmenter has private fields, so we could potentially add more private fields in the future. Adding a generic parameter to WordBreakSegmenter would be a breaking change, but we could use trait objects in the interim, and add the generic parameter in 2.0 if required. I am therefore going to mark this issue as resolved for 1.0 purposes.
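A rough sketch of the trait-object route, under stated assumptions: `ToyWordBreakSegmenter`, `WordBreakEngine`, and `EveryCharEngine` are hypothetical names invented for this illustration, and the toy segmenter is not the real `WordBreakSegmenter`. The point is only that a private `Option<Box<dyn Trait>>` field can be added later without changing the type's public signature, whereas a generic parameter would change it.

```rust
/// Hypothetical extension trait; not part of the ICU4X API.
pub trait WordBreakEngine {
    /// Returns byte offsets of break opportunities in `input`.
    fn breakpoints(&self, input: &str) -> Vec<usize>;
}

/// Toy stand-in for a segmenter whose fields are all private.
pub struct ToyWordBreakSegmenter {
    // Private field: adding it in a later release does not break callers,
    // and the type gains no generic parameter.
    custom_engine: Option<Box<dyn WordBreakEngine>>,
}

impl ToyWordBreakSegmenter {
    /// Existing constructor keeps its signature and uses the default engine.
    pub fn new() -> Self {
        Self { custom_engine: None }
    }

    /// New constructor added alongside the old one for external engines.
    pub fn with_engine(engine: Box<dyn WordBreakEngine>) -> Self {
        Self { custom_engine: Some(engine) }
    }

    pub fn segment(&self, input: &str) -> Vec<usize> {
        match &self.custom_engine {
            Some(engine) => engine.breakpoints(input),
            None => default_breakpoints(input),
        }
    }
}

/// Placeholder default: break at the start, after each ASCII space, and at the end.
fn default_breakpoints(input: &str) -> Vec<usize> {
    let mut breaks = vec![0];
    for (i, ch) in input.char_indices() {
        if ch == ' ' {
            breaks.push(i + 1);
        }
    }
    breaks.push(input.len());
    breaks.dedup(); // avoid duplicates for empty input or a trailing space
    breaks
}

/// Hypothetical external engine: a break at every character boundary.
pub struct EveryCharEngine;

impl WordBreakEngine for EveryCharEngine {
    fn breakpoints(&self, input: &str) -> Vec<usize> {
        input
            .char_indices()
            .map(|(i, _)| i)
            .chain(std::iter::once(input.len()))
            .collect()
    }
}
```

The trade-off is dynamic dispatch on each call; a generic parameter would allow static dispatch but would have to wait for a breaking release.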
@FrankYFTang is working on this in ICU4C. We should coordinate with him at that time.
We can do this in a non-breaking way.