icu4x
Access to collation elements
Is there a way to access the collation elements, similar to the collation elements iterator in ICU4C? This would be useful for generating things like radix trees for autocompletion.
@hsivonen thoughts on this?
I'm a bit worried about exposing this level of implementation detail. Perhaps we could hide the concrete values inside a newtype that supports comparisons, as well as rewriting the weights for "case first" or shifting punctuation from the primary to the quaternary level, without exposing the actual integers. That would also protect against anyone serializing these values and expecting them to remain valid across an ICU4X/CLDR data update.
If we were to pursue this, we should look into whether we can declare the NO_CE bit pattern as a niche so that the public API could use the idiomatic None without using up more bits.
If we did this, would we want to expose script reordering, too, as it is needed to fully implement the same comparison functionality on top of the elements that the collator itself provides?
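To make the newtype-plus-niche idea concrete, here is a minimal sketch using only the standard library. Everything in it is hypothetical: `OpaqueCe`, `from_raw`, and the `NO_CE` value are illustrative names and placeholders, not ICU4X API. It assumes one possible trick for getting a niche on stable Rust: store the element XORed with the NO_CE pattern in a `NonZeroU64`, so that NO_CE itself would map to zero and `Option<OpaqueCe>` needs no extra bits.

```rust
use std::cmp::Ordering;
use std::num::NonZeroU64;

/// Placeholder for the "no collation element" bit pattern. The real value is an
/// implementation detail; this one is assumed purely for illustration.
pub const NO_CE: u64 = 0x1_0100_0100;

/// Hypothetical opaque collation element. It stores `raw ^ NO_CE`, so the NO_CE
/// pattern corresponds to zero, which `NonZeroU64` excludes and exposes as a niche.
#[repr(transparent)]
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
pub struct OpaqueCe(NonZeroU64);

impl OpaqueCe {
    /// Wraps a raw element; returns `None` for the NO_CE pattern.
    /// In a real API this constructor would stay crate-private.
    pub fn from_raw(raw: u64) -> Option<Self> {
        NonZeroU64::new(raw ^ NO_CE).map(OpaqueCe)
    }

    /// Recovers the raw bits. Private, so callers never see the integers.
    fn raw(self) -> u64 {
        self.0.get() ^ NO_CE
    }
}

impl Ord for OpaqueCe {
    /// Compares the decoded values, because XOR does not preserve ordering.
    fn cmp(&self, other: &Self) -> Ordering {
        self.raw().cmp(&other.raw())
    }
}

impl PartialOrd for OpaqueCe {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}
```

With this layout, `Option<OpaqueCe>` has the same size as a `u64`, and comparisons go through the decoded value. Weight rewriting for "case first" or shifted punctuation would have to be added as methods on the newtype; the sketch does not attempt that.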
@markusicu Can you weigh in on whether you believe it is advisable to expose this information in ICU4X?
Another potential use case for this: checking a string against a set of candidate strings, where secondary/tertiary differences are treated as significant only if the candidates that compare equal at the primary level differ from each other at those levels.
E.g. given search string "abc" and candidates "AbC" and "ABC", the search should match "AbC" and not "ABC". The tertiary differences for a/A and c/C are ignored, but the difference b/B is significant because this differs in the candidate set.
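A rough sketch of that matching rule, to show why per-element access is needed. It does not use any ICU4X API: `ToyCe`, `toy_elements`, and `match_candidates` are hypothetical names, and the "collation elements" are a toy ASCII model in which the primary weight is the lowercased letter and the tertiary weight records whether the letter is uppercase. Secondary weights, expansions, and contractions are ignored.

```rust
/// Toy stand-in for a collation element: one primary and one tertiary weight.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
pub struct ToyCe {
    primary: u32,
    tertiary: u8,
}

/// Toy mapping (an assumption, ASCII only): primary = lowercased letter,
/// tertiary = 1 for uppercase and 0 otherwise.
pub fn toy_elements(s: &str) -> Vec<ToyCe> {
    s.chars()
        .map(|c| ToyCe {
            primary: c.to_ascii_lowercase() as u32,
            tertiary: u8::from(c.is_ascii_uppercase()),
        })
        .collect()
}

/// True if both sequences have the same primary weights in the same order.
fn primary_eq(a: &[ToyCe], b: &[ToyCe]) -> bool {
    a.len() == b.len() && a.iter().zip(b).all(|(x, y)| x.primary == y.primary)
}

/// Returns the candidates matching `query`, treating a tertiary difference as
/// significant only at positions where the primary-equal candidates disagree.
pub fn match_candidates<'a>(query: &str, candidates: &[&'a str]) -> Vec<&'a str> {
    let q = toy_elements(query);

    // Keep only the candidates that equal the query at the primary level.
    let pool: Vec<(&'a str, Vec<ToyCe>)> = candidates
        .iter()
        .map(|&c| (c, toy_elements(c)))
        .filter(|(_, ces)| primary_eq(&q, ces))
        .collect();

    // A position is significant if the pool disagrees on its tertiary weight.
    let significant: Vec<bool> = (0..q.len())
        .map(|i| {
            pool.iter()
                .any(|(_, ces)| ces[i].tertiary != pool[0].1[i].tertiary)
        })
        .collect();

    // At significant positions the query must agree with the candidate.
    pool.iter()
        .filter(|(_, ces)| {
            (0..q.len()).all(|i| !significant[i] || ces[i].tertiary == q[i].tertiary)
        })
        .map(|(c, _)| *c)
        .collect()
}
```

Under this toy model, `match_candidates("abc", &["AbC", "ABC"])` returns only "AbC": the candidates agree on the case of the first and last letters, so those positions are ignored, while the middle letter differs between candidates and must therefore match the query.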
Somewhat related: https://github.com/unicode-org/icu4x/issues/2689
> @markusicu Can you weigh in on whether you believe it is advisable to expose this information in ICU4X?
This is much like sort keys: unstable, but useful for certain things. In ICU, collation elements are mostly used inside the collation-based StringSearch. @ajtribick suggests a variation of string search that I have not seen before.
So not core API that most people would/should use, but not alien either. Would want lots of caveats about instability of raw values across versions.