icu4x
icu4x copied to clipboard
Things to make Pattern more useful
We need at least:
- [ ] ULE impls with tests
- [ ] Plural selection support
CC @younies
For the plural selection support, I think the most straightforward option is:
icu_patterntakes an optional dependency onzerovecicu_patterngets a Count enum similar to the ones @younies added in the dimension crate- A new type in
icu_patternsits on top of a ZeroMap-like structure (probably without the ZeroMap itself) with the invariant that there is always at least one pattern, enforcing this invariant when deserializing and constructing - The new type can work with any other pattern variant
An abstraction in zerovec that would be a useful building block would be VarZeroDefaultVec with the following behavior:
- Has the same byte representation as
VarZeroVecor removes the first entry from the index list (given how gnarly the VarZeroVec code is, I'm inclined to use the same byte representation) - Checks that there is at least 1 element when deserializing, otherwise returning a
VarZeroVecFormatError - Has an infallible function
first()
This would be helpful not only here but also in other cases where we have a vec that we assume always has at least 1 item; we could get rid of some GIGO cases.
We can also add ZeroDefaultVec with similar behavior.
Or we could use MultiFieldsULE: https://github.com/unicode-org/icu4x/issues/5240
Thoughts @Manishearth?
The abstraction feels rather specific to me, or perhaps the name VarZeroDefaultVec doesn't really signal anything to me. Not opposed, but not convinced either.
I came up with the name DefaultVec based on Python defaultdict which is a structure where data access is always successful, returning a default value.
Yeah, that connection doesn't come easily to me from the name, and DefaultDict is less "at least one" and more "getter has a default".
I think a thing that would be nice here is the ability to use a custom validator function in make_ule/varule, which means you can get this kind of deserialization check without unsafe. The rest doesn't seem worth it to me so far.
I don't think we should shoehorn this into zerovec, we should just use a custom encoding. This would be a lot less work, and there's more optimisation potential: the other case is mandatory, the number of cases is so low that we don't need binary search, we can bitpack much better, etc.
Ok I'll try to make it specific to icu_pattern.