icu4x icon indicating copy to clipboard operation
icu4x copied to clipboard

Things to make Pattern more useful

Open sffc opened this issue 1 year ago • 7 comments

We need at least:

  • [ ] ULE impls with tests
  • [ ] Plural selection support

CC @younies

sffc avatar Jun 27 '24 14:06 sffc

For the plural selection support, I think the most straightforward option is:

  1. icu_pattern takes an optional dependency on zerovec
  2. icu_pattern gets a Count enum similar to the ones @younies added in the dimension crate
  3. A new type in icu_pattern sits on top of a ZeroMap-like structure (probably without the ZeroMap itself) with the invariant that there is always at least one pattern, enforcing this invariant when deserializing and constructing
  4. The new type can work with any other pattern variant

sffc avatar Jul 12 '24 17:07 sffc

An abstraction in zerovec that would be a useful building block would be VarZeroDefaultVec with the following behavior:

  • Has the same byte representation as VarZeroVec or removes the first entry from the index list (given how gnarly the VarZeroVec code is, I'm inclined to use the same byte representation)
  • Checks that there is at least 1 element when deserializing, otherwise returning a VarZeroVecFormatError
  • Has an infallible function first()

This would be helpful not only here but also in other cases where we have a vec that we assume always has at least 1 item; we could get rid of some GIGO cases.

We can also add ZeroDefaultVec with similar behavior.

Or we could use MultiFieldsULE: https://github.com/unicode-org/icu4x/issues/5240

Thoughts @Manishearth?

sffc avatar Jul 12 '24 21:07 sffc

The abstraction feels rather specific to me, or perhaps the name VarZeroDefaultVec doesn't really signal anything to me. Not opposed, but not convinced either.

Manishearth avatar Jul 12 '24 22:07 Manishearth

I came up with the name DefaultVec based on Python defaultdict which is a structure where data access is always successful, returning a default value.

sffc avatar Jul 12 '24 23:07 sffc

Yeah, that connection doesn't come easily to me from the name, and DefaultDict is less "at least one" and more "getter has a default".

I think a thing that would be nice here is the ability to use a custom validator function in make_ule/varule, which means you can get this kind of deserialization check without unsafe. The rest doesn't seem worth it to me so far.

Manishearth avatar Jul 12 '24 23:07 Manishearth

I don't think we should shoehorn this into zerovec, we should just use a custom encoding. This would be a lot less work, and there's more optimisation potential: the other case is mandatory, the number of cases is so low that we don't need binary search, we can bitpack much better, etc.

robertbastian avatar Jul 16 '24 11:07 robertbastian

Ok I'll try to make it specific to icu_pattern.

sffc avatar Jul 16 '24 15:07 sffc