icu4x
icu4x copied to clipboard
Holistic zerovec abstraction design
See also: https://github.com/unicode-org/icu4x/issues/5561 for a holistic design for the core vector types
Jumping off of https://github.com/unicode-org/icu4x/pull/5520#issuecomment-2338661326
I think we should try to make our zerovec
abstractions a bit more holistically designed. I have these goals in mind:
- The
unsafe
code in the crate should be well organized and minimal. - The abstractions in the crate should be at the right level of genericity. Unsafe abstractions that are super generic tend to be harder to review and safely maintain, but if done right they can cover a lot more use cases without complicating things too much.
- Users of the crate should almost always be able to avoid custom impls of ULE/VarULE/etc unless they're trying to do really clever packing[^1]. This means that these abstractions should provide acceptably optimal data size. It is okay if there are theoretically better ways of doing things as long as we stay reasonably close to the theoretical maximum.
- Users of the crate should usually have a way to avoid using custom derives, though we should still maintain useable and versatile custom derives. What ICU4X chooses to do here ought to be a separate discussion.
For the purposes of this discussion, ICU4X is considered a typical large-scale user, as it is the primary user.
The design I've had in mind is the following. It roughly matches @sffc's design in https://github.com/unicode-org/icu4x/pull/5520#issuecomment-2338661326 but goes in more depth on how the unsafe code should be structured.
- Step 1: Improve structure of unsafe code
- [x] Refactor VarZeroVec indexing, construction code to be reusable with out-of-band length https://github.com/unicode-org/icu4x/issues/5378 (planned for 2.0)
- A strong goal here is that I want the code dealing with the VZV buffer to basically be in one place. We've iterated a bunch of times on how we'd like that buffer to eventually work, and this code is messy and previously was hard to review, so consolidating it is quite important to me. I even find the gnarly mutation code to be something that's worth avoiding if possible since it's another place where tricky VZV buffer invariants live, but perhaps that's too big an ask and the mutation is useful. ICU4X doesn't use it, we always use the wholesale conversion code that constructs from a slice/vec.
- [x] Update MultiFieldsULE to use this, and supply length from const param https://github.com/unicode-org/icu4x/issues/5240 (planned for 2.0)
- [ ] Potentially revisit the need to maintain VZV mutation code (unplanned for 2.0. worth discussing quickly). https://github.com/unicode-org/icu4x/issues/5583
- [x] Refactor VarZeroVec indexing, construction code to be reusable with out-of-band length https://github.com/unicode-org/icu4x/issues/5378 (planned for 2.0)
- Step 2: Perform optimizations enabled by the codesharing
- [x] Remove the zero-length from the VarZeroVec index array https://github.com/unicode-org/icu4x/issues/5516
- [ ] Make VarZeroVec branchless https://github.com/unicode-org/icu4x/issues/5507 (unplanned for 2.0. we should ideally make stuff private in 2.0 so we can make the change later)
- [ ] Make VarZeroVec no use smaller length widths for smaller formats
- Step 3: Actual holistic abstractions. We should end up with:
- [x]
ZeroVec<T>
,VarZeroVec<T, Format>
, plus slice types (exists) - [x] (public but largely "internal")
MultiFieldsULE<N, Format>
type. Would enjoy suggestions on the name[^2] since it's more than just fields. MaybeUntypedMultiVarULE
. (planned for 2.0) - [ ] Macros for simple wrappers for both ULE and VarULE. https://github.com/unicode-org/icu4x/issues/5127 (unplanned for 2.0, but is a purely internal implementation change that can happen whenever we want)
- [x]
VarTupleULE
that represents(ULE, VarULE)
. (already exists, would like name suggestions). - [x]
TupleNULE
that represents(ULE, ULE, ...)
. We've done a good job minimizing unsafe here. (already exists, would like name suggestions) - [ ]
VarTupleNULE
that represents(VarULE, VarULE, ...)
. Implemented usingMultiFieldsULE
, basically a typed version of the same. Should not have too much unsafe, ideally using macros or something to do the N-tuples. I'd like to have some of this on track for 2.0, I'm okay with not getting all of it but I want to try. - [x]
OptionULE
,OptionVarULE
that representOption<T>
. We already have these, but I'd like to potentially switch these over to usingVarTupleULE
in their internal implementation. (No change planned for 2.0, nice-to-have)
- [x]
- Step 4 (future, not yet designed): Better bitpacking abstractions. In the long run, I'd like us to do a similar exercise for generalizing our bitpacking work. I haven't put much design thought into this so I'm not going to actually write a full design, nor do I wish to milestone this at the moment, but it's worth giving a rough sketch:
- [ ] We should have more generic types that work like
NichedOption
, but aren't just Option. Ideally they let us do more complex enums. - [ ] It should be possible to do more bitpacky things with the VarZeroVec array type.
- [ ] It should be possible to do more bitpacky things with bitpacky TupleULE types and the custom derive.
- [ ] We should have more generic types that work like
I am not fully happy with the names of these generic types, but we can bikeshed and we can punt naming past 2.0. I would like to see if we can come up with something better than TupleN
, VarTuple
, and VarTupleN
(or VarVar
). I don't really find VarVar
to be evocative of the right thing, it sounds nested rather than just being "two Var
s". TupleN
and VarTupleN
sound about right to me, but we then need a name for the (ULE, VarULE)
type. Maybe MixedTuple
?
Overall I think this is work I can get done in time for 2.0, at least sufficiently so that we're not blocked on stable data formats in the future, even if the zerovec APIs may change a bit.
cc @sffc @robertbastian
[^1]: In the long run it would be nice to support that as well, for now I am comfortable not trying to cover that too much in the current holistic design. [^2]: I don't consider the bikesheds here to be 2.0 blockers, but I do wish to solicit name suggestions.