icu4x
Consider supporting 1, 2, 4, and 24-bit trie values
The trie builder always operates on 32-bit values and can then narrow the main backing array value to 8 or 16 bits at serialization time.
We already use a byte array as unaligned backing storage. We should consider slightly extending how index-based reads map onto the backing byte array, so that it supports more compact value widths:
If the byte array had one extra byte at the end, we could use 32-bit unaligned loads to read 24-bit values (masking off the highest 8 bits) without going out of bounds. See also #4669.
For 1, 2, and 4-bit values, we could shift and mask the index to read smaller parts of bytes from an array that was 1/8, 1/4, or 1/2 in byte length compared to using 8 bits as the narrowest value.
1 bit is useful for accessing a binary property faster than from a fragmented inversion list.
2 bits is useful for bundling two co-occurring binary properties.
4 bits is useful for enumerated properties with few distinct values, e.g. Joining_Type.
24 bits is useful for scalar values.
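To make the two read paths above concrete, here is a minimal sketch. The function names `read_u24` and `read_small` are hypothetical and not part of any ICU4X API. It assumes little-endian storage, one trailing padding byte after the 24-bit values, and sub-byte values packed starting from the least significant bit of each byte. The real serialization format could reasonably choose differently.

```rust
/// Reads the 24-bit value at `index` from `bytes`.
/// Assumption: values are stored little-endian, 3 bytes each, and the array
/// has one extra padding byte at the end so the 4-byte read of the last
/// value stays in bounds.
fn read_u24(bytes: &[u8], index: usize) -> Option<u32> {
    let start = index.checked_mul(3)?;
    let end = start.checked_add(4)?;
    let s = bytes.get(start..end)?;
    // One 32-bit unaligned read, then mask off the highest 8 bits.
    Some(u32::from_le_bytes([s[0], s[1], s[2], s[3]]) & 0x00FF_FFFF)
}

/// Reads the `width`-bit value (width is 1, 2, or 4) at `index`.
/// Assumption: values are packed from the least significant bit upward.
fn read_small(bytes: &[u8], index: usize, width: usize) -> Option<u8> {
    if !matches!(width, 1 | 2 | 4) {
        return None;
    }
    let per_byte = 8 / width; // 8, 4, or 2 values per byte
    let byte = *bytes.get(index / per_byte)?;
    let shift = (index % per_byte) * width;
    let mask = (1u8 << width) - 1;
    Some((byte >> shift) & mask)
}
```

With this packing, the backing array for N values is N/8, N/4, or N/2 bytes (rounded up) for 1, 2, or 4-bit values, and 3N + 1 bytes for 24-bit values.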
Some thoughts:
- CodePointTrie currently just wraps a ZeroVec for accessing values. ZeroVec transformations are easy to do in Rust
- CodePointTrie could have a getter function that returns an `enum { DefaultValue, ErrorValue, Index(usize) }`, and we could completely remove the type parameter from CodePointTrie, similar to how ZeroTrie works
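A rough sketch of what such a type-free getter could look like. Everything here is hypothetical: `Lookup`, `ToyIndex`, and `get_or` are illustration names, and the toy block table is a stand-in, not the real CodePointTrie index layout. The point is only that the index lookup needs no value type, and the caller resolves the result against whatever storage it has.

```rust
/// Hypothetical result of a lookup that knows nothing about the value type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Lookup {
    DefaultValue,
    ErrorValue,
    Index(usize),
}

const BLOCK_SHIFT: u32 = 6;
const BLOCK_MASK: u32 = (1 << BLOCK_SHIFT) - 1;

/// Toy stand-in for the trie index (NOT the real CodePointTrie layout):
/// one entry per 64-code-point block, `None` meaning "all default".
struct ToyIndex {
    block_starts: Vec<Option<u32>>,
}

impl ToyIndex {
    fn lookup(&self, cp: u32) -> Lookup {
        if cp > 0x10FFFF {
            return Lookup::ErrorValue;
        }
        match self.block_starts.get((cp >> BLOCK_SHIFT) as usize) {
            Some(Some(start)) => Lookup::Index((*start + (cp & BLOCK_MASK)) as usize),
            _ => Lookup::DefaultValue,
        }
    }
}

/// The caller picks the value storage; here a plain slice.
fn get_or<T: Copy>(values: &[T], lookup: Lookup, default: T, error: T) -> T {
    match lookup {
        Lookup::Index(i) => values.get(i).copied().unwrap_or(error),
        Lookup::DefaultValue => default,
        Lookup::ErrorValue => error,
    }
}
```

Under this shape, a caller with 1, 2, 4, or 24-bit packed storage would take the `Index(usize)` and feed it to its own width-specific read.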
I'll put this in the 2.0 milestone, but it isn't super-high priority and it could slip to 3.0.
Marking as a non-blocker in the 2.0 dashboard. Unlikely to make 2.0.
unicode-ident cites https://github.com/rust-lang/rust/pull/33098/files for a data structure that's similar to CodePointTrie with single-bit values.
In terms of goal formulation, if we want icu_properties to be the single source of Unicode Database data for apps, it should be possible to make unicode-ident use icu_properties without regressing performance. For that, we'd need an alternative to inversion lists for binary properties along the lines of https://github.com/rust-lang/rust/pull/33098/files.
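For discussion, here is a simplified sketch of the general idea of a trie with single-bit values: a per-chunk table pointing into deduplicated 64-bit leaves. This is not the layout from the linked PR and not an ICU4X API; `BitTrie` and its methods are hypothetical names, and the builder takes an arbitrary predicate purely for illustration.

```rust
/// Simplified two-level bit set over code points (illustrative only).
struct BitTrie {
    /// For each 64-code-point chunk, the index of its leaf in `leaves`.
    /// There are 0x110000 / 64 = 17408 chunks, so a u16 always fits.
    chunk_to_leaf: Vec<u16>,
    /// Deduplicated leaves, one bit per code point in the chunk.
    leaves: Vec<u64>,
}

impl BitTrie {
    fn from_predicate(pred: impl Fn(u32) -> bool) -> Self {
        let mut leaves: Vec<u64> = Vec::new();
        let mut chunk_to_leaf = Vec::new();
        for chunk in 0..(0x110000u32 / 64) {
            let mut leaf = 0u64;
            for bit in 0..64 {
                if pred(chunk * 64 + bit) {
                    leaf |= 1u64 << bit;
                }
            }
            // Reuse an identical leaf if one already exists.
            let pos = match leaves.iter().position(|&l| l == leaf) {
                Some(p) => p,
                None => {
                    leaves.push(leaf);
                    leaves.len() - 1
                }
            };
            chunk_to_leaf.push(pos as u16);
        }
        BitTrie { chunk_to_leaf, leaves }
    }

    /// Two array reads, a shift, and a mask; no binary search.
    fn contains(&self, cp: u32) -> bool {
        match self.chunk_to_leaf.get((cp >> 6) as usize) {
            Some(&leaf) => (self.leaves[leaf as usize] >> (cp & 63)) & 1 != 0,
            None => false,
        }
    }
}
```

Compared with an inversion list, lookup cost here does not depend on how fragmented the set is, at the price of a fixed-size chunk table.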