
Consider supporting 1, 2, 4, and 24-bit trie values

Open hsivonen opened this issue 1 year ago • 2 comments

The trie builder always operates on 32-bit values and can then narrow the main backing array value to 8 or 16 bits at serialization time.
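For illustration, the narrowing step can be sketched as below. This is not the ICU4X builder: the `Narrowed` enum and `narrow` function are hypothetical names, and picking the width automatically from the largest value is an assumption made for the example.

```rust
/// Hypothetical container for a data array narrowed to one of three widths.
#[derive(Debug, PartialEq, Eq)]
enum Narrowed {
    Bits8(Vec<u8>),
    Bits16(Vec<u16>),
    Bits32(Vec<u32>),
}

/// Picks the narrowest width that can hold every 32-bit value the builder
/// produced. Assumption: the width is derived from the data rather than
/// requested by the caller.
fn narrow(values: &[u32]) -> Narrowed {
    let max = values.iter().copied().max().unwrap_or(0);
    if max <= u32::from(u8::MAX) {
        // Every value fits in 8 bits, so the truncating cast is lossless.
        Narrowed::Bits8(values.iter().map(|&v| v as u8).collect())
    } else if max <= u32::from(u16::MAX) {
        // Every value fits in 16 bits.
        Narrowed::Bits16(values.iter().map(|&v| v as u16).collect())
    } else {
        Narrowed::Bits32(values.to_vec())
    }
}
```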

We already use a byte array as unaligned backing storage. We should consider slightly extending how index-based reads map onto that byte array, to support more compact value widths:

If the byte array had one extra byte at the end, we could use 32-bit unaligned loads to read 24-bit values (masking off the highest 8 bits) without going out of bounds. See also #4669.
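A minimal sketch of the 24-bit read, assuming little-endian storage and a single padding byte appended by the builder. `pack_u24` and `read_u24` are hypothetical helpers for illustration, not existing ICU4X functions.

```rust
/// Packs values into 3 bytes each (little-endian) and appends one padding
/// byte so that a 4-byte read of the last value stays in bounds.
/// Assumption: every value fits in 24 bits; higher bits are dropped.
fn pack_u24(values: &[u32]) -> Vec<u8> {
    let mut out = Vec::with_capacity(values.len() * 3 + 1);
    for &v in values {
        out.extend_from_slice(&v.to_le_bytes()[..3]);
    }
    out.push(0); // the extra byte at the end
    out
}

/// Reads the 24-bit value at `index` with one unaligned 32-bit load,
/// then masks off the highest 8 bits (which belong to the next value
/// or to the padding byte).
fn read_u24(bytes: &[u8], index: usize) -> Option<u32> {
    let start = index.checked_mul(3)?;
    let end = start.checked_add(4)?;
    let b = bytes.get(start..end)?;
    Some(u32::from_le_bytes([b[0], b[1], b[2], b[3]]) & 0x00FF_FFFF)
}
```

The bounds check covers all four bytes at once, so without the padding byte the last value would be reported as out of range.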

For 1, 2, and 4-bit values, we could shift and mask the index to read sub-byte fields from an array whose byte length is 1/8, 1/4, or 1/2 of what it would be with 8 bits as the narrowest value width.
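The shift-and-mask read could look roughly like this. The bit order within a byte (lowest bits first) and the names `pack_bits` and `read_bits` are assumptions for the sketch, not a proposed serialization format.

```rust
/// Packs `values` into fields of `bits` bits each, lowest bits first.
/// `bits` must be 1, 2, or 4; values are truncated to that width.
fn pack_bits(values: &[u8], bits: u32) -> Vec<u8> {
    debug_assert!(matches!(bits, 1 | 2 | 4));
    let per_byte = (8 / bits) as usize;
    let mask = (1u8 << bits) - 1;
    let mut out = vec![0u8; (values.len() + per_byte - 1) / per_byte];
    for (i, &v) in values.iter().enumerate() {
        let shift = (i % per_byte) as u32 * bits;
        out[i / per_byte] |= (v & mask) << shift;
    }
    out
}

/// Reads the field at `index`: the upper part of the index selects the
/// byte, the lower part selects the shift within that byte.
fn read_bits(bytes: &[u8], index: usize, bits: u32) -> Option<u8> {
    debug_assert!(matches!(bits, 1 | 2 | 4));
    let per_byte = (8 / bits) as usize;
    let byte = *bytes.get(index / per_byte)?;
    let shift = (index % per_byte) as u32 * bits;
    let mask = (1u8 << bits) - 1;
    Some((byte >> shift) & mask)
}
```

Because `per_byte` is a power of two, the division and remainder compile down to a shift and a mask on the index.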

1 bit is useful for accessing a binary property faster than from a fragmented inversion list. 2 bits is useful for bundling two co-occurring binary properties. 4 bits is useful for enumerated properties with few distinct values, e.g. Joining_Type. 24 bits is useful for scalar values.

hsivonen, Mar 08 '24 12:03

Some thoughts:

  • CodePointTrie currently just wraps a ZeroVec for accessing values. ZeroVec transformations are easy to do in Rust
  • CodePointTrie could have a getter function that returns an enum { DefaultValue, ErrorValue, Index(usize) } and completely remove the type parameter from CodePointTrie, similar to how ZeroTrie works

sffc, Jul 24 '24 16:07

I'll put this in the 2.0 milestone, but it isn't super-high priority and it could slip to 3.0.

sffc, Jul 24 '24 16:07

Marking as a non-blocker in the 2.0 dashboard. Unlikely to make 2.0.

Manishearth, Oct 23 '24 18:10

unicode-ident cites https://github.com/rust-lang/rust/pull/33098/files for a data structure that's similar to CodePointTrie with single-bit values.

In terms of goal formulation, if we want icu_properties to be the single source of Unicode Database data for apps, it should be possible to make unicode-ident use icu_properties without regressing performance. For that, we'd need a https://github.com/rust-lang/rust/pull/33098/files -like alternative for inversion lists for binary properties.

hsivonen, Sep 04 '25 07:09