Investigate opportunities to further optimize CodePointTrie
CodePointTrie is already highly optimized, but someone should look at it with fresh eyes and determine whether there is any low-hanging fruit that could make it even faster, e.g. for the purposes of the Segmenter.
Blocked on https://github.com/unicode-org/icu4x/issues/1847
Some additional thoughts on this: CPT performance is especially critical for Segmenter. Prior to #1839, Segmenter was using a fully expanded table covering almost all code points (1 byte per code point, up to approximately 900,000 code points). Transitioning to the more compact CPT representation drastically reduced memory usage, at the expense of some performance.
To regain the performance for segmentation, I see a few paths forward:
- See if we can improve CPT itself: can we optimize the lookup algorithm? Inline more code? Replace expensive primitives with faster ones?
- If we confirm that CPT is maximally performant, should we consider adding an additional mode to CPT, like "extrafast", which is even faster than "fast", providing one-stage lookup for low-BMP code points at the expense of more data size?
- Alternatively, if we don't want to clutter the CPT code, we could make the segmenter data an enum between CPT and the fully expanded property table? (A sketch of this alternative follows the list.)
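
A minimal sketch of what the enum alternative could look like. The names `SegmenterPropertyTable` and its variants are hypothetical, and it assumes the `CodePointTrie::get32` lookup and one `u8` property value per code point:

```rust
use icu_collections::codepointtrie::CodePointTrie;

// Hypothetical enum: either the compact trie or the fully expanded table.
enum SegmenterPropertyTable<'data> {
    Trie(CodePointTrie<'data, u8>),
    // Fully expanded: one byte per code point, indexed directly.
    Expanded(&'data [u8]),
}

impl SegmenterPropertyTable<'_> {
    fn get(&self, c: u32) -> u8 {
        match self {
            Self::Trie(trie) => trie.get32(c),
            Self::Expanded(table) => table.get(c as usize).copied().unwrap_or(0),
        }
    }
}
```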
I suggest creating a borrowed counterpart of CodePointTrie that holds ZeroSlices instead of ZeroVecs, making CodePointTrie deref into that type, and moving the lookup code onto the borrowed type. This way, the normalizer could hold the borrowed type by value instead of holding a reference to CodePointTrie itself, and could avoid the ZeroVec discriminant branches on every lookup.
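
A rough sketch of that shape, with hypothetical names (the real icu4x type has more fields and a more involved lookup). Note that a literal `Deref` impl is awkward here because `deref` must return a reference to an existing value, so an explicit `as_borrowed`-style conversion is shown instead:

```rust
use zerovec::{ZeroSlice, ZeroVec};

// Hypothetical stand-in for the owned trie.
pub struct CodePointTrieExample<'trie> {
    index: ZeroVec<'trie, u16>,
    data: ZeroVec<'trie, u32>,
}

// Borrowed counterpart: plain ZeroSlice references carry no
// owned-vs-borrowed discriminant, so lookups skip that branch.
pub struct CodePointTrieExampleBorrowed<'a> {
    index: &'a ZeroSlice<u16>,
    data: &'a ZeroSlice<u32>,
}

impl<'trie> CodePointTrieExample<'trie> {
    pub fn as_borrowed(&self) -> CodePointTrieExampleBorrowed<'_> {
        CodePointTrieExampleBorrowed {
            index: &self.index,
            data: &self.data,
        }
    }
}

impl CodePointTrieExampleBorrowed<'_> {
    // Lookup lives on the borrowed type. The two-stage index math here
    // is a simplified stand-in for the real algorithm.
    pub fn get32(&self, code_point: u32) -> u32 {
        let block = self.index.get((code_point >> 6) as usize).unwrap_or(0) as usize;
        self.data
            .get(block + (code_point & 0x3f) as usize)
            .unwrap_or(0)
    }
}
```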
See also #2410
Adding the discuss-priority label to discuss whether the idea from my previous comment would be a breaking change after 1.0 and, therefore, would have to be done before 1.0.
Probably how I'd like to do this would be to introduce something like CodePointTrieWithStore<T, S> where S: Deref<Target = [T::ULE]>, and then define type CodePointTrie<T> = CodePointTrieWithStore<T, ZeroVec<T>>
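
Roughly the shape being proposed, as a hypothetical sketch. As the discussion below notes, Deref is probably not the final bound (for one thing, ZeroVec derefs to ZeroSlice rather than to [T::ULE]):

```rust
use core::marker::PhantomData;
use core::ops::Deref;
use zerovec::ule::AsULE;

// Hypothetical sketch; the exact bound on S is left open below.
pub struct CodePointTrieWithStore<T: AsULE, S>
where
    S: Deref<Target = [T::ULE]>,
{
    data: S,
    _phantom: PhantomData<T>,
}

// The existing name would then become a typedef over the owned store,
// along the lines of (pseudocode, modulo lifetimes and the bound):
// type CodePointTrie<'a, T> = CodePointTrieWithStore<T, ZeroVec<'a, T>>;
```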
Discussion:
- @sffc - Is this something we can do post 1.0?
- @Manishearth - If we typedef it? Yes... it should work, but it's a corner of Rust that behaves weirdly. Typedefs with type parameters are weird
- @robertbastian - I think it won't let you create one without a generic
- @Manishearth - Is Deref the right trait?
- @sffc - Probably not. We can discuss the exact bound later.
Conclusion: This is doable post 1.0
FYI
- CPT stores data for ASCII characters (0..7F) in a linear array that can be indexed directly.
- A "fast" CPT uses a simple 2-stage lookup for all of the BMP. I don't know how to make this better without using a large, linear array for the BMP.
- A "fast" CPT supports an optimized from-UTF-8-bytes lookup of BMP characters where the last trail byte (which carries the least significant 6 bits of the code point) is used all and only for indexing into the last ("data") array. No need to assemble the full code point if you don't otherwise need it.
- For details see the ICU4C macro UCPTRIE_FAST_U8_NEXT().
- A "small" CPT supports this for 2-byte UTF-8 characters (0080..07FF) [plus a sliver of 3-byters but probably not worth it].
And we may want to revisit the normalizer / collator to see if we can get performance gains from an ICU4X port of ICU's optimized special-case APIs for getting values from a CPT for UTF-8 / UTF-16 input.
(FWIW, IIUC, the ICU implementations of the normalizer / collator do make use of this strategy of using the code units directly to traverse the CPT, rather than constructing the whole code point and then letting the CPT break it down again to traverse.)