Add iterators over UTF/Latin1 slices that fuse a CodePointTrie lookup
The special boundary values of CodePointTrie are designed to line up with special boundaries in UTF-8 and, in the case of the fast mode, with the special boundary in UTF-16.
(For both trie types, 1-byte UTF-8, i.e. ASCII, can go directly to the data slice. For the fast trie type, the special boundary lines up with the boundary between 1-code-unit and 2-code-unit UTF-16 sequences and the boundary between 3-code-unit and 4-code-unit UTF-8 sequences. For the small type, the special boundary lines up with the boundary between 2-code-unit and 3-code-unit UTF-8 sequences.)
When we iterate over a slice of textual data by char, we have to re-execute branches that UTF decode already executed.
We should have iterators over str, potentially-not-well-formed UTF-8, UTF-16, and Latin1 that take a reference to a TypedCodePointTrie in addition to taking a slice and yield pairs of char and TrieValue. (Or perhaps we need a trait that also untyped CodePointTrie implements.)
For str and Latin1, these belong in the ICU4X repo itself. For potentially-not-well-formed UTF-8 and UTF-16, perhaps the utf8_iter and utf16_iter crates should get a feature that brings is icu_collections as an optional dependency to enable fusing the trie lookup into the UTF decoding.
For the small type, the special boundary lines up with the boundary between 2-code-unit and 3-code-unit UTF-8 sequences.
Oops. That's not correct. With the small trie mode, the special boundary is U+1000 whereas the 3-code-unit UTF-8 sequences start at U+0800. It is the case, though, that two-byte UTF-8 sequences can always take a branchless path through CodePointTrie after the UTF-8 decoder has established that it has a proper two-byte sequence.
I think this requires adding four new public getters for the three kinds CodePointTries:
- Latin1 getter (
u8argument; safe) - ASCII getter (unsafe with the safety invariant that the
u8must not have the highest bit set) - 2-byte UTF-8 getter taking low six bits and high five bits as two distinct integers (most likely
u32s) (unsafe with the safety invariant that the low six bits integer must not have bits other than the low six set and the high five bits integer must not have bits other than the low five bits set) - 3-byte UTF-8 getter taking low six bits and high 10 bits as two distinct integers (most likely
u32s) (unsafe with the safety invariant that the low six bits integer must not have bits other than the low six set and the high ten bits integer must not have bits other than the low ten bits set)
It's not great to introduce 3 unsafe getters, but I don't have better ideas given the performance goals. The safety invariants could and would be debug_assert!ed.
@Manishearth, Do you think the above unsafe API would be acceptable? A correct UTF-8 decoder that uses the getters as intended will pretty much have to have the right kind of bit masking (and in the three-byte case, shifting) that satisfies the constraints. I'm not aware of a Rust mechanism that would allow encoding the constraints in Rust types without branches or unsafe at a different point. Is there a better way than stating the above safety invariants and debug_assert!ing them?
Some old notes from a WG discussion:
- Henri wishes to add unsafe code to CPT
- Manish is worried about unsafe invariants that rely on the correctness of CPT algorithms
- Henri explains that these are like the existing invariants
- Manish is fine with this, may need to check on deserialization perf, but it's probably not an issue
- Henri says there's going to be only one more branch on the deserialization path (minimum length for the data array)
Do you think the above unsafe API would be acceptable? A correct UTF-8 decoder that uses the getters as intended will pretty much have to have the right kind of bit masking (and in the three-byte case, shifting) that satisfies the constraints
Hmm, I think I never responded to this.
I can't really give a good answer without knowing how the invariants work. Generally I wish invariants to be as local as possible: easily tracked through the code with comments so that you're never unsure if they are upheld. I'd have to see the code.
Sadly, https://github.com/rust-lang/rust/issues/110998 still isn't in stable, so can't use core::ascii::Char as an argument type to avoid an unsafe invariant on u8.
Apparently, there's https://github.com/rust-lang/rust/issues/123646 that could safely describe the API for two and three-byte UTF-8 accessors, but it's not on stable, either.
I can't really give a good answer without knowing how the invariants work. Generally I wish invariants to be as local as possible: easily tracked through the code with comments so that you're never unsure if they are upheld. I'd have to see the code.
Draft cptrie code: https://github.com/unicode-org/icu4x/pull/7279
Tests and callers not written, yet.