icu4x icon indicating copy to clipboard operation
icu4x copied to clipboard

Add iterators over UTF/Latin1 slices that fuse a CodePointTrie lookup

Open hsivonen opened this issue 4 months ago • 6 comments

The special boundary values of CodePointTrie are designed to line up with special boundaries in UTF-8 and, in the case of the fast mode, with the special boundary in UTF-16.

(For both trie types, 1-byte UTF-8, i.e. ASCII, can go directly to the data slice. For the fast trie type, the special boundary lines up with the boundary between 1-code-unit and 2-code-unit UTF-16 sequences and the boundary between 3-code-unit and 4-code-unit UTF-8 sequences. For the small type, the special boundary lines up with the boundary between 2-code-unit and 3-code-unit UTF-8 sequences.)

When we iterate over a slice of textual data by char, we have to re-execute branches that UTF decode already executed.

We should have iterators over str, potentially-not-well-formed UTF-8, UTF-16, and Latin1 that take a reference to a TypedCodePointTrie in addition to taking a slice and yield pairs of char and TrieValue. (Or perhaps we need a trait that also untyped CodePointTrie implements.)

For str and Latin1, these belong in the ICU4X repo itself. For potentially-not-well-formed UTF-8 and UTF-16, perhaps the utf8_iter and utf16_iter crates should get a feature that brings is icu_collections as an optional dependency to enable fusing the trie lookup into the UTF decoding.

hsivonen avatar Sep 08 '25 14:09 hsivonen

For the small type, the special boundary lines up with the boundary between 2-code-unit and 3-code-unit UTF-8 sequences.

Oops. That's not correct. With the small trie mode, the special boundary is U+1000 whereas the 3-code-unit UTF-8 sequences start at U+0800. It is the case, though, that two-byte UTF-8 sequences can always take a branchless path through CodePointTrie after the UTF-8 decoder has established that it has a proper two-byte sequence.

I think this requires adding four new public getters for the three kinds CodePointTries:

  • Latin1 getter (u8 argument; safe)
  • ASCII getter (unsafe with the safety invariant that the u8 must not have the highest bit set)
  • 2-byte UTF-8 getter taking low six bits and high five bits as two distinct integers (most likely u32s) (unsafe with the safety invariant that the low six bits integer must not have bits other than the low six set and the high five bits integer must not have bits other than the low five bits set)
  • 3-byte UTF-8 getter taking low six bits and high 10 bits as two distinct integers (most likely u32s) (unsafe with the safety invariant that the low six bits integer must not have bits other than the low six set and the high ten bits integer must not have bits other than the low ten bits set)

It's not great to introduce 3 unsafe getters, but I don't have better ideas given the performance goals. The safety invariants could and would be debug_assert!ed.

@Manishearth, Do you think the above unsafe API would be acceptable? A correct UTF-8 decoder that uses the getters as intended will pretty much have to have the right kind of bit masking (and in the three-byte case, shifting) that satisfies the constraints. I'm not aware of a Rust mechanism that would allow encoding the constraints in Rust types without branches or unsafe at a different point. Is there a better way than stating the above safety invariants and debug_assert!ing them?

hsivonen avatar Sep 09 '25 10:09 hsivonen

Some old notes from a WG discussion:

  • Henri wishes to add unsafe code to CPT
  • Manish is worried about unsafe invariants that rely on the correctness of CPT algorithms
  • Henri explains that these are like the existing invariants
  • Manish is fine with this, may need to check on deserialization perf, but it's probably not an issue
  • Henri says there's going to be only one more branch on the deserialization path (minimum length for the data array)

sffc avatar Nov 10 '25 19:11 sffc

Do you think the above unsafe API would be acceptable? A correct UTF-8 decoder that uses the getters as intended will pretty much have to have the right kind of bit masking (and in the three-byte case, shifting) that satisfies the constraints

Hmm, I think I never responded to this.

I can't really give a good answer without knowing how the invariants work. Generally I wish invariants to be as local as possible: easily tracked through the code with comments so that you're never unsure if they are upheld. I'd have to see the code.

Manishearth avatar Nov 10 '25 19:11 Manishearth

Sadly, https://github.com/rust-lang/rust/issues/110998 still isn't in stable, so can't use core::ascii::Char as an argument type to avoid an unsafe invariant on u8.

hsivonen avatar Dec 04 '25 18:12 hsivonen

Apparently, there's https://github.com/rust-lang/rust/issues/123646 that could safely describe the API for two and three-byte UTF-8 accessors, but it's not on stable, either.

hsivonen avatar Dec 05 '25 07:12 hsivonen

I can't really give a good answer without knowing how the invariants work. Generally I wish invariants to be as local as possible: easily tracked through the code with comments so that you're never unsure if they are upheld. I'd have to see the code.

Draft cptrie code: https://github.com/unicode-org/icu4x/pull/7279

Tests and callers not written, yet.

hsivonen avatar Dec 05 '25 08:12 hsivonen