icu4x
icu4x copied to clipboard
Investigate removing scalar value validation from HarfBuzz canonical composition and decomposition
The initial glue code for providing the canonical composition and canonical decomposition operation to HarfBuzz assumes that the incoming hb_codepoint_t values might not be valid scalar values.
We should investigate avoiding the validation overhead by:
- Investigating what HarfBuzz itself guarantees.
- Considering making the relevant ICU4X APIs have
u32-argument alternatives to thechar-argument methods.
For decomposition, it should be fine to make the ICU4X code operate on u32 and to make the current char-argument version a convenience wrapper on top of that.
For composition, ICU4X internals seem sufficiently reliant on having assurances of scalar valueness that either we should validate the inputs provided by HarfBuzz or take the position that if HarfBuzz passes non-scalar values that's the kind of error in using the HarfBuzz API that UB is acceptable.
or take the position that if HarfBuzz passes non-scalar values that's the kind of error in using the HarfBuzz API that UB is acceptable.
The HarfBuzz API for bypassing input validation now says the caller is responsible for passing in scalar values into HarfBuzz: https://github.com/harfbuzz/harfbuzz/commit/f2297e6978087cefd27a947e044adf9e89eab5f6
We could now take the position that calling hb_buffer_add_codepoints is conceptually like calling Rust's core::str::from_utf8_unchecked.
I don't think it's worth over-engineering this to avoid a validation which is essentially free.
CC @echeran
...either we should validate the inputs provided by HarfBuzz or take the position that if HarfBuzz passes non-scalar values that's the kind of error in using the HarfBuzz API that UB is acceptable.
That sounds reasonable. As I see it, either way, it's not an error that is actionable by the user, so UB without crashing seems acceptable indeed. FWIW, I think this is also in line with the proposed rules of thumb in the data safety PR.
We could now take the position that calling
hb_buffer_add_codepointsis conceptually like calling Rust'score::str::from_utf8_unchecked.
That makes sense. And at least we're passing Rust chars, so we're taking care of the newly clarified requirement for callers.
@hsivonen Thoughts on the latest comments from myself and @echeran?
Ping @hsivonen
Sorry about missing that this was blocking on me. My preference is that we trust that scalar values that we receive from HarfBuzz are indeed scalar values and, therefore, can be interpreted as Rust char without validation.
Ok, seems reasonable. In any case I'd prefer to fix this in the harfbuzz glue code in the upstream repo so that we avoid interacting with raw harfbuzz types in the icu repo. #3257