icu4x icon indicating copy to clipboard operation
icu4x copied to clipboard

Investigate removing scalar value validation from HarfBuzz canonical composition and decomposition

Open hsivonen opened this issue 3 years ago • 8 comments

The initial glue code for providing the canonical composition and canonical decomposition operation to HarfBuzz assumes that the incoming hb_codepoint_t values might not be valid scalar values.

We should investigate avoiding the validation overhead by:

  1. Investigating what HarfBuzz itself guarantees.
  2. Considering making the relevant ICU4X APIs have u32-argument alternatives to the char-argument methods.

hsivonen avatar Nov 22 '22 08:11 hsivonen

For decomposition, it should be fine to make the ICU4X code operate on u32 and to make the current char-argument version a convenience wrapper on top of that.

For composition, ICU4X internals seem sufficiently reliant on having assurances of scalar valueness that either we should validate the inputs provided by HarfBuzz or take the position that if HarfBuzz passes non-scalar values that's the kind of error in using the HarfBuzz API that UB is acceptable.

hsivonen avatar Nov 24 '22 12:11 hsivonen

or take the position that if HarfBuzz passes non-scalar values that's the kind of error in using the HarfBuzz API that UB is acceptable.

The HarfBuzz API for bypassing input validation now says the caller is responsible for passing in scalar values into HarfBuzz: https://github.com/harfbuzz/harfbuzz/commit/f2297e6978087cefd27a947e044adf9e89eab5f6

We could now take the position that calling hb_buffer_add_codepoints is conceptually like calling Rust's core::str::from_utf8_unchecked.

hsivonen avatar Nov 28 '22 08:11 hsivonen

I don't think it's worth over-engineering this to avoid a validation which is essentially free.

CC @echeran

sffc avatar Dec 22 '22 17:12 sffc

...either we should validate the inputs provided by HarfBuzz or take the position that if HarfBuzz passes non-scalar values that's the kind of error in using the HarfBuzz API that UB is acceptable.

That sounds reasonable. As I see it, either way, it's not an error that is actionable by the user, so UB without crashing seems acceptable indeed. FWIW, I think this is also in line with the proposed rules of thumb in the data safety PR.

We could now take the position that calling hb_buffer_add_codepoints is conceptually like calling Rust's core::str::from_utf8_unchecked.

That makes sense. And at least we're passing Rust chars, so we're taking care of the newly clarified requirement for callers.

echeran avatar Jan 21 '23 02:01 echeran

@hsivonen Thoughts on the latest comments from myself and @echeran?

sffc avatar Feb 02 '23 18:02 sffc

Ping @hsivonen

sffc avatar Feb 16 '23 18:02 sffc

Sorry about missing that this was blocking on me. My preference is that we trust that scalar values that we receive from HarfBuzz are indeed scalar values and, therefore, can be interpreted as Rust char without validation.

hsivonen avatar Dec 01 '23 07:12 hsivonen

Ok, seems reasonable. In any case I'd prefer to fix this in the harfbuzz glue code in the upstream repo so that we avoid interacting with raw harfbuzz types in the icu repo. #3257

sffc avatar Dec 01 '23 21:12 sffc