Make the normalizer work with new Unicode 16 normalization behaviors

Open hsivonen opened this issue 9 months ago • 5 comments

Closes #4859

ICU4C PR coming up.

hsivonen avatar May 02 '24 15:05 hsivonen

ICU4C PR: https://github.com/unicode-org/icu/pull/2994

hsivonen avatar May 02 '24 15:05 hsivonen

This reverts https://github.com/unicode-org/icu4x/pull/4538 , which turned out to be a bad idea.

hsivonen avatar May 02 '24 15:05 hsivonen

I've tested this with Unicode 16 data, but this PR doesn't include the new data.

hsivonen avatar May 02 '24 15:05 hsivonen

  • @hsivonen - PR #4860 adds compatibility with Unicode 16, but the additional branches could regress performance on the C forms (NFC and NFKC, not NFD or NFKD). This hasn't been benchmarked. PR #4878 should not have any performance impact.
  • @echeran - I think we should keep Unicode 16 changes together.
  • @sffc - What types of characters hit the branch?
  • @hsivonen - It impacts the performance of the pass-through operation for NFC.
  • @sffc - It doesn't seem super urgent to land these since Unicode 16 isn't out yet, but we should do the work to test and benchmark them.
  • @hsivonen - We would want a long string of CJ text and want that to go fast.

Conclusion: spend the time to benchmark these changes, in conjunction with #4967. Do this before 2.0 because it might involve a data struct change.

sffc avatar May 30 '24 18:05 sffc

We should merge this and https://github.com/unicode-org/icu4x/pull/4878 now regardless of the outcome of https://github.com/unicode-org/icu4x/issues/4967 . If the outcome of that investigation shows that it makes sense to rearrange the bits, let's land that change on top of this one.

hsivonen avatar Sep 18 '24 08:09 hsivonen