icu4x Make the normalizer work with new Unicode 16 normalization behaviors

Open hsivonen opened this issue 9 months ago • 5 comments

Closes #4859

ICU4C PR coming up.

May 02 '24 15:05 hsivonen

ICU4C PR: https://github.com/unicode-org/icu/pull/2994

May 02 '24 15:05 hsivonen

This reverts https://github.com/unicode-org/icu4x/pull/4538 , which turned out to be a bad idea.

May 02 '24 15:05 hsivonen

I've tested this with Unicode 16 data, but this PR doesn't include the new data.

May 02 '24 15:05 hsivonen

@hsivonen - PR #4860 adds compatibility with Unicode 16. It potentially regresses performance on the C forms (NFC and NFKC, not NFD or NFKD) due to additional branches, though this hasn't been benchmarked. PR #4878 should not have any performance impact.
@echeran - I think we should keep Unicode 16 changes together.
@sffc - What types of characters hit the branch?
@hsivonen - It impacts the performance of the pass-through operation for NFC.
@sffc - It seems not super urgent to land these since Unicode 16 isn't out yet, but we should do the work to test and benchmark them.
@hsivonen - We would want a long string of CJ text and want that to go fast.

Conclusion: spend the time to benchmark these changes, in conjunction with #4967. Do this before 2.0 because it might involve a data struct change.
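
For reference, a minimal sketch of the kind of measurement discussed above: time repeated passes over a long CJ string that is already in NFC, so that only the pass-through path is exercised. This is not the benchmark used in the repository. The names `long_cj_text`, `time_normalize`, and `placeholder_nfc` are hypothetical, and `placeholder_nfc` is a stand-in that only copies its input; the normalizer under test would be passed in its place.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Builds a long input by repeating a short Chinese + Japanese sample
/// until it contains at least `min_chars` characters.
/// (Hypothetical helper; the sample text is an arbitrary choice.)
fn long_cj_text(min_chars: usize) -> String {
    let sample = "統一碼正規化これは日本語の文章";
    let sample_chars = sample.chars().count();
    // Integer division rounds down, so add one repeat to reach the minimum.
    let repeats = min_chars / sample_chars + 1;
    sample.repeat(repeats)
}

/// Calls `normalize` on `input` `iterations` times and returns the total
/// elapsed wall-clock time. `black_box` keeps the compiler from
/// optimizing the calls away.
fn time_normalize<F>(input: &str, iterations: u32, mut normalize: F) -> Duration
where
    F: FnMut(&str) -> String,
{
    let start = Instant::now();
    for _ in 0..iterations {
        black_box(normalize(black_box(input)));
    }
    start.elapsed()
}

/// ASSUMPTION: placeholder for the NFC call being measured. It only
/// copies the input and performs no normalization at all.
fn placeholder_nfc(input: &str) -> String {
    input.to_owned()
}
```

Calling `time_normalize(&long_cj_text(1_000_000), 100, placeholder_nfc)` on builds with and without the extra branches, with the real normalizer substituted for the placeholder, would give a rough before/after comparison. A statistical harness would be more reliable than raw wall-clock timing.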

May 30 '24 18:05 sffc

We should merge this and https://github.com/unicode-org/icu4x/pull/4878 now regardless of the outcome of https://github.com/unicode-org/icu4x/issues/4967 . If the outcome of that investigation shows that it makes sense to rearrange the bits, let's land that change on top of this one.

Sep 18 '24 08:09 hsivonen
