icu4x
Make the normalizer work with new Unicode 16 normalization behaviors
Closes #4859
ICU4C PR coming up.
ICU4C PR: https://github.com/unicode-org/icu/pull/2994
This reverts https://github.com/unicode-org/icu4x/pull/4538 , which turned out to be a bad idea.
I've tested this with Unicode 16 data, but this PR doesn't include the new data.
- @hsivonen - PR #4860 adds compatibility with Unicode 16, but it potentially regresses performance on the C forms (NFC and NFKC, not NFD or NFKD) due to additional branches. This hasn't been benchmarked. PR #4878 should not have any performance impact.
- @echeran - I think we should keep Unicode 16 changes together.
- @sffc - What types of characters hit the branch?
- @hsivonen - It impacts the performance of the pass-through operation for NFC.
- @sffc - It seems not super urgent to land these since Unicode 16 isn't out yet, but we should do the work to test and benchmark them.
- @hsivonen - We would want to take a long string of CJ text and have that go fast.
Conclusion: spend the time to benchmark these changes, in conjunction with #4967. Do this before 2.0 because it might involve a data struct change.
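As a rough illustration of the benchmark discussed above (NFC pass-through over a long run of CJ text), here is a minimal std-only timing sketch. Everything in it is an assumption for illustration: `long_cj_text`, `passthrough_stand_in`, and `time_normalizer` are hypothetical names, the sample text is arbitrary, and the stand-in simply copies its input where the real normalizer call would go. A real measurement would plug in the actual normalizer and likely use the project's existing benchmark setup instead.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Builds a long string by repeating a short CJ sample.
/// The sample is arbitrary and already in NFC (hypothetical input).
fn long_cj_text(repeats: usize) -> String {
    "日本語の文章と中文文本".repeat(repeats)
}

/// Stand-in for the normalizer under test (NOT the real API).
/// Assumption: NFC of already-normalized CJ text equals the input,
/// so the stand-in just copies it. Replace with the real call to measure.
fn passthrough_stand_in(input: &str) -> String {
    input.to_owned()
}

/// Calls `normalize` on `input` `iterations` times and returns the total
/// elapsed wall-clock time. `black_box` keeps the optimizer from
/// discarding the work.
fn time_normalizer<F>(normalize: F, input: &str, iterations: u32) -> Duration
where
    F: Fn(&str) -> String,
{
    let start = Instant::now();
    for _ in 0..iterations {
        black_box(normalize(black_box(input)));
    }
    start.elapsed()
}
```

Calling `time_normalizer(passthrough_stand_in, &long_cj_text(100_000), 100)` from a `main` gives a baseline; running the same call with the normalizer before and after the change would show whether the extra branches are measurable on this kind of input.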
We should merge this and https://github.com/unicode-org/icu4x/pull/4878 now regardless of the outcome of https://github.com/unicode-org/icu4x/issues/4967 . If the outcome of that investigation shows that it makes sense to rearrange the bits, let's land that change on top of this one.