
Support Unicode 15.1 for line segmenter

Open makotokato opened this issue 1 year ago • 1 comment

This fix adds Unicode 15.1 support to the line segmenter (part of https://github.com/unicode-org/icu4x/issues/3255).

  • Added LB15a handling to the engine code, since LB15a references previous breaking rules.
  • Added fallback handling to the intermediate state for LB28a.
  • Modified the tests, as in https://github.com/unicode-org/icu4x/pull/4389.

makotokato avatar Jul 10 '24 11:07 makotokato

r? @eggrobin for the algorithm

Manishearth avatar Jul 11 '24 17:07 Manishearth

I addressed the comments I had made above, moving the LB15a-specific logic into the state table (it is not exactly pretty, and we should probably invest in some extensions to the state table description language, but it is nowhere near as bad as the ZWJ-CM situation was). I also found a few bugs and addressed them, testing this with 200 000 monkeys (last bug found at 8208). I should probably try 2 000 000 as I did in https://github.com/unicode-org/icu4x/pull/4389.

eggrobin avatar Sep 04 '24 15:09 eggrobin

I should probably try 2 000 000 as I did in https://github.com/unicode-org/icu4x/pull/4389.

fatal error: rustc does not support files larger than 4GB

Maybe we should be using something other than include_str! here.
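For illustration, one possible alternative is to stream the test file at run time instead of embedding it at compile time with include_str!. This is only a minimal sketch: the function name for_each_test_line, the sample file name, and the rule that lines starting with # are comments are all assumptions for the example, not the actual ICU4X test harness.

```rust
use std::fs::File;
use std::io::{self, BufRead, BufReader};
use std::path::Path;

/// Hypothetical helper: streams a test file line by line and calls `check`
/// with the 1-based line number and the trimmed text of every line that is
/// neither empty nor a `#` comment. Returns the number of lines checked.
/// Only one line is held in memory at a time, so file size is not limited
/// by what the compiler can embed.
fn for_each_test_line<P: AsRef<Path>>(
    path: P,
    mut check: impl FnMut(usize, &str),
) -> io::Result<usize> {
    let reader = BufReader::new(File::open(path)?);
    let mut checked = 0;
    for (index, line) in reader.lines().enumerate() {
        let line = line?;
        let line = line.trim();
        if line.is_empty() || line.starts_with('#') {
            continue;
        }
        check(index + 1, line);
        checked += 1;
    }
    Ok(checked)
}

fn main() -> io::Result<()> {
    // Write a tiny sample file so the sketch is runnable on its own.
    // The file name and contents are made up for this example.
    let path = std::env::temp_dir().join("line_break_sample.txt");
    std::fs::write(&path, "# sample data\n× 0041 ÷\n\n× 0042 ÷\n")?;

    let checked = for_each_test_line(&path, |number, text| {
        println!("line {number}: {text}");
    })?;
    println!("checked {checked} test lines");
    Ok(())
}
```

The trade-off is that the test then depends on the data file being present at run time, whereas include_str! bakes it into the test binary.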

eggrobin avatar Sep 04 '24 16:09 eggrobin

Now tested with 2 000 000 random strings. Before ICU 76 (specifically, before https://github.com/unicode-org/icu/pull/3028) it doesn’t make sense to go further than that, since it turned out the PRNG was cycling around that length (see L2/24-162 §5.6).

eggrobin avatar Sep 04 '24 18:09 eggrobin