Support Unicode 15.1 for line segmenter
This change adds Unicode 15.1 support to the line segmenter (part of https://github.com/unicode-org/icu4x/issues/3255).
- Added LB15a handling to the engine code, since LB15a references previous breaking rules
- Added fallback handling to the intermediate status for LB28a
- Modified the tests in the same way as https://github.com/unicode-org/icu4x/pull/4389
r? @eggrobin for the algorithm
I addressed the comments I had made above, moving the LB15a-specific logic into the state table (it is not exactly pretty, we should probably invest in some extensions to the state table description language; but it is nowhere near as bad as the ZWJ-CM situation was). I also found a few bugs and addressed them, testing this with 200 000 monkeys (last bug found at 8208). I should probably try 2 000 000 as I did in https://github.com/unicode-org/icu4x/pull/4389.
> I should probably try 2 000 000 as I did in https://github.com/unicode-org/icu4x/pull/4389.

`fatal error: rustc does not support files larger than 4GB`
Maybe we should be using something other than include_str! here.
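One possible alternative, as a rough sketch only: read the test file at run time and stream it line by line, so the data never has to be embedded in the compiled source. The helper name `for_each_test_line` is hypothetical and not part of ICU4X. It assumes the usual Unicode test-file layout, where `#` starts a comment and blank lines carry no data.

```rust
use std::fs::File;
use std::io::{self, BufRead, BufReader};
use std::path::Path;

/// Hypothetical helper (not an ICU4X API): streams a line-oriented test file,
/// passing each data line to `check` and returning the number of data lines.
/// Assumes `#` starts a comment, as in the Unicode break test files.
fn for_each_test_line<F>(path: &Path, mut check: F) -> io::Result<usize>
where
    F: FnMut(&str),
{
    // BufReader keeps only a small buffer in memory, so the file size is
    // not limited by what the compiler can embed.
    let reader = BufReader::new(File::open(path)?);
    let mut count = 0;
    for line in reader.lines() {
        let line = line?;
        // Drop the trailing comment, then surrounding whitespace.
        let data = line.split('#').next().unwrap_or("").trim();
        if data.is_empty() {
            continue; // comment-only or blank line
        }
        check(data);
        count += 1;
    }
    Ok(count)
}
```

The trade-off is that the test then needs the file to be present when it runs, whereas `include_str!` bakes it in at compile time.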
Now tested with 2 000 000 random strings. Before ICU 76 (specifically, before https://github.com/unicode-org/icu/pull/3028) it doesn’t make sense to go further than that, since it turned out the PRNG was cycling around that length (see L2/24-162 §5.6).