icu4x
Word segmentation is incorrect
WB3c and WB4 interact in the same way LB8a and LB9 do. Implementing that interaction correctly would require either duplicating every state as in https://github.com/unicode-org/icu4x/pull/4389, or hoisting the two rules into the logic as in https://github.com/unicode-org/icu4x/pull/5001.
The latter seems more attractive, both for data size and sanity of the maintainer; note that since rule_segmenter.rs is shared with extended grapheme cluster and sentence breaking, this will require passing a flag for that logic.
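To make the interaction concrete, here is a minimal sketch of the hoisting idea in Rust. It is not the code in rule_segmenter.rs: the `Wb` enum, `classify`, `table_no_break`, `word_breaks`, and the `hoist_wb3c` flag are all hypothetical names, the character classes are rough approximations of the real Unicode properties, and the "table" contains only WB5. The point is that WB4 makes the table see the class of the character before the ZWJ, so WB3c can only apply if something outside the table remembers that the previous character was a ZWJ.

```rust
// Toy word-break classes (hypothetical; real data comes from Unicode properties).
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum Wb {
    ALetter,
    Extend,
    Zwj,
    ExtPict,
    Other,
}

// Rough approximation of the properties, for illustration only.
fn classify(c: char) -> Wb {
    match c {
        '\u{200D}' => Wb::Zwj,
        '\u{0300}'..='\u{036F}' | '\u{FE0F}' => Wb::Extend,
        'a'..='z' | 'A'..='Z' => Wb::ALetter,
        '\u{1F300}'..='\u{1FAFF}' => Wb::ExtPict,
        _ => Wb::Other,
    }
}

// Stand-in for the state table: only WB5 (ALetter x ALetter) is modelled.
fn table_no_break(left: Wb, right: Wb) -> bool {
    matches!((left, right), (Wb::ALetter, Wb::ALetter))
}

// Returns the byte offsets of the interior word breaks of `s`.
// `hoist_wb3c` plays the role of the flag mentioned above: when false,
// WB3c is not applied and only WB4 plus the table decide.
fn word_breaks(s: &str, hoist_wb3c: bool) -> Vec<usize> {
    let mut breaks = Vec::new();
    let mut left: Option<Wb> = None; // effective left class after WB4
    let mut prev_was_zwj = false; // raw previous character, needed by WB3c
    for (i, c) in s.char_indices() {
        let right = classify(c);
        if let Some(l) = left {
            // WB3c: ZWJ x Extended_Pictographic, checked on the raw previous char.
            let wb3c = hoist_wb3c && prev_was_zwj && right == Wb::ExtPict;
            // WB4: X (Extend | ZWJ)* is treated as X, so never break before them.
            let wb4 = matches!(right, Wb::Extend | Wb::Zwj);
            if !(wb3c || wb4 || table_no_break(l, right)) {
                breaks.push(i);
            }
            if !wb4 {
                left = Some(right);
            }
        } else {
            left = Some(right);
        }
        prev_was_zwj = right == Wb::Zwj;
    }
    breaks
}
```

With `hoist_wb3c` set, `word_breaks("a\u{200D}\u{1F600}", true)` reports no break, whereas with the flag off the table only sees ALetter followed by Extended_Pictographic and breaks before the emoji. The state-duplication alternative would instead encode "previous character was ZWJ" in a second copy of each state.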
@eggrobin What is left on this issue?
> What is left on this issue?
All of it? It was created to allow us to close the specific issue reported in https://github.com/unicode-org/icu4x/issues/4417, but word segmentation is still wrong and hasn’t changed since this was filed.
I thought this was fixed when I added random tests for all of the UAX #29 segmenters during the 2.0 development cycle.
Yes, I see https://github.com/unicode-org/icu4x/pull/6442 fixed this (by duplicating most of the states rather than hoisting the ZWJ handling into the code).