icu4x Word segmentation is incorrect

Open robertbastian opened this issue 1 year ago • 2 comments

WB3c and WB4 interact in the same way LB8a and LB9 do. A correct implementation would require either duplicating every state, as in https://github.com/unicode-org/icu4x/pull/4389, or hoisting the two rules into the logic, as in https://github.com/unicode-org/icu4x/pull/5001.

The latter seems more attractive, both for data size and sanity of the maintainer; note that since rule_segmenter.rs is shared with extended grapheme cluster and sentence breaking, this will require passing a flag for that logic.
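To make the interaction concrete, here is a minimal sketch of the "hoisting" option. It is not the ICU4X code: the `Class` enum, `classify`, `word_breaks`, and the `hoist_wb3c` flag are all hypothetical names, the classifier knows only a handful of characters, and only WB1, WB2, WB3c, WB4, WB5, and WB999 are modelled. The point is that WB4 makes the rules see the character before a run of Extend/ZWJ, whereas WB3c needs to know whether the character immediately before was ZWJ, so the two pieces of state have to be tracked separately.

```rust
/// Toy word-break classes; UAX #29 defines many more.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Class {
    ALetter,
    Extend,
    Zwj,
    ExtPict,
    Other,
}

/// Hypothetical classifier covering only a few characters, for illustration.
fn classify(c: char) -> Class {
    match c {
        'a'..='z' | 'A'..='Z' => Class::ALetter,
        '\u{0300}'..='\u{036F}' => Class::Extend, // combining diacritical marks
        '\u{200D}' => Class::Zwj,                 // ZERO WIDTH JOINER
        '\u{1F6D1}' => Class::ExtPict,            // OCTAGONAL SIGN
        _ => Class::Other,
    }
}

/// Returns the byte offsets of word boundaries in `text`.
/// `hoist_wb3c` is an assumed flag: when false, WB3c is not applied,
/// which shows what WB4 alone does to a ZWJ sequence.
fn word_breaks(text: &str, hoist_wb3c: bool) -> Vec<usize> {
    let mut breaks = Vec::new();
    // Class the table rules see on the left, after WB4 absorbs Extend/ZWJ.
    let mut base: Option<Class> = None;
    // Whether the character immediately to the left is ZWJ (needed for WB3c).
    let mut after_zwj = false;

    for (i, c) in text.char_indices() {
        let cls = classify(c);
        match base {
            // WB1: break at the start of text.
            None => breaks.push(i),
            Some(left) => {
                let no_break = match cls {
                    // WB4: never break before Extend or ZWJ.
                    Class::Extend | Class::Zwj => true,
                    // WB3c: ZWJ x ExtPict, checked against the raw previous character.
                    Class::ExtPict if hoist_wb3c && after_zwj => true,
                    // WB5: ALetter x ALetter, checked against the absorbed base.
                    Class::ALetter => left == Class::ALetter,
                    // WB999: otherwise break.
                    _ => false,
                };
                if !no_break {
                    breaks.push(i);
                }
            }
        }
        // WB4: Extend and ZWJ leave the base unchanged, except at start of text.
        match cls {
            Class::Extend | Class::Zwj if base.is_some() => {}
            _ => base = Some(cls),
        }
        after_zwj = cls == Class::Zwj;
    }
    // WB2: break at the end of non-empty text.
    if !text.is_empty() {
        breaks.push(text.len());
    }
    breaks
}
```

With the flag set, `word_breaks("a\u{200D}\u{1F6D1}", true)` reports boundaries only at the start and end of the string; with it cleared, the ZWJ is absorbed into the preceding letter by WB4 and a boundary appears before the pictograph. A table-only implementation has to encode the "just saw ZWJ" bit by duplicating states instead.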

Jun 06 '24 12:06 robertbastian

@eggrobin What is left on this issue?

Sep 17 '24 17:09 sffc

> What is left on this issue?

All of it? It was created to allow us to close the specific issue reported in https://github.com/unicode-org/icu4x/issues/4417, but word segmentation is still wrong and hasn’t changed since this was filed.

Sep 17 '24 18:09 eggrobin

I thought this was fixed when I added random tests for all UAX #29 segmenters during the 2.0 development cycle.

Oct 23 '25 02:10 makotokato

Yes, I see https://github.com/unicode-org/icu4x/pull/6442 fixed this (by duplicating most of the states rather than hoisting the ZWJ handling into the code).

Oct 23 '25 14:10 eggrobin
