Dictionary word segment error

Open Delsart opened this issue 3 months ago • 2 comments

There are three entries in cjdict.txt:

意 70
觉 110
气 99

When segmenting a string like "意味觉", since 觉 (110) > 意 (70), we expect the segments "意", "味觉". The actual result is "意味", "觉", which is wrong.

But when segmenting a string like "气味觉", where likewise 觉 (110) > 气 (99), we expect the segments "气", "味觉", and the actual result is correct.
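To make the expectation concrete, here is a minimal sketch of the comparison described above. It encodes only the rule as stated in this report (compare the value of the first character with the value of the last one) and is not a description of how ICU4X's dictionary segmenter actually chooses breaks. The function name `expected_inner_break` and the fallback branch are hypothetical.

```rust
use std::collections::HashMap;

/// Character index of the expected inner break in a three-character string,
/// following the rule stated in this report: compare the dictionary value of
/// the first character with the value of the last character.
fn expected_inner_break(values: &HashMap<char, u32>, text: &str) -> Option<usize> {
    let chars: Vec<char> = text.chars().collect();
    if chars.len() != 3 {
        return None;
    }
    let first = *values.get(&chars[0])?;
    let last = *values.get(&chars[2])?;
    // Reported rule: last > first means the first character stands alone.
    // Assumption (not stated in the report): otherwise break after the
    // second character.
    Some(if last > first { 1 } else { 2 })
}

fn main() {
    // Values as quoted in the report.
    let values = HashMap::from([('意', 70), ('觉', 110), ('气', 99)]);
    for text in ["意味觉", "气味觉"] {
        println!("{}: {:?}", text, expected_inner_break(&values, text));
    }
}
```

Under this rule both strings are expected to break after the first character, which is why the differing results for the two inputs look inconsistent.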

code:

use icu::segmenter::{options::WordBreakInvariantOptions, WordSegmenter};
let segmenter =
    WordSegmenter::new_auto(WordBreakInvariantOptions::default());

let breakpoints: Vec<usize> =
    segmenter.segment_str("意味觉").collect();
assert_eq!(&breakpoints, &[0,1,3]); 
// Fails: the actual value is [0, 2, 3].

let breakpoints2: Vec<usize> =
    segmenter.segment_str("气味觉").collect();
assert_eq!(&breakpoints2, &[0,1,3]);
// Passes.
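A note for anyone reproducing this: the indices in the asserts read as character positions. To my knowledge `segment_str` yields UTF-8 byte offsets, and each of these characters is three bytes long, so a run against the strings above would likely report multiples of three. The sketch below converts byte offsets to character indices so the two can be compared. The helper name `byte_to_char_offsets` is hypothetical, and the byte offsets in `main` are illustrative input, not captured segmenter output.

```rust
/// Convert UTF-8 byte offsets into character indices for the same string.
/// Panics if an offset does not fall on a character boundary.
fn byte_to_char_offsets(text: &str, byte_offsets: &[usize]) -> Vec<usize> {
    byte_offsets
        .iter()
        .map(|&b| text[..b].chars().count())
        .collect()
}

fn main() {
    let text = "意味觉";
    // Illustrative byte offsets: a break after the second character.
    let char_offsets = byte_to_char_offsets(text, &[0, 6, 9]);
    println!("{:?}", char_offsets);
}
```

Counting characters in each prefix is quadratic in the worst case, which is fine for short test strings like these.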

Delsart, Oct 13 '25 09:10

@aethanyc @makotokato

sffc, Oct 13 '25 22:10

Not sure if the dictionary impl is expected to handle such overlaps, but I don't know much about the implementation, so deferring to Ting-Yu and Makoto.

Manishearth, Oct 13 '25 22:10