icu4x
Dictionary word segment error
There are three relevant entries in cjdict.txt:
意味 70
味觉 110
气味 99
For the string "意味觉", 味觉 (110) > 意味 (70), so I expect the segments "意", "味觉". The actual result is "意味", "觉", which is wrong.
For the string "气味觉", 味觉 (110) > 气味 (99) as well, so I expect the segments "气", "味觉", and here the actual result is correct.
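To make the expectation concrete, here is a minimal sketch of a segmenter that picks the split whose dictionary entries have the highest total value. It is a toy model only, not ICU4X's implementation: `toy_dict` and `best_segmentation` are hypothetical names, treating the cjdict.txt values as weights to maximize is an assumption, and breakpoints are reported as character positions.

```rust
use std::collections::HashMap;

/// Hypothetical toy dictionary holding only the three entries quoted above.
fn toy_dict() -> HashMap<&'static str, u32> {
    HashMap::from([("意味", 70), ("味觉", 110), ("气味", 99)])
}

/// Returns the breakpoints (character indices) of the segmentation whose
/// dictionary entries have the highest total weight. Characters not covered
/// by any entry become single-character segments with weight 0.
fn best_segmentation(text: &str, dict: &HashMap<&str, u32>) -> Vec<usize> {
    let chars: Vec<char> = text.chars().collect();
    let n = chars.len();
    // best[i] = (best total weight for chars[..i], start of the last segment)
    let mut best: Vec<(u32, usize)> = vec![(0, 0); n + 1];
    for end in 1..=n {
        // Fallback: treat the last character as an unknown one-character segment.
        best[end] = (best[end - 1].0, end - 1);
        for start in 0..end {
            let word: String = chars[start..end].iter().collect();
            if let Some(&weight) = dict.get(word.as_str()) {
                let total = best[start].0 + weight;
                if total > best[end].0 {
                    best[end] = (total, start);
                }
            }
        }
    }
    // Walk back through the recorded segment starts to collect breakpoints.
    let mut breaks = vec![n];
    let mut pos = n;
    while pos > 0 {
        pos = best[pos].1;
        breaks.push(pos);
    }
    breaks.reverse();
    breaks
}
```

Under this model both "意味觉" and "气味觉" split before 味觉, which is the behavior I expected from the dictionary values.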
Code to reproduce:
use icu::segmenter::{options::WordBreakInvariantOptions, WordSegmenter};
let segmenter =
WordSegmenter::new_auto(WordBreakInvariantOptions::default());
let breakpoints: Vec<usize> =
segmenter.segment_str("意味觉").collect();
assert_eq!(&breakpoints, &[0,1,3]);
// Fails: the actual value is [0,2,3].
let breakpoints2: Vec<usize> =
segmenter.segment_str("气味觉").collect();
assert_eq!(&breakpoints2, &[0,1,3]);
// Succeeds.
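A note on the numbers above: as far as I understand, `segment_str` reports UTF-8 byte offsets, and each of these characters is 3 bytes long, so the values in this report read like character positions rather than raw output. That is an assumption about how the numbers were written down. A small helper (hypothetical name `byte_offsets_to_char_indices`, standard library only) converts between the two:

```rust
/// Converts UTF-8 byte offsets into character indices for `text`.
/// Panics if an offset does not fall on a character boundary.
fn byte_offsets_to_char_indices(text: &str, byte_offsets: &[usize]) -> Vec<usize> {
    byte_offsets
        .iter()
        // Count the characters that precede each byte offset.
        .map(|&offset| text[..offset].chars().count())
        .collect()
}
```

With this, byte offsets 0, 3, 9 on "意味觉" correspond to the expected character positions 0, 1, 3, and 0, 6, 9 correspond to the reported 0, 2, 3.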
@aethanyc @makotokato
Not sure if the dictionary impl is expected to handle such overlaps, but I don't know much about the implementation, so deferring to Ting-Yu and Makoto.