Han/Katakana/Hiragana property for word segmenter's datagen

Open makotokato opened this issue 1 year ago • 5 comments

From https://github.com/unicode-org/icu4x/pull/2209

In the word segmenter's datagen, we assign a special property to East Asian languages so that they are segmented with the LSTM or the dictionary. To improve CJ support, we have to assign either the same property as EA or a special property.

Also, UAX29 actually has rules for Katakana, but we might have to use the dictionary for Katakana instead of the UAX29 rule.

makotokato avatar Jul 25 '22 02:07 makotokato

@makotokato Is this an enhancement that can be done after 1.0, or does it affect the schema of the data?

sffc avatar Jul 30 '22 03:07 sffc

> @makotokato Is this an enhancement that can be done after 1.0, or does it affect the schema of the data?

Yes, this is an enhancement issue. Han/Kanji and Hiragana are already handled by the dictionary. But UAX29's word segmentation spec has the rule Katakana × Katakana. If we use the dictionary for Katakana, we have to modify the spec or add some notes to UAX29.
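
To illustrate what Katakana × Katakana means in practice, here is a minimal Rust sketch. It is not ICU4X code: `is_katakana` and `segment` are hypothetical helpers, the Katakana check only covers the main Katakana block (the real Word_Break=Katakana property includes more code points), and every other UAX29 rule is ignored so that a break is placed at every other boundary.

```rust
/// Rough stand-in for Word_Break=Katakana (assumption: main Katakana block only,
/// excluding U+30FB). The real property covers additional code points.
fn is_katakana(c: char) -> bool {
    matches!(c, '\u{30A0}'..='\u{30FA}' | '\u{30FC}'..='\u{30FF}')
}

/// Breaks `text` at every character boundary except between two Katakana
/// characters (Katakana × Katakana). All other UAX29 rules are left out.
fn segment(text: &str) -> Vec<&str> {
    let mut out = Vec::new();
    let mut start = 0;
    let mut prev: Option<char> = None;
    for (i, c) in text.char_indices() {
        if let Some(p) = prev {
            // "×" means "do not break here"; anything else breaks in this sketch.
            if !(is_katakana(p) && is_katakana(c)) {
                out.push(&text[start..i]);
                start = i;
            }
        }
        prev = Some(c);
    }
    if start < text.len() {
        out.push(&text[start..]);
    }
    out
}

fn main() {
    // Two loanwords written back to back come out as one segment.
    println!("{:?}", segment("ホテルルーム"));
}
```

With the rule alone, a run such as ホテルルーム is never split, which is why a dictionary lookup for Katakana would give different results from the spec.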

makotokato avatar Aug 01 '22 17:08 makotokato

Han and Hiragana are done by https://github.com/unicode-org/icu4x/commit/7215608a0984a33439e43835de17a729a521bd51

makotokato avatar Aug 01 '22 17:08 makotokato

My understanding is that this is fully in the datagen crate (change the outputted rule tables, not the code that reads from the rule tables). This is a good 1.x issue.

sffc avatar Aug 11 '22 18:08 sffc

Good first issue for someone interested in coming up to speed on rule-based segmentation.

CC @younies

sffc avatar Aug 11 '22 18:08 sffc