
Improve tokenization for CJK languages

Open Daniel-Mietchen opened this issue 4 years ago • 2 comments

A Japanese example is attached: the sentence 下記方法で体内への侵入を防止すること from here should be tokenized somewhat like the following, with a single pipe character standing for a word boundary and a double pipe for lexeme boundaries more generally:

下記方法|で||体内||へ||の||侵入||を||防止|する||こと

There are similar issues with Korean and Chinese texts, so I am keeping them together in this issue for now.

(Screenshot attached, 2020-04-08: https://tools.wmflabs.org)

Daniel-Mietchen avatar Apr 08 '20 14:04 Daniel-Mietchen

I think Korean might be able to get away with simple whitespace splitting, given how modern texts are written with word spaces. I guess that would be different for more casual text?
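
A minimal sketch of that idea, just splitting on whitespace; the example sentence is my own, and this obviously ignores particles and unspaced casual text:

```python
# Naive Korean "tokenization" by splitting on word spaces (eojeol boundaries).
# This only approximates word boundaries; it does not separate particles
# from stems and fails on text written without spaces.
text = "한국어 문장은 띄어쓰기를 사용합니다"  # hypothetical example sentence
tokens = text.split()
print("|".join(tokens))  # 한국어|문장은|띄어쓰기를|사용합니다
```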

For Chinese tokenization I usually recommend jieba. It's... good. I can't even think of another tokenizer off the top of my head. And the dictionaries are not big -- the HMM magic takes care of unknown words.
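
A minimal sketch of what that looks like; the example sentence is my own, not from the issue:

```python
# Dictionary-based Chinese segmentation with jieba; unknown words are handled
# by the built-in HMM model (enabled by default).
import jieba

text = "我们正在讨论中文分词"  # hypothetical example sentence
tokens = jieba.lcut(text)  # returns a list of segmented words
print("|".join(tokens))
```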

I think there are also Japanese tokenizers in Python (truth be told, everything data-splitty-chunky is written in Python these days), but as I don't speak Japanese I have no idea what to use. The first Google result is something called fugashi. The author looks very serious, but the dictionary is big.
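
A rough sketch of how fugashi might be used on the sentence from the issue; I have not checked whether the resulting boundaries match the ones requested above, and it assumes a UniDic dictionary is installed (e.g. the unidic-lite package):

```python
# Japanese segmentation with fugashi, a wrapper around MeCab.
from fugashi import Tagger

tagger = Tagger()  # picks up the installed UniDic dictionary
text = "下記方法で体内への侵入を防止すること"
tokens = [word.surface for word in tagger(text)]  # surface form of each morpheme
print("|".join(tokens))
```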

Artoria2e5 avatar Sep 15 '21 17:09 Artoria2e5

I am unsure how to do CJK tokenization. As I understand it, there is no easy way such as splitting on a single character, so one should use a tool with a dictionary, like jieba.

fnielsen avatar Sep 17 '21 13:09 fnielsen