Improve tokenization for CJK languages
Japanese example attached. The sentence 下記方法で体内への侵入を防止すること (roughly, "prevent it from entering the body by the method described below")
from here should be tokenized somewhat like the following, with a single pipe character marking a word boundary and a double pipe marking a lexeme boundary more generally:
下記方法|で||体内||へ||の||侵入||を||防止|する||こと
Similar issues arise with Korean and Chinese texts, so I am keeping them together in this issue for now.
I think Korean might be able to get away with simple whitespace splitting, given how modern texts are written with word spaces. I guess that would be different for more casual text? See the sketch below.
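Roughly what I mean, as an untested sketch (the example sentence is mine, not from Ordia):

```python
# Untested sketch: modern Korean prose uses word spaces, so naive
# whitespace splitting already yields the space-delimited units (eojeol).
sentence = "나는 어제 도서관에서 책을 읽었다"
tokens = sentence.split()
print(tokens)
# Note: each unit still bundles a stem with its particle/ending
# (e.g. 도서관 + 에서), so this is only a rough first pass.
```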
For Chinese tokenization I usually recommend jieba. It's... good. I can't even think of another tokenizer off the top of my head. And the dictionaries are not big -- the HMM magic takes care of unknown words.
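A minimal sketch of what that could look like (the sentence is the example from jieba's own README; nothing here is wired into Ordia):

```python
# Untested sketch of dictionary + HMM segmentation with jieba
# (pip install jieba).
import jieba

sentence = "我来到北京清华大学"
tokens = jieba.lcut(sentence)  # default "accurate" mode, HMM enabled
print("|".join(tokens))        # 我|来到|北京|清华大学
```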
I think there are also Japanese tokenizers in Python (truth be told, everything data-splitty-chunky is written in Python these days), but as I don't speak Japanese I have no idea what to use. The first Google result is something called fugashi. The author looks very serious, but the dictionary is big.
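If fugashi turns out to be the right tool, using it might look roughly like this (untested sketch; it assumes the unidic-lite dictionary is installed and makes no distinction between the | and || boundaries above):

```python
# Untested sketch using fugashi with the unidic-lite dictionary
# (pip install 'fugashi[unidic-lite]').
from fugashi import Tagger

tagger = Tagger()
sentence = "下記方法で体内への侵入を防止すること"
# Each parsed node carries a .surface form; joining with single pipes
# only approximates the boundary notation from the issue description.
tokens = [word.surface for word in tagger(sentence)]
print("|".join(tokens))
```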
I am unsure how to do CJK tokenization. As I understand it, there is no easy way such as splitting on a single character, so one would have to use a tool with a dictionary, like jieba.