sacremoses
Tokenization for Hindi (e.g. `क्या`) is weird
>>> from sacremoses import MosesTokenizer
>>> mt = MosesTokenizer()
>>> mt.tokenize('क्या')
['क', '्', 'या']
The same is true for Chinese and Korean as well: sacremoses splits everything into individual characters.
Here's some Chinese:
>>> mt = MosesTokenizer(lang='zh')
>>> mt.tokenize("记者 应谦 美国")
['记', '者', '应', '谦', '美', '国']
And some Korean:
>>> mt = MosesTokenizer(lang='ko')
>>> mt.tokenize("세계 에서 가장 강력한")
['세', '계', '에', '서', '가', '장', '강', '력', '한']
This is a shame, as I'd really like to use sacremoses as the tokenizer with LASER instead of using subprocess and temp files to call the Moses Perl scripts.
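For reference, the subprocess-and-temp-file workaround looks roughly like this (a minimal sketch only; the tokenizer.perl path and the moses_tokenize helper are illustrative, not part of LASER or sacremoses):

# Rough sketch of the subprocess + temp file workaround (illustrative only;
# assumes a local mosesdecoder checkout and UTF-8 input).
import subprocess
import tempfile
from pathlib import Path

TOKENIZER = Path.home() / "mosesdecoder/scripts/tokenizer/tokenizer.perl"

def moses_tokenize(lines, lang="zh"):
    # Write the input to a temp file, pipe it through tokenizer.perl,
    # and split the tokenized output back into lists of tokens.
    with tempfile.NamedTemporaryFile("w", encoding="utf-8",
                                     suffix=".txt", delete=False) as tmp:
        tmp.write("\n".join(lines) + "\n")
        tmp_path = Path(tmp.name)
    try:
        with open(tmp_path, encoding="utf-8") as fin:
            result = subprocess.run(
                ["perl", str(TOKENIZER), "-l", lang],
                stdin=fin, capture_output=True, encoding="utf-8", check=True,
            )
    finally:
        tmp_path.unlink()
    return [line.split() for line in result.stdout.splitlines()]

print(moses_tokenize(["记者 应谦 美国"], lang="zh"))
# [['记者', '应谦', '美国']]  (the version banner goes to stderr)

That's a lot of machinery compared to just calling mt.tokenize(...).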
Expected behavior for zh and ko:
$ echo "记者 应谦 美国" | ~/mosesdecoder/scripts/tokenizer/tokenizer.perl -l zh
Tokenizer Version 1.1
Language: zh
Number of threads: 1
记者 应谦 美国
$ echo ""세계 에서 가장 강력한"" | ~/mosesdecoder/scripts/tokenizer/tokenizer.perl -l ko
Tokenizer Version 1.1
Language: ko
Number of threads: 1
WARNING: No known abbreviations for language 'ko', attempting fall-back to English version...
세계 에서 가장 강력한
Looks like the unichars list and the perluniprops list of Alphanumeric characters are a little different.
The issue comes from https://github.com/alvations/sacremoses/blob/master/sacremoses/tokenize.py#L420 where the non-alphanumeric characters are padded with spaces.
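To illustrate the mechanism (a simplified sketch, not the actual sacremoses code): if the alphanumeric class used for that padding step is missing the CJK block, every CJK character gets wrapped in spaces and the later whitespace split tears the words apart:

import re

# Hypothetical, deliberately ASCII-only "alphanumeric" class, standing in
# for a perluniprops-derived list that is missing the CJK codepoints.
pad_nonalnum = re.compile(r"([^a-zA-Z0-9\s.'`,\-])")

def naive_pad(text):
    # Same shape as the Moses rule s/([^\p{IsAlnum}\s\.\'\`\,\-])/ $1 /g
    return pad_nonalnum.sub(r" \1 ", text)

print(naive_pad("记者 应谦 美国").split())
# ['记', '者', '应', '谦', '美', '国']  <- every CJK char padded, then split

With the full Unicode \p{IsAlnum} set, the CJK characters stay inside the class and are left alone.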
It looks like the Perl \p{IsAlnum} includes the CJK characters:
$ echo "记者 应谦 美国" | sed "s/([^\p{IsAlnum}\s\.\'\`\,\-])/ $1 /g"
记者 应谦 美国
But when we check unichars, it's missing:
$ unichars '\p{Alnum}' | cut -f2 -d' ' | grep "记"
Using the unichars -au option works:
$ unichars -au '\p{Alnum}' | cut -f2 -d' ' | grep "记"
记
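As a sanity check from the Python side (standard library only, not the Perl property tables), CJK ideographs are letters (Unicode category Lo), so they should indeed count as alphanumeric:

import unicodedata

# CJK ideographs are category Lo (Letter, other), so str.isalnum()
# treats them as alphanumeric, consistent with the unichars -au output.
for ch in "记者应谦美国":
    print(ch, unicodedata.category(ch), ch.isalnum())
# e.g. 记 Lo True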
Note: see https://webcache.googleusercontent.com/search?q=cache:bmLqeEnWJa0J:https://codeday.me/en/qa/20190306/8531.html+&cd=6&hl=en&ct=clnk&gl=sg
@johnfarina Thanks for spotting that! The latest PR #60 should resolve the CJK issues.
The Hindi one is a little more complicated, so leaving this issue open.
pip install -U "sacremoses>=0.0.22"
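After upgrading, the CJK examples above should come back intact (matching the Perl tokenizer's expected output quoted earlier):

>>> from sacremoses import MosesTokenizer
>>> mt = MosesTokenizer(lang='zh')
>>> mt.tokenize("记者 应谦 美国")
['记者', '应谦', '美国']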
Oh wow, comment on a github issue, go to bed, wake up, bug is fixed! Thanks so much @alvations !!
@alvations any update on the Hindi tokenization issue?