
Tokenization for Hindi (e.g. `क्या`) is weird

Open alvations opened this issue 5 years ago • 6 comments

>>> from sacremoses import MosesTokenizer
>>> mt = MosesTokenizer()
>>> mt.tokenize('क्या')
['क', '्', 'या']
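For anyone trying to reproduce this, a quick way to see what the string is made of is to list its code points with the standard library. This is only a diagnostic sketch, not what sacremoses does internally, and `describe` is a hypothetical helper name. It shows that the second code point is a combining mark (the virama) rather than a letter, which lines up with where the split happens.

```python
import unicodedata


def describe(text):
    """Return (char, code point, general category, Unicode name) per code point."""
    return [
        (ch, "U+%04X" % ord(ch), unicodedata.category(ch), unicodedata.name(ch))
        for ch in text
    ]


# The word from the report: KA, VIRAMA, YA, VOWEL SIGN AA
for row in describe("क्या"):
    print(row)
```

The categories come out as `Lo`, `Mn`, `Lo`, `Mc`, so a tokenizer that only treats letters and digits as word characters has a reason to isolate the virama.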

alvations avatar Mar 28 '19 08:03 alvations

The same is true for Chinese and Korean: sacremoses splits every character.

Here's some Chinese:

>>> mt = MosesTokenizer(lang='zh')
>>> mt.tokenize("记者 应谦 美国")
['记', '者', '应', '谦', '美', '国']

And some Korean:

>>> mt = MosesTokenizer(lang='ko')
>>> mt.tokenize("세계 에서 가장 강력한")
['세', '계', '에', '서', '가', '장', '강', '력', '한']

Which is a shame, as I'd really like to use sacremoses as the tokenizer with LASER instead of using subprocess and temp files to call the moses perl scripts.

johnfarina avatar Jul 16 '19 05:07 johnfarina

Expected behavior for zh and ko:

$ echo "记者 应谦 美国"  | ~/mosesdecoder/scripts/tokenizer/tokenizer.perl -l zh 
Tokenizer Version 1.1
Language: zh
Number of threads: 1
记者 应谦 美国

$ echo "세계 에서 가장 강력한" | ~/mosesdecoder/scripts/tokenizer/tokenizer.perl -l ko
Tokenizer Version 1.1
Language: ko
Number of threads: 1
WARNING: No known abbreviations for language 'ko', attempting fall-back to English version...
세계 에서 가장 강력한

alvations avatar Jul 16 '19 06:07 alvations

Looks like the unichars list and the perluniprops list of Alphanumeric characters are a little different.

The issue comes from https://github.com/alvations/sacremoses/blob/master/sacremoses/tokenize.py#L420 where the non-alphanumeric characters are padded with spaces.

It looks like \p{IsAlnum} includes the CJK characters:

$ echo "记者 应谦 美国" | sed "s/([^\p{IsAlnum}\s\.\'\`\,\-])/ $1 /g"
记者 应谦 美国

But when we check unichars, it's missing:

$ unichars '\p{Alnum}' | cut -f2 -d' ' | grep "记"

Using the unichars -au option works:

$ unichars -au '\p{Alnum}' | cut -f2 -d' ' | grep "记"
记

Note: see https://webcache.googleusercontent.com/search?q=cache:bmLqeEnWJa0J:https://codeday.me/en/qa/20190306/8531.html+&cd=6&hl=en&ct=clnk&gl=sg
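To make the failure mode concrete, here is a minimal sketch of the padding step with a pluggable "is alphanumeric" test. It is an illustration under assumptions, not the code in `tokenize.py`: `pad_non_alnum` and `ascii_only_alnum` are hypothetical names, and Python's `str.isalnum` stands in for a complete alphanumeric list. When the list is missing the CJK characters, each one is treated as a symbol and padded with spaces.

```python
def pad_non_alnum(text, is_alnum):
    """Pad characters that are neither alphanumeric, whitespace, nor exempt."""
    exempt = set(".'`,-")  # symbols the Moses rule leaves alone at this step
    pieces = []
    for ch in text:
        if is_alnum(ch) or ch.isspace() or ch in exempt:
            pieces.append(ch)
        else:
            pieces.append(" " + ch + " ")
    return "".join(pieces).split()


def ascii_only_alnum(ch):
    """Stand-in for an incomplete alphanumeric list (ASCII letters and digits only)."""
    return ord(ch) < 128 and ch.isalnum()


text = "记者 应谦 美国"
print(pad_non_alnum(text, ascii_only_alnum))  # every CJK character is split off
print(pad_non_alnum(text, str.isalnum))       # words stay intact
```

With the incomplete test the result matches the buggy output reported above; with a test that covers CJK the three words survive.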

alvations avatar Jul 16 '19 06:07 alvations

@johnfarina Thanks for spotting that! The latest PR, #60, should resolve the CJK issues.

The Hindi one is a little more complicated, so I'm leaving this issue open.

pip install -U "sacremoses>=0.0.22"

alvations avatar Jul 16 '19 07:07 alvations

Oh wow, comment on a github issue, go to bed, wake up, bug is fixed! Thanks so much @alvations !!

johnfarina avatar Jul 16 '19 14:07 johnfarina

@alvations Any update on the Hindi tokenization issue?

mtresearcher avatar Jun 04 '20 20:06 mtresearcher