johnfarina

2 comments by johnfarina

The same is true for Chinese and Korean as well: sacremoses splits every character into a separate token. Here's some Chinese:

```python
>>> from sacremoses import MosesTokenizer
>>> mt = MosesTokenizer(lang='zh')
>>> mt.tokenize("记者 应谦 美国")
['记', '者', '应', '谦', '美', '国']
```
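
The Korean case looks analogous; here is a minimal sketch assuming the same splitting behavior (the sample sentence is an illustrative stand-in, and the commented output reflects the bug described above, not a verified run):

```python
from sacremoses import MosesTokenizer

# Hypothetical Korean sample ("기자 미국" = "reporter America");
# lang='ko' is passed the same way as lang='zh' above.
mt_ko = MosesTokenizer(lang='ko')
print(mt_ko.tokenize("기자 미국"))
# Desired:                  ['기자', '미국']
# Observed (per this bug):  ['기', '자', '미', '국']
```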

Oh wow, comment on a GitHub issue, go to bed, wake up, bug is fixed! Thanks so much @alvations!!