Non-English tokenizers
Describe the solution you'd like
In CJK languages such as Chinese, words are not separated by spaces, so a tokenizer is usually needed to split sentences into word stems, for example this one: https://github.com/yanyiwu/cppjieba (a minimal usage sketch follows below). Is this currently doable in PISA? If not, is there any plan to add this feature in the future?
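For illustration, here is a minimal sketch of what word segmentation with cppjieba looks like; the dictionary paths refer to the files bundled in cppjieba's `dict/` directory, and the sample sentence and its segmentation are taken from cppjieba's README:

```cpp
#include <iostream>
#include <string>
#include <vector>

#include "cppjieba/Jieba.hpp"

int main() {
    // Dictionary files shipped in cppjieba's dict/ directory.
    cppjieba::Jieba jieba(
        "dict/jieba.dict.utf8",   // main dictionary
        "dict/hmm_model.utf8",    // HMM model for out-of-vocabulary words
        "dict/user.dict.utf8",    // optional user dictionary
        "dict/idf.utf8",          // IDF weights (used for keyword extraction)
        "dict/stop_words.utf8");  // stop words

    std::vector<std::string> words;
    // Cut a sentence that contains no spaces into individual terms.
    jieba.Cut("我来到北京清华大学", words, /*hmm=*/true);

    for (auto const& w : words) {
        std::cout << w << '\n';  // 我 / 来到 / 北京 / 清华大学
    }
    return 0;
}
```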
Yes, it is doable. If you want to see this implemented, you can send a PR and we will review it. Thanks!
Unfortunately, none of us regular contributors have much knowledge of these languages, so we'll need someone with more knowledge to step up in order to properly implement and test it.
If someone wants to help out with that, we can definitely provide some guidance on how parsing and tokenization work within PISA.
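As a starting point, the integration could be as small as wrapping cppjieba behind a "string in, terms out" interface. The sketch below is hypothetical: the class and method names are made up for illustration and do not reflect PISA's actual tokenizer API.

```cpp
// Hypothetical adapter sketch only; PISA's real tokenizer interface may differ.
#include <string>
#include <vector>

#include "cppjieba/Jieba.hpp"

class JiebaTokenizer {
  public:
    // dict_dir is expected to point at cppjieba's bundled dict/ directory.
    explicit JiebaTokenizer(std::string const& dict_dir)
        : m_jieba(dict_dir + "/jieba.dict.utf8",
                  dict_dir + "/hmm_model.utf8",
                  dict_dir + "/user.dict.utf8",
                  dict_dir + "/idf.utf8",
                  dict_dir + "/stop_words.utf8") {}

    // Split a UTF-8 sentence into terms; HMM mode helps with
    // out-of-vocabulary words.
    std::vector<std::string> tokenize(std::string const& text) const {
        std::vector<std::string> terms;
        m_jieba.Cut(text, terms, /*hmm=*/true);
        return terms;
    }

  private:
    cppjieba::Jieba m_jieba;
};
```

The terms returned by such an adapter could then be fed into the indexing pipeline in place of whitespace-delimited English tokens.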