
Non-English tokenizers

yf-hk opened this issue · 2 comments

Describe the solution you'd like

For CJK languages such as Chinese, words are not separated by spaces, so there is usually a need for a tokenizer that splits sentences into words or word stems, for example this one: https://github.com/yanyiwu/cppjieba (a minimal usage sketch is shown below). Is this currently doable in PISA? If not, is there any plan to add this feature in the future?

Additional context

— yf-hk, Jun 07 '21
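For reference, here is a minimal sketch of what standalone segmentation with cppjieba looks like. The dictionary paths and the example sentence follow the cppjieba README; the exact dictionary locations depend on where the library is installed, so treat the paths as placeholders.

```cpp
#include <iostream>
#include <string>
#include <vector>

#include "cppjieba/Jieba.hpp"

int main() {
    // The dictionaries ship with the cppjieba repository; these relative
    // paths are placeholders for wherever they are installed.
    cppjieba::Jieba jieba("dict/jieba.dict.utf8",
                          "dict/hmm_model.utf8",
                          "dict/user.dict.utf8",
                          "dict/idf.utf8",
                          "dict/stop_words.utf8");

    std::string sentence = "他来到了网易杭研大厦";  // example sentence from the cppjieba README
    std::vector<std::string> words;
    jieba.Cut(sentence, words, true);  // segment the sentence into word tokens

    for (auto const& word : words) {
        std::cout << word << '\n';
    }
}
```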

Yes, it is doable. If you want to see this implemented, you can send a PR and we will review it. Thanks!

— amallia, Jun 07 '21

Unfortunately, none of us regular contributors have much knowledge of these languages, so we'll need someone with more expertise to step up in order to properly implement and test it.

If someone wants to help out with that, we can definitely provide guidance on how parsing and tokenization work within PISA (a rough sketch of one possible integration follows below).

— elshize, Feb 27 '22
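For illustration, here is a rough sketch of how a cppjieba-backed tokenizer might sit behind a token-stream interface. The `Tokenizer` base class below is hypothetical and only stands in for whatever abstraction PISA's parsing pipeline actually uses; a real PR would need to adapt this to the interfaces in the codebase.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

#include "cppjieba/Jieba.hpp"

// Hypothetical interface standing in for PISA's tokenizer abstraction;
// the actual interface in the codebase will differ.
class Tokenizer {
  public:
    virtual ~Tokenizer() = default;
    // Returns the next token, or std::nullopt when the input is exhausted.
    virtual std::optional<std::string> next() = 0;
};

// Adapter that segments a CJK string once with cppjieba and then yields
// the resulting tokens one at a time.
class JiebaTokenizer : public Tokenizer {
  public:
    JiebaTokenizer(cppjieba::Jieba& jieba, std::string const& text) {
        jieba.Cut(text, m_tokens, true);
    }

    std::optional<std::string> next() override {
        if (m_pos >= m_tokens.size()) {
            return std::nullopt;
        }
        return m_tokens[m_pos++];
    }

  private:
    std::vector<std::string> m_tokens;
    std::size_t m_pos = 0;
};
```

A design along these lines keeps language-specific segmentation isolated in one class, so whitespace tokenization for English and dictionary-based segmentation for CJK text could be selected per collection without touching the rest of the indexing pipeline.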