
Feature Request: Customizable Word Tokenizers - Spacy

Open sai-prasanna opened this issue 4 years ago • 2 comments

spaCy has customizable word-level tokenizers with rules for multiple languages. I think porting that to Rust would be a nice addition to this package: having customizable, uniform word-level tokenization across platforms (client web, server) and languages would be beneficial. Currently, I don't know of any clean way to write bindings for spaCy's Cython, or whether it's even possible.

Spacy Tokenizer Code

https://github.com/explosion/spaCy/blob/master/spacy/tokenizer.pyx

Tokenizer exceptions for English

https://github.com/explosion/spaCy/blob/master/spacy/lang/en/tokenizer_exceptions.py
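For anyone picking this up, here is a rough sketch of the core idea in Rust. This is not a port of spaCy's actual algorithm (which also handles prefixes, infixes, and URL matching); it's a minimal, hypothetical illustration of the two pieces the linked files implement: an exceptions table consulted before splitting, and affix rules applied to the remainder. All names here (`tokenize`, the punctuation set) are my own assumptions.

```rust
use std::collections::HashMap;

/// Hypothetical sketch of spaCy-style rule-based tokenization:
/// split on whitespace, check an exceptions table first, then
/// strip trailing punctuation as suffix tokens.
fn tokenize(text: &str, exceptions: &HashMap<&str, Vec<&str>>) -> Vec<String> {
    let mut tokens = Vec::new();
    for chunk in text.split_whitespace() {
        // Exceptions win over the affix rules, as in spaCy's
        // tokenizer_exceptions tables (e.g. "don't" -> "do" + "n't").
        if let Some(parts) = exceptions.get(chunk) {
            tokens.extend(parts.iter().map(|s| s.to_string()));
            continue;
        }
        // Peel trailing punctuation off the chunk, collecting each
        // character as its own suffix token.
        let mut word = chunk;
        let mut suffixes = Vec::new();
        while let Some(last) = word.chars().last() {
            if ".,!?".contains(last) {
                suffixes.push(last.to_string());
                word = &word[..word.len() - last.len_utf8()];
            } else {
                break;
            }
        }
        if !word.is_empty() {
            tokens.push(word.to_string());
        }
        // Suffixes were peeled right-to-left; re-emit in text order.
        tokens.extend(suffixes.into_iter().rev());
    }
    tokens
}

fn main() {
    let mut exceptions = HashMap::new();
    exceptions.insert("don't", vec!["do", "n't"]);
    let toks = tokenize("I don't know.", &exceptions);
    println!("{:?}", toks); // ["I", "do", "n't", "know", "."]
}
```

A real port would need per-language exception tables and regex-based prefix/suffix/infix rules, but the lookup-before-split structure above is the part that makes the behavior customizable.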

I can put in some time doing this.

sai-prasanna avatar Jan 23 '20 06:01 sai-prasanna

Are you working on this, @sai-prasanna?

jxmorris12 avatar Jun 11 '20 22:06 jxmorris12

@jxmorris12 I haven't started it yet; want to collaborate? I can't commit much time to coding this month, but I can help out with reviewing, brainstorming, etc.

sai-prasanna avatar Jun 21 '20 12:06 sai-prasanna