Arraymancer
Add an English NLP tokenizer
There is currently no tokenizer, which makes parsing text and using word embeddings very hard: right now the only options are splitting on whitespace or a regexp.
Splitting on white space and regexp seems like it could still be ok?
- sklearn does the same, but lets the user supply a callable: https://github.com/scikit-learn/scikit-learn/blob/bac89c2/sklearn/feature_extraction/text.py#L260
- spacy additionally has prefix/suffix/infix regexps, as well as full-word regexps: https://github.com/explosion/spacy/blob/master/spacy/tokenizer.pyx#L20
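To illustrate the sklearn pattern referenced above (a user-supplied callable replacing the built-in tokenization), here is a minimal sketch in plain Python; the function names are hypothetical, not Arraymancer or sklearn API:

```python
import re
from collections import Counter

def default_tokenizer(text):
    # Naive default: split on runs of whitespace, like a plain text.split().
    return text.split()

def count_tokens(docs, tokenizer=default_tokenizer):
    # Mirrors sklearn's design: the tokenizer is a pluggable callable,
    # so users can swap in anything from a regexp to a trained model.
    counts = Counter()
    for doc in docs:
        counts.update(tokenizer(doc.lower()))
    return counts

# A custom callable that keeps punctuation as separate tokens.
punct_aware = lambda text: re.findall(r"\w+|[^\w\s]", text)

print(count_tokens(["Don't stop!"], tokenizer=punct_aware))
```

The point is only the shape of the API: the counting code never needs to know how tokens are produced.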
In contrast:
- OpenNLP has a learnable tokenizer, using maximum entropy: https://opennlp.apache.org/docs/1.5.3/manual/opennlp.html#tools.tokenizer.introduction
- https://github.com/jirkamarsik/trainable-tokenizer also has a maximum-entropy mode
Are you thinking more of the former? Or is there something I'm missing about how whitespace/regexp tokenization makes the embeddings difficult?
Whitespace is OK for a start, but it doesn't work with characters like `?`, `!`, or `'`.
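A quick sketch of the problem: whitespace splitting glues punctuation onto the words, while even a simple regexp tokenizer can separate them (plain Python for illustration):

```python
import re

text = "Wait, what?! It's fine."

# Whitespace splitting keeps punctuation attached to the words.
print(text.split())
# → ['Wait,', 'what?!', "It's", 'fine.']

# A regexp tokenizer emits words and punctuation as separate tokens.
print(re.findall(r"\w+|[^\w\s]", text))
# → ['Wait', ',', 'what', '?', '!', 'It', "'", 's', 'fine', '.']
```

With whitespace splitting, `what?!` and `what` would map to different embedding rows, which is what makes word embeddings hard to use.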
Furthermore, having a tokenizer type/API is key for languages that don't use whitespace at all (like Chinese).
Anyway, the first thing is to have a whitespace tokenizer ;)
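As a starting point, one hypothetical shape for such an API (all names here are assumptions, not existing Arraymancer code; Python used just for the sketch): a tokenizer is any callable from text to tokens, with whitespace splitting as the first concrete implementation.

```python
from typing import Callable, List

# Hypothetical tokenizer type: any callable mapping text to a token list.
Tokenizer = Callable[[str], List[str]]

def whitespace_tokenizer(text: str) -> List[str]:
    # First concrete implementation: split on runs of whitespace.
    return text.split()

def tokenize_corpus(docs: List[str],
                    tok: Tokenizer = whitespace_tokenizer) -> List[List[str]]:
    # Downstream code (e.g. embedding lookup) depends only on the
    # Tokenizer type, so a Chinese segmenter or a learnable tokenizer
    # could be swapped in later without API changes.
    return [tok(d) for d in docs]

print(tokenize_corpus(["hello world", "one  two three"]))
# → [['hello', 'world'], ['one', 'two', 'three']]
```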