
Add an English NLP tokenizer

mratsim opened this issue 5 years ago • 2 comments

There is currently no tokenizer, which makes parsing text and using word embeddings very hard. I.e. right now the only options are splitting on whitespace or a regexp.
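For context, a minimal sketch of what that currently means with just Nim's standard library (the sample sentence is illustrative):

```nim
import strutils, re

let text = "Hello, world! Isn't tokenization hard?"

# Plain whitespace split: punctuation stays glued to the words.
echo text.splitWhitespace()
# -> @["Hello,", "world!", "Isn't", "tokenization", "hard?"]

# Regexp split on non-word runs: punctuation is dropped entirely
# and the apostrophe breaks "Isn't" into "Isn" and "t".
echo text.split(re"\W+")
```

Neither output is what a word-embedding lookup wants.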

mratsim • Nov 05 '18 23:11

Splitting on whitespace or a regexp seems like it could still be OK?

  • sklearn does the same, but lets the user supply their own callable: https://github.com/scikit-learn/scikit-learn/blob/bac89c2/sklearn/feature_extraction/text.py#L260

  • spacy additionally has prefix/suffix/infix regexps, as well as full-word regexps: https://github.com/explosion/spacy/blob/master/spacy/tokenizer.pyx#L20

In contrast:

  • OpenNLP has a learnable tokenizer based on maximum entropy: https://opennlp.apache.org/docs/1.5.3/manual/opennlp.html#tools.tokenizer.introduction

  • https://github.com/jirkamarsik/trainable-tokenizer also has a maximum entropy mode

Are you thinking more of the latter? Or is there something I'm missing about how whitespace/regexp tokenization makes the embeddings difficult?
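For reference, here is one way the sklearn-style "user-supplied callable" pattern could look in Nim. This is a hypothetical sketch, not an existing Arraymancer API; `Tokenizer` and `regexpTokenizer` are made-up names:

```nim
import re

type
  # A tokenizer is just a callable from text to tokens,
  # mirroring sklearn's user-supplied `tokenizer` argument.
  Tokenizer = proc (text: string): seq[string]

proc regexpTokenizer(pattern = r"\w+"): Tokenizer =
  # Default behaviour: extract runs of word characters,
  # in the spirit of sklearn's `token_pattern`.
  let compiled = re(pattern)
  result = proc (text: string): seq[string] =
    text.findAll(compiled)

when isMainModule:
  let tokenize = regexpTokenizer()
  echo tokenize("Hello, world! Isn't this fine?")
  # -> @["Hello", "world", "Isn", "t", "this", "fine"]
```

Any proc matching that signature could be dropped in instead, including a learnable one later.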

metasyn • Nov 18 '18 02:11

Whitespace is OK for a start, but it doesn't handle characters like ?, ! or '.

Furthermore, having a tokenizer type/API is key for languages that don't use whitespace at all (like Chinese).

Anyway, the first thing is to have a whitespace tokenizer ;)
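As a very rough sketch of that first step (the names and the punctuation set are illustrative, not a proposed Arraymancer API), a whitespace tokenizer that also peels punctuation off token edges:

```nim
import strutils

const Punct = {'?', '!', '.', ',', ';', ':', '"', '(', ')'}

proc whitespaceTokenize(text: string): seq[string] =
  # Split on whitespace, then detach leading/trailing punctuation
  # so "easy?!" becomes ["easy", "?", "!"] while "isn't" stays whole.
  for chunk in text.splitWhitespace():
    var word = chunk
    var leading, trailing: seq[string]
    while word.len > 0 and word[0] in Punct:
      leading.add($word[0])
      word = word[1..^1]
    while word.len > 0 and word[^1] in Punct:
      trailing.insert($word[^1], 0)
      word = word[0..^2]
    result.add(leading)
    if word.len > 0:
      result.add(word)
    result.add(trailing)

echo whitespaceTokenize("Wait... isn't this easy?!")
# -> @["Wait", ".", ".", ".", "isn't", "this", "easy", "?", "!"]
```

Languages without whitespace (like Chinese) would then need the learnable approaches linked above behind the same API.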

mratsim • Dec 08 '18 22:12