Lemmatizing tokenizer
gensim has a lemmatizing tokenizer which, instead of stemming words, converts them to their lemma. For instance, "was," "being," and "am" would all tokenize to "be."
https://radimrehurek.com/gensim/utils.html
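For reference, a minimal sketch of what that call looks like, assuming gensim's `utils.lemmatize` (which itself depends on the Python `pattern` package and is only present in older gensim releases); the output shown is illustrative rather than verified:

```python
# Minimal sketch, assuming gensim.utils.lemmatize (pre-4.0 gensim,
# requires the `pattern` package). Exact output/tags are illustrative.
from gensim.utils import lemmatize

tokens = lemmatize("He was being difficult, and I am tired of it")
print(tokens)
# Each kept token comes back as b"lemma/POS", e.g. b"be/VB" for both
# "was" and "am"; function words are dropped by default.
```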
I can't figure out whether it's based on WordNet.
@dselivanov It looks like gensim provides the lemmatizing function via the Python pattern package. I see some references to WordNet in that package, but it appears to use a rule-based function:
https://github.com/clips/pattern/blob/820cccf33c6ac4a4f1564a273137171cfa6ab7cb/pattern/text/en/inflect.py#L645
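For what it's worth, here's a small sketch of pattern's rule-based approach, assuming the `pattern.en` module (its `lemma()` helper sits on top of the inflection rules linked above); the outputs are what I'd expect rather than verified:

```python
# Sketch using pattern.en.lemma(), the rule-based verb lemmatizer built
# on the inflect.py rules linked above (assumed API).
from pattern.en import lemma

for word in ["was", "being", "am", "gave"]:
    print(word, "->", lemma(word))
# expected: was -> be, being -> be, am -> be, gave -> give
```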
Wordnet seems like the solution I would use.
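As a rough idea of what a WordNet-backed lemmatizer looks like (using NLTK's WordNetLemmatizer here purely as a stand-in, not as a proposal for this package):

```python
# WordNet-backed lemmatization sketch, via NLTK as a stand-in binding.
# Requires the WordNet data: nltk.download("wordnet")
from nltk.stem import WordNetLemmatizer

wnl = WordNetLemmatizer()
print(wnl.lemmatize("was", pos="v"))    # "be"
print(wnl.lemmatize("being", pos="v"))  # "be"
print(wnl.lemmatize("mice", pos="n"))   # "mouse"
# Note the POS argument: without it, WordNet defaults to nouns and
# leaves "was" unchanged.
```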
WordNet seems like a good approach to take, but it's pretty substantial in terms of codebase and somewhat ornery on Windows. There's also LemmaGen, which is written in C++, seems far less troublesome for multi-platform installs, and supports a ton of non-English languages too.
In both cases I'd worry about how much it would increase the size of the codebase. Could this do better in a distinct, suggested/recommended package for lemmatizers and tokenizers, maybe?
@Ironholds Yes, it might be a good idea to break them out into separate packages. There is a wordnet package already (https://cran.rstudio.com/web/packages/wordnet/) but I haven't looked closely at its functionality, and ideally this would be possible without using Java.
I have no immediate plans to add this functionality here or in a separate package, but this issue is just to remind me to look into this more closely at some point.
Gotcha! Okay.