
Lemmatizing tokenizer

lmullen opened this issue Mar 25 '16 · 5 comments

gensim has a lemmatizing tokenizer which, instead of stemming words, converts them to their lemma. For instance, "was," "being," and "am" would all tokenize to "be."

https://radimrehurek.com/gensim/utils.html
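
Roughly how it looks from Python (an untested sketch; gensim's `lemmatize` depends on the `pattern` package being installed alongside gensim):

```python
# gensim.utils.lemmatize delegates to the `pattern` package under the hood,
# so this assumes `pip install pattern` as well as gensim itself.
from gensim.utils import lemmatize

print(lemmatize("He was being difficult"))
# roughly: [b'be/VB', b'be/VB', b'difficult/JJ']
# "was" and "being" both come back as the lemma "be", with a POS tag attached
```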

lmullen · Mar 25 '16

Can't figure out: is it based on WordNet?

dselivanov · Mar 27 '16

@dselivanov It looks like gensim provides the lemmatizing function via the Python pattern package. I see some references to WordNet in that package, but the lemmatizer appears to be a rule-based function:

https://github.com/clips/pattern/blob/820cccf33c6ac4a4f1564a273137171cfa6ab7cb/pattern/text/en/inflect.py#L645
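
For what it's worth, pattern exposes that rule-based function directly as `lemma()`; a quick (untested) sketch:

```python
# Sketch of calling pattern's rule-based lemmatizer directly;
# assumes `pip install pattern`.
from pattern.en import lemma

for word in ("was", "being", "am"):
    print(word, "->", lemma(word))
# was -> be, being -> be, am -> be
```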

WordNet seems like the solution I would use.
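
If we went the WordNet route, NLTK's `WordNetLemmatizer` shows what that looks like from Python (just an illustration of the approach, not a suggestion to depend on NLTK):

```python
# Assumes `pip install nltk` plus nltk.download("wordnet") for the data.
from nltk.stem import WordNetLemmatizer

wnl = WordNetLemmatizer()
print(wnl.lemmatize("was", pos="v"))    # -> 'be'
print(wnl.lemmatize("being", pos="v"))  # -> 'be'
# Note: WordNet lookups need a part of speech; with the default
# noun lookup, "was" would come back unchanged.
```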

lmullen · Mar 30 '16

WordNet seems like a good approach to take, but it's pretty substantial in terms of codebase and somewhat ornery around Windows. There's also LemmaGen, which is written in C++, appears a lot less complainy when it comes to multi-platform installs, and has support for a ton of non-English languages too.
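
A sketch of LemmaGen from its Python bindings, for flavour; I'm assuming the `lemmagen3` package here, so treat the details as unverified:

```python
# Assumes the lemmagen3 bindings to the C++ LemmaGen library
# (pip install lemmagen3); API details unverified.
from lemmagen3 import Lemmatizer

lem_en = Lemmatizer("en")           # prebuilt models ship for many languages
print(lem_en.lemmatize("walking"))  # -> 'walk'
```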

In both cases I guess I'd worry about how it'd increase the size of the codebase. Could this do better in a distinct, suggested/recommended package, maybe? A lemmatizers package alongside tokenizers.

Ironholds · Mar 31 '17

@Ironholds Yes, it might be a good idea to break them out into separate packages. There is a wordnet package already (https://cran.rstudio.com/web/packages/wordnet/), but I haven't looked closely at its functionality, and ideally this would be possible without using Java.

I have no immediate plans to add this functionality here or in a separate package, but this issue is just to remind me to look into this more closely at some point.

lmullen · Apr 01 '17

Gotcha! Okay.

Ironholds · Apr 01 '17