simplemma
Greedy option seems inconsistent
Hi, I'm using version 0.9.1 of your library.
I found inconsistent behavior when using the greedy option. See the example below, where I expected the lemmatized versions of the text to be equal when the greedy option is forced.
>>> text_lemmatizer("fire crew", lang="en")
['fire', 'crow']
>>> text_lemmatizer("fire crews", lang="en", greedy=True)
['fire', 'crew']
>>> text_lemmatizer(" ".join(text_lemmatizer("fire crews", lang="en", greedy=True)), lang="en")
['fire', 'crow']
Thanks,
Hi @dysby, good catch!
My guess would be that the results are cached internally, which affects the results of text_lemmatizer(). In any case it is worth looking further into the issue.
I think it has to do with the minimum word length check at simplemma.py#L495 in the latest release, 0.9.1.
I'm not sure whether the more recent code does the same.