simplemma

Greedy option seems inconsistent

Open dysby opened this issue 2 years ago • 2 comments

Hi, I'm using version 0.9.1 of your library.

I found inconsistent behavior when using the greedy option. See the example below, where I expected the lemmatized versions of the text to be equal when the greedy option is forced.

>>> text_lemmatizer("fire crew", lang="en")
['fire', 'crow']
>>> text_lemmatizer("fire crews", lang="en", greedy=True)
['fire', 'crew']
>>> text_lemmatizer(" ".join(text_lemmatizer("fire crews", lang="en", greedy=True)), lang="en")
['fire', 'crow']

Thanks,

dysby, May 26 '23 16:05

Hi @dysby, good catch!

My guess would be that the results are cached internally, which affects the results of text_lemmatizer(). In any case it is worth looking further into the issue.

adbar, May 30 '23 10:05

I think it has to do with the minimum word length in simplemma.py#L495 in the latest release, 0.9.1.

I'm not sure whether the more recent code does the same.
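A toy sketch of how a minimum word length could produce the outputs from the first comment: if the extra greedy step only runs for tokens above some length, a short token such as "crews" stops at "crew", while a later plain lookup of "crew" still yields "crow". The table, the threshold value `MIN_GREEDY_LENGTH`, and the function names are all assumptions made for illustration; they are not taken from simplemma's source.

```python
from typing import Dict, List

# Hypothetical values for illustration; not taken from simplemma's source.
TOY_TABLE: Dict[str, str] = {"crews": "crew", "crew": "crow"}
MIN_GREEDY_LENGTH = 6  # assumed threshold


def toy_lemmatize(token: str, greedy: bool = False) -> str:
    """Look the token up once; in greedy mode, look up again if it is long enough."""
    lemma = TOY_TABLE.get(token, token)
    # The second lookup is skipped for tokens shorter than the threshold.
    if greedy and len(token) >= MIN_GREEDY_LENGTH:
        lemma = TOY_TABLE.get(lemma, lemma)
    return lemma


def toy_text_lemmatizer(text: str, greedy: bool = False) -> List[str]:
    return [toy_lemmatize(tok, greedy) for tok in text.split()]


print(toy_text_lemmatizer("fire crews", greedy=True))  # ['fire', 'crew']
print(toy_text_lemmatizer("fire crew"))                # ['fire', 'crow']
```

Under these assumptions the greedy flag has no effect on short tokens, which would explain why the greedy output is not stable under a second pass.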

dysby avatar May 30 '23 12:05 dysby