
Can I use stemmed version of keyphrases to extract them?

Open renatofcorrea opened this issue 5 years ago • 2 comments

Hello, I have a question: can I use stemmed versions of keyphrases to extract them? It is sometimes useful to use a stem to capture equivalent expressions with variations. For example, with {digital library: digital librar}, the pattern would match: digital library, digital libraries, digitalized library.
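
To make the idea concrete, here is a minimal standard-library sketch of the matching rule described above: a stemmed keyphrase matches when each of its stems is a prefix of the corresponding consecutive word in the text. The names `extract_by_stem` and `stem_to_name` are hypothetical and are not part of flashtext; this only illustrates the desired behaviour, assuming simple word tokenization and case-insensitive matching.

```python
import re


def extract_by_stem(text, stem_to_name):
    """Return the clean name for every run of words matched by a stemmed keyphrase.

    A stemmed keyphrase such as "digital librar" matches a run of consecutive
    words when each stem is a prefix of the corresponding word.
    """
    words = re.findall(r"\w+", text.lower())
    # Pre-split every stemmed keyphrase into a tuple of lowercase stems.
    stems = {
        tuple(stem.lower().split()): name
        for stem, name in stem_to_name.items()
        if stem.strip()  # ignore empty keyphrases
    }
    found = []
    for i in range(len(words)):
        for stem_tokens, name in stems.items():
            window = words[i:i + len(stem_tokens)]
            if len(window) == len(stem_tokens) and all(
                word.startswith(stem) for word, stem in zip(window, stem_tokens)
            ):
                found.append(name)
    return found


# Hypothetical mapping in the spirit of {digital library: digital librar}.
stem_to_name = {"digital librar": "digital library"}
print(extract_by_stem("digital libraries and a digitalized library", stem_to_name))
```

This scans every word position against every keyphrase, so it does not have the performance characteristics of a trie-based lookup; it is only meant to pin down what "matching by stem" would return.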

renatofcorrea avatar Nov 19 '18 17:11 renatofcorrea

Hi, this issue is about fuzzy matching integration.

For my first PR I would like to submit a simple version of fuzzy matching; if it is accepted, we can move forward and try to integrate custom weights for insertions, deletions, and replacements.

I think this feature, applied with low weights for insertions, would "work" for your kind of problem: adding characters would only slightly increase the Levenshtein distance as it is computed, so a fuzzy match would still be possible.
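
As an illustration of the weighting idea, a Levenshtein distance with separate costs per operation can be sketched as below. This is not the code from the PR; the function name `weighted_levenshtein` and its parameters are assumptions made for this example, and it compares two whole strings rather than walking a keyword trie.

```python
def weighted_levenshtein(source, target, insert_cost=1.0, delete_cost=1.0, replace_cost=1.0):
    """Cost of turning `source` into `target` with per-operation weights.

    Classic dynamic programme, keeping only the previous row of the table.
    """
    # Turning an empty prefix of `source` into target[:j] takes j insertions.
    previous = [j * insert_cost for j in range(len(target) + 1)]
    for i, s_char in enumerate(source, start=1):
        # Turning source[:i] into an empty string takes i deletions.
        current = [i * delete_cost]
        for j, t_char in enumerate(target, start=1):
            substitution = previous[j - 1] + (0 if s_char == t_char else replace_cost)
            current.append(
                min(
                    previous[j] + delete_cost,     # drop s_char
                    current[j - 1] + insert_cost,  # add t_char
                    substitution,                  # keep or replace
                )
            )
        previous = current
    return previous[-1]


# With cheap insertions, a keyword stays close to its longer variants.
print(weighted_levenshtein("cat", "cats", insert_cost=0.25))
print(weighted_levenshtein("library", "libraries", insert_cost=0.25))
```

With `insert_cost=0.25`, "cat" to "cats" costs 0.25 and "library" to "libraries" costs 1.5 (one replacement plus two cheap insertions), whereas the reverse direction still pays full price for deletions.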

remiadon avatar Apr 05 '19 16:04 remiadon

Unless I am misunderstanding, the fuzzy matching added in PR #84 doesn't seem to work well for this kind of problem:

>>> import flashtext
>>> processor = flashtext.KeywordProcessor()
>>> processor.add_keywords_from_list(['cat', 'dog'])
>>> processor.extract_keywords('fight like cat and dog')
['cat', 'dog']
>>> processor.extract_keywords('raining cats and dogs')
[]
>>> processor.extract_keywords('raining cats and dogs', max_cost=2)
[]
>>> processor.extract_keywords('raining cats and dogs', max_cost=20)
['cat', 'cat']
>>> processor.extract_keywords('raining cats and dogs', max_cost=200)
['cat', 'cat']
>>> processor.extract_keywords('raining frogs and dogs', max_cost=200)
['cat', 'cat', 'cat']

ecwootten avatar Jan 25 '22 12:01 ecwootten