flashtext
Can I use a stemmed version of keyphrases to extract them?
Hello, I have a question: can I use a stemmed version of keyphrases to extract them? Sometimes it is useful to match on a stem in order to capture equivalent expressions with variations, for example {digital library: digital librar}. This pattern would match: digital library, digital libraries, digitalized library.
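To make the request concrete, here is a minimal sketch of the behaviour being asked for, written in plain Python rather than with flashtext. It assumes "stem matching" means that each token of the stemmed keyphrase must be a prefix of the corresponding token in the text. The function name `find_stem_matches` and the `stem_map` layout (stem to clean name) are hypothetical and are not part of the flashtext API.

```python
import re


def find_stem_matches(text, stem_map):
    """Return (clean_name, matched_text) pairs for every stem phrase hit.

    stem_map maps a stemmed keyphrase to its clean name (hypothetical layout).
    A window of text tokens matches when every stem token is a prefix of the
    token at the same position.
    """
    tokens = re.findall(r"\w+", text.lower())
    matches = []
    for stem_phrase, clean_name in stem_map.items():
        stems = stem_phrase.lower().split()
        n = len(stems)
        # Slide a window of n tokens across the text.
        for i in range(len(tokens) - n + 1):
            window = tokens[i:i + n]
            if all(tok.startswith(stem) for tok, stem in zip(window, stems)):
                matches.append((clean_name, " ".join(window)))
    return matches


stem_map = {"digital librar": "digital library"}
text = "A digital library differs from digital libraries and a digitalized library."
for clean_name, matched in find_stem_matches(text, stem_map):
    print(clean_name, "<-", matched)
```

Under this prefix assumption all three variants from the example resolve to the clean name "digital library". A real stemmer would behave differently on some inputs, so this only illustrates the intended result, not how it should be implemented inside the library.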
Hi, this issue is about fuzzy matching integration.
For my first PR I would like to submit a simple version of fuzzy matching. If it is accepted, we can move forward and try to integrate custom weights for insertions, deletions, and replacements.
I think this feature, applied with low weights for insertions, would "work" for your kind of problem: adding characters would only slightly increase the Levenshtein distance as it is computed, so a fuzzy match would still be possible.
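As a rough illustration of the weighting idea (this is not the code from the PR), the sketch below computes a Levenshtein distance with separate costs for insertions, deletions, and replacements. The function name `weighted_levenshtein` and its parameter names are assumptions made for this example.

```python
def weighted_levenshtein(source, target, ins_cost=1.0, del_cost=1.0, sub_cost=1.0):
    """Cost of turning `source` into `target` with per-operation weights.

    Standard dynamic-programming edit distance, keeping only two rows.
    """
    # Row 0: build each prefix of `target` from an empty string by insertions.
    prev = [j * ins_cost for j in range(len(target) + 1)]
    for i, s_ch in enumerate(source, start=1):
        # Column 0: delete the first i characters of `source`.
        curr = [i * del_cost]
        for j, t_ch in enumerate(target, start=1):
            substitute = prev[j - 1] + (0.0 if s_ch == t_ch else sub_cost)
            delete = prev[j] + del_cost
            insert = curr[j - 1] + ins_cost
            curr.append(min(substitute, delete, insert))
        prev = curr
    return prev[-1]


# With cheap insertions, a keyword stays close to its longer variants.
print(weighted_levenshtein("cat", "cats", ins_cost=0.25))          # 0.25
print(weighted_levenshtein("librar", "libraries", ins_cost=0.25))  # 0.75
print(weighted_levenshtein("cat", "cut", ins_cost=0.25))           # 1.0
```

With an insertion weight of 0.25, a threshold just below 1.0 would accept "cats" and "libraries" while still rejecting a one-character replacement such as "cut".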
Unless I am misunderstanding, the fuzzy matching added in PR #84 doesn't seem to work well for this kind of problem:
>>> import flashtext
>>> processor = flashtext.KeywordProcessor()
>>> processor.add_keywords_from_list(['cat', 'dog'])
>>> processor.extract_keywords('fight like cat and dog')
['cat', 'dog']
>>> processor.extract_keywords('raining cats and dogs')
[]
>>> processor.extract_keywords('raining cats and dogs', max_cost=2)
[]
>>> processor.extract_keywords('raining cats and dogs', max_cost=20)
['cat', 'cat']
>>> processor.extract_keywords('raining cats and dogs', max_cost=200)
['cat', 'cat']
>>> processor.extract_keywords('raining frogs and dogs', max_cost=200)
['cat', 'cat', 'cat']