fuzzywuzzy
fuzzywuzzy copied to clipboard
Broken partial_ratio functionality with python-Levenshtein
The partial_ratio calculation seems to yield incorrect results for certain combinations of strings when it uses the python-Levenshtein SequenceMatcher.
This works well:
> fuzz.partial_ratio('this is a test', 'is this is a not this is a test!')
> 100
But changing the longer string slightly, while not affecting the target:
> fuzz.partial_ratio('this is a test', 'is this is a not really thing this is a test!')
> 92
Digging deeper, it appears that the get_matching_blocks() method returns substrings that do not actually match the string we are searching for, so the subsequent ratio calculations are performed on a set of poorly-matched ones.
Removing python-Levenshtein and using the python-only SequenceMatcher makes that method perform its job correctly. I couldn't figure out what it was about certain strings that made it break, after trying a whole bunch.
To top it off, the python-Levenshtein library appears to have been left unsupported for a while now. Any ideas? Maybe for now, removing the recommendation to use python-Levenshtein would let code run correctly, if not as fast? Thanks!
Yes! It was driving me crazy today! I was trying this:
fuzz.partial_ratio('Solo Tango Paisaje / Pablo Rodriguez - Corina Hererra / Planetango XIII'.lower(), 'Herrera'.lower())
and I get 29 score. 29 score matching Hererra with Herrera? Removing random parts of the first string seems to improve the score. I will try your suggestion of removing Levenshtein.
This seems kind of a critical bug though, and I see now it's been over a year it was posted.
Indeed removing python-Levenshtein fixed the problem! Thanks!
Thanks to this bug - i didnt install python-Levenshtein yet :-)
@acslater00 thoughts on this?
Interesting -- I'll take a look at this
seems like this is a known issue https://github.com/ztane/python-Levenshtein/issues/16
My assumption was that get_matching_blocks() would always return the longest continuous block, but that seems to not be the case in some edge cases.
this is a pretty critical bug, are there plans to work on this? Warning should probably be removed in the meantime
@eliot1019 this code works well enough for our use cases, but if you'd like to fork/fix python-levenshtein, then maybe some progress can be made? The warning is because you're using a slower implementation which doesn't quite have the same output either, depending on your input.
This issue is continuing to give people grief - see this SO question and my answer.
For a really minimal example, consider:
fuzz.partial_ratio('test', 't e s test t')
This returns 50. It should be 100.
A slightly more real-world example:
fuzz.partial_ratio('fat', 'find a fat cat')
This returns 67. It should be 100.
Why is this happening? There are two possible sets of minimum operations to convert "test" into "t e s test t". python-Levenshtein chooses the one that involves deleting the continuous "test" substring. I don't see this as an issue with python-Levenshtein, as the calculation of Levenshtein distance doesn't explicitly give weight to continuous blocks.
There is potential for this bug to pop up any time that the shorter string is found as a non-continuous sequence in the longer string. This is unexpected, hard to account for, and potentially very problematic. I don't see this as being an isolated edge case. Perhaps it would be sensible to remove the use of python-Levenshtein until a solution is found? diffLib does seem to handle this better.
@andrewguy I agree with this. I already use a implementation based on difflib in my implementation in RapidFuzz, which works correctly. The main argument for the implementation of python-Levenshtein appears to be it's performance. However the difflib based implementation in RapidFuzz is in no way slower.
@acslater00 @josegonzalez I would be willing to port my implementation to FuzzyWuzzy. Probably the simplest solution would be a get_matching_blocks API in RapidFuzz, so there is no addition maintenance effort for SeatGeek. However looking at recent Pull Requests it appears like SeatGeek does not has the time to review any Pull Requests. In case your interested please let me know, since I do not want to waste my time on a solution that will not be reviewed.