fuzzywuzzy
process.extract() with scorer partial_ratio returns wrong results
The following call should return 100, but it returns 40:
>>> fuzz.partial_ratio("thane", "nation hospitality honda water thane thane west")
40
Removing any word from the string "nation hospitality honda water thane thane west" gives the correct result of 100.
This issue is reproducible on every installation, regardless of whether python-levenshtein is installed. Versions:
fuzzywuzzy 0.16.0
python-levenshtein 0.12.0
Python 3.6
Is there a reason the partial ratio result should be 100? And can you add a failing test case to our test suite to prove this?
Yes. Since the shorter string is a substring of the longer string, partial_ratio should be 100. This is described in detail in the GitHub documentation as well as in the blog post:
fuzz.ratio("YANKEES", "NEW YOR") ⇒ 14
fuzz.ratio("YANKEES", "EW YORK") ⇒ 28
fuzz.ratio("YANKEES", "W YORK ") ⇒ 28
fuzz.ratio("YANKEES", " YORK Y") ⇒ 28
...
fuzz.ratio("YANKEES", "YANKEES") ⇒ 100
and conclude that the last one is clearly the best. It turns out that “Yankees” and “New York Yankees” are a perfect partial match…the shorter string is a substring of the longer. We have a helper function for this too (and it’s far more efficient than the simplified algorithm I just laid out)
fuzz.partial_ratio("YANKEES", "NEW YORK YANKEES") ⇒ 100
fuzz.partial_ratio("NEW YORK METS", "NEW YORK YANKEES") ⇒ 69
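For reference, here is a minimal sketch of the simplified sliding-window algorithm the blog post describes. It is not fuzzywuzzy's actual implementation: the function name naive_partial_ratio is hypothetical, and it uses the standard library's difflib.SequenceMatcher instead of python-levenshtein. It only illustrates why a substring match is expected to score 100.

```python
from difflib import SequenceMatcher


def naive_partial_ratio(s1: str, s2: str) -> int:
    """Score the shorter string against every same-length window of the longer one.

    Hypothetical helper for illustration; this is the brute-force approach,
    not the optimized routine shipped in fuzzywuzzy.
    """
    shorter, longer = (s1, s2) if len(s1) <= len(s2) else (s2, s1)
    if not shorter:
        # Assumption: treat an empty input as "no match".
        return 0

    best = 0.0
    for start in range(len(longer) - len(shorter) + 1):
        window = longer[start:start + len(shorter)]
        score = SequenceMatcher(None, shorter, window, autojunk=False).ratio()
        best = max(best, score)
        if best == 1.0:
            # An exact window match cannot be improved on.
            break
    return int(round(100 * best))


print(naive_partial_ratio("thane", "nation hospitality honda water thane thane west"))  # 100
print(naive_partial_ratio("YANKEES", "NEW YORK YANKEES"))  # 100
```

Because the window slides over every position, any exact occurrence of the shorter string is guaranteed to be compared against itself, which is why 100 is the expected answer for the inputs in this report.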
Do you mind adding the appropriate tests to test_fuzzywuzzy.py so that CI hits it and we can see the test fails?
CI passes because it runs Python 3.5.3. The issue seems to occur on Python 3.6, and even on 3.5.6.
Mind filing a PR to use 3.5.6?
I can confirm that this also happens in Python 3.7.
In Python 3.7, a shorter example gives the same result:
>>> fuzz.partial_ratio("thane", "t hosa na e thane ws")
40
Is there any solution for this issue?