fuzzywuzzy process.extract() with scorer partial

process.extract() with scorer partial_ratio returns wrong results

Open SujaySKumar opened this issue 6 years ago • 8 comments

The correct result of the following call should be 100:

>>> fuzz.partial_ratio("thane", "nation hospitality honda water thane thane west")
40

Removing any one word from the string "nation hospitality honda water thane thane west" gives the correct answer of 100.

This issue is reproducible in all installations, irrespective of whether python-levenshtein is installed. Versions:

fuzzywuzzy         0.16.0
python-levenshtein 0.12.0
Python 3.6

Sep 11 '18 14:09 SujaySKumar

Is there a reason the partial ratio result should be 100? And can you add a failing test case to our test suite to prove this?

Sep 12 '18 03:09 josegonzalez

Yes. Since the shorter string is a substring of the longer string, partial_ratio should be 100. This is described in detail in the GitHub documentation as well as in the blog:

fuzz.ratio("YANKEES", "NEW YOR") ⇒ 14
fuzz.ratio("YANKEES", "EW YORK") ⇒ 28
fuzz.ratio("YANKEES", "W YORK ") ⇒ 28
fuzz.ratio("YANKEES", " YORK Y") ⇒ 28
...
fuzz.ratio("YANKEES", "YANKEES") ⇒ 100

and conclude that the last one is clearly the best. It turns out that “Yankees” and “New York Yankees” are a perfect partial match…the shorter string is a substring of the longer. We have a helper function for this too (and it’s far more efficient than the simplified algorithm I just laid out)

fuzz.partial_ratio("YANKEES", "NEW YORK YANKEES") ⇒ 100
fuzz.partial_ratio("NEW YORK METS", "NEW YORK YANKEES") ⇒ 69
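For reference, the "simplified algorithm" the blog excerpt describes can be sketched in a few lines: slide the shorter string across every same-length window of the longer one and keep the best score. This is only an illustration, not fuzzywuzzy's actual implementation. The name naive_partial_ratio is hypothetical, and the standard library's difflib.SequenceMatcher stands in for the scorer as an assumption.

```python
from difflib import SequenceMatcher


def naive_partial_ratio(s1, s2):
    """Brute-force sketch: best ratio of the shorter string against
    every same-length window of the longer string, scaled to 0-100.

    Hypothetical helper for illustration; not part of fuzzywuzzy.
    """
    shorter, longer = (s1, s2) if len(s1) <= len(s2) else (s2, s1)
    if not shorter:
        return 0

    best = 0.0
    # Try every alignment of the shorter string inside the longer one.
    for start in range(len(longer) - len(shorter) + 1):
        window = longer[start:start + len(shorter)]
        score = SequenceMatcher(None, shorter, window).ratio()
        best = max(best, score)
        if best == 1.0:
            break  # an exact window match cannot be beaten
    return int(round(100 * best))


print(naive_partial_ratio(
    "thane", "nation hospitality honda water thane thane west"))  # 100
```

Because every window is checked, an exact substring always scores 100 under this sketch, which is the behaviour the report expects from fuzz.partial_ratio.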

Sep 12 '18 06:09 SujaySKumar

Do you mind adding the appropriate tests to test_fuzzywuzzy.py so that CI hits it and we can see the test fails?

Sep 12 '18 06:09 josegonzalez

CI passes because it uses Python 3.5.3. This issue seems to happen in Python 3.6, and even in 3.5.6.

Sep 14 '18 06:09 SujaySKumar

Mind filing a PR to use 3.5.6?

Sep 14 '18 15:09 josegonzalez

I can confirm that this also happens in Python 3.7

Aug 22 '19 08:08 lisabutti

In Python 3.7, a shorter example that gives the same result:

>>> fuzz.partial_ratio("thane", "t hosa na e thane ws")
40
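The two inputs reported so far could be turned into a regression test of the kind requested earlier in the thread. The following is a minimal sketch, assuming it would sit in or next to test_fuzzywuzzy.py; the class name and the REPORTED_CASES list are hypothetical, and the test is skipped where fuzzywuzzy is not installed.

```python
import unittest

try:
    from fuzzywuzzy import fuzz  # the library under discussion
except ImportError:
    fuzz = None  # lets the module load where fuzzywuzzy is absent

# (shorter, longer) pairs reported in this thread. In each pair the
# shorter string is an exact substring of the longer one.
REPORTED_CASES = [
    ("thane", "nation hospitality honda water thane thane west"),
    ("thane", "t hosa na e thane ws"),
]


@unittest.skipIf(fuzz is None, "fuzzywuzzy is not installed")
class PartialRatioSubstringTest(unittest.TestCase):  # hypothetical name
    def test_exact_substring_scores_100(self):
        for shorter, longer in REPORTED_CASES:
            with self.subTest(shorter=shorter, longer=longer):
                # Sanity check on the test data itself.
                self.assertIn(shorter, longer)
                # Expected behaviour per the report above.
                self.assertEqual(fuzz.partial_ratio(shorter, longer), 100)
```

Running it with python -m unittest on an affected interpreter should show the failure, which would also make it visible in CI once the build uses one of the Python versions mentioned above.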

Sep 05 '19 14:09 gw00207

Is there any solution for this issue?

Jun 16 '20 23:06 Lychfindel
