fuzzywuzzy
Problem with fuzz.ratio with the newer version (0.22.0) of python-Levenshtein
Hey. While investigating why fuzz.ratio failed to detect a basic similarity match, I found that the newer version of python-Levenshtein (0.22.0) makes fuzz.ratio return a wrong answer when it is installed alongside fuzzywuzzy (latest version). Here is an example of the wrong output:
code:
from fuzzywuzzy import fuzz
list_to_search_from = ['Barcelona', 'Real Madrid']
text_to_search = 'real'
scores = {search_option: fuzz.ratio(search_option, text_to_search) for search_option in list_to_search_from}
output with 0.22.0: {'Barcelona': 46, 'Real Madrid': 40}
output with 0.21.1: {'Barcelona': 31, 'Real Madrid': 40}, which is the right answer.
After downgrading to 0.21.1 it works fine; I could not find where it fails in 0.22.0. Note that without python-Levenshtein installed at all (clean environment, checked in Colab with and without it) it works like a charm. I wonder why this happens (maybe the Levenshtein distance is calculated incorrectly in the new version).
!pip install python-Levenshtein==0.21.1
Why do you consider the score 46 as incorrect? I understand you would like to find 'Real Madrid' as the best match for 'real', but according to the normalized Indel similarity it simply isn't.
To convert from 'real' to 'Barcelona' the following operations are required:
__r_e___al
Barcelona_
These are 7 operations (6 insertions and 1 deletion). This is then normalized in the following way: 1.0 - (dist / (len(s1) + len(s2))) = 1.0 - 7 / (4 + 9) -> 0.46.
On the other hand, for 'Real Madrid' the following operations are required:
_real_______
R_eal Madrid
These are 9 operations (the comparison is case-sensitive, so 'r' and 'R' do not match). Normalized in the same way, 1.0 - 9 / (4 + 11), you get 0.4.
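The arithmetic above can be checked with a small standard-library sketch. It assumes the Indel distance equals len(s1) + len(s2) - 2 * LCS(s1, s2), where LCS is the length of the longest common subsequence; the helper names lcs_length, indel_distance and indel_ratio are made up for this illustration and are not part of fuzzywuzzy or python-Levenshtein.

```python
def lcs_length(s1, s2):
    """Length of the longest common subsequence (case-sensitive)."""
    prev = [0] * (len(s2) + 1)
    for ch1 in s1:
        curr = [0]
        for j, ch2 in enumerate(s2, start=1):
            if ch1 == ch2:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def indel_distance(s1, s2):
    """Number of insertions and deletions needed to turn s1 into s2."""
    return len(s1) + len(s2) - 2 * lcs_length(s1, s2)


def indel_ratio(s1, s2):
    """Normalized Indel similarity on a 0-100 scale (hypothetical helper)."""
    total = len(s1) + len(s2)
    if total == 0:
        return 100
    return round(100 * (1.0 - indel_distance(s1, s2) / total))


for candidate in ['Barcelona', 'Real Madrid']:
    print(candidate, indel_distance('real', candidate), indel_ratio('real', candidate))
```

Under these assumptions the distances come out as 7 and 9, and the scores as 46 and 40, which agrees with the alignments shown above.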
> Note, without python-Levenshtein install at all (clean, checked with colab with and without) it works like a charm.
No clue how that would work, since in this specific case the difflib fallback actually returns the same results:
>>> import difflib
>>> difflib.SequenceMatcher(None, 'real', 'Barcelona').ratio()
0.46153846153846156
>>> difflib.SequenceMatcher(None, 'real', 'Real Madrid').ratio()
0.4
After investigating, you are indeed right. How would you suggest resolving this kind of problem? partial_ratio is not an option, since "Real Madrid" and "Real Saragosa" are the same for "Real", and their partial_ratio scores should be the same. I wonder how to approach such a problem.
I went with fuzz.partial_token_set_ratio, as I think it should fit better.
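To illustrate why a token-based, case-insensitive comparison fits this use case, here is a rough standard-library sketch. It is not how fuzz.partial_token_set_ratio is implemented; best_token_ratio is a hypothetical helper that lowercases both strings, splits them on whitespace, and keeps the best difflib ratio between any pair of tokens.

```python
import difflib


def best_token_ratio(query, candidate):
    """Best per-token similarity on a 0-100 scale (hypothetical helper)."""
    query_tokens = query.lower().split()
    candidate_tokens = candidate.lower().split()
    if not query_tokens or not candidate_tokens:
        return 0
    # Compare every query token with every candidate token, keep the best score.
    best = max(
        difflib.SequenceMatcher(None, q, c).ratio()
        for q in query_tokens
        for c in candidate_tokens
    )
    return round(100 * best)


candidates = ['Barcelona', 'Real Madrid', 'Real Saragosa']
scores = {name: best_token_ratio('real', name) for name in candidates}
print(scores)
```

With this sketch both 'Real Madrid' and 'Real Saragosa' get the same top score for 'real', while 'Barcelona' stays at 46. The trade-off is that any candidate sharing one exact token with the query scores 100, so it may be too permissive for longer queries.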