fuzzywuzzy icon indicating copy to clipboard operation
fuzzywuzzy copied to clipboard

fuzz.ratio is not always commutative

Open Alberth289346 opened this issue 7 years ago • 6 comments

One of the properties that you'd expect is that the order of the arguments doesn't matter, ie fuzz.ratio(A, B) == fuzz.ratio(B, A) for all strings A and B. This property is also listed at https://en.wikipedia.org/wiki/Edit_distance#Properties as one of the fundamental properties of editing distance.

While it seems to hold most of the times, unfortunately this is not always the case:

from fuzzywuzzy import fuzz

s1 = "tehchqt"
s2 = "tehctht"

print("distance {} -> {}: {}".format(s1, s2, fuzz.ratio(s1, s2)))
print("distance {} -> {}: {}".format(s2, s1, fuzz.ratio(s2, s1)))

gives

/usr/lib/python3.5/site-packages/fuzzywuzzy/fuzz.py:32: UserWarning: Using slow pure-python SequenceMatcher. Install python-Levenshtein to remove this warning
  warnings.warn('Using slow pure-python SequenceMatcher. Install python-Levenshtein to remove this warning')
distance tehchqt -> tehctht: 86
distance tehctht -> tehchqt: 71

I haven't tested it with the levenshtein package.

Alberth289346 avatar Aug 05 '17 07:08 Alberth289346

With python-Levenshtein it appears to be commutative (at least for this example).

>>> from fuzzywuzzy.fuzz import ratio
>>> s1 = 'tehchqt'
>>> s2 = 'tehctht'
>>> ratio(s1, s2)
86
>>> ratio(s2, s1)
86

But I can confirm that it is not commutative without python-Levenshtein.

>>> from fuzzywuzzy.fuzz import ratio
<<WARNINGS>>
>>> s1 = 'tehchqt'
>>> s2 = 'tehctht'
>>> ratio(s1, s2)
86
>>> ratio(s2, s1)
71

I think the issue here is using difflib which doesn't actually give true edit distances - from the docs

This does not yield minimal edit sequences, but does tend to yield matches that “look right” to people.

If you put these examples through the SequenceMatcher in difflib you will get the same (but unrounded) results.

DavidCEllis avatar Sep 30 '17 11:09 DavidCEllis

difflib sounds as a likely cause indeed. A relatively simple fix for this case could be to try both directions and take the min or max of them. That would fix the reported discrepancy, although it makes the problem of determining what is actually measured more complicated, in particular if difflib result and python-Levenshtein result are not the same.

In the mean time, we decided that normalized equality notion did not work for our project, and instead opted for true edit distance. As such we are not waiting for a fix of this issue.

Alberth289346 avatar Oct 02 '17 14:10 Alberth289346

Probably no good way to fix this if you are using difflib, and documenting that it isn't commutative is our best shot.

josegonzalez avatar Oct 02 '17 15:10 josegonzalez

How can I force fuzz.partial_ratio allow only 1st argument as substring and not the other way round.

For example following example will be giving misleading matches: fuzz.partial_ratio("An Act of Wisdom Play", "Play") => 100%

akashbudhia avatar Jan 05 '18 09:01 akashbudhia

I'm not implementing that in fuzzywuzzy, but you are free to do so in a wrapper function by just checking if one string is in another. Note that you'll definitely incur a performance penalty.

josegonzalez avatar Jan 05 '18 15:01 josegonzalez

Will you able to share some example for this? Coz substring needs to be fuzzy again.

On 5 Jan 2018 9:02 pm, "Jose Diaz-Gonzalez" [email protected] wrote:

I'm not implementing that in fuzzywuzzy, but you are free to do so in a wrapper function by just checking if one string is in another. Note that you'll definitely incur a performance penalty.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/seatgeek/fuzzywuzzy/issues/173#issuecomment-355583080, or mute the thread https://github.com/notifications/unsubscribe-auth/ALn7Gf0Nb08CX020dZ9NSshq6lyQNLuRks5tHkCNgaJpZM4OuZD8 .

akashbudhia avatar Jan 05 '18 18:01 akashbudhia