fuzzywuzzy
fuzzywuzzy copied to clipboard
fuzz.ratio is not always commutative
One of the properties that you'd expect is that the order of the arguments doesn't matter, ie fuzz.ratio(A, B) == fuzz.ratio(B, A)
for all strings A
and B
. This property is also listed at https://en.wikipedia.org/wiki/Edit_distance#Properties as one of the fundamental properties of editing distance.
While it seems to hold most of the times, unfortunately this is not always the case:
from fuzzywuzzy import fuzz
s1 = "tehchqt"
s2 = "tehctht"
print("distance {} -> {}: {}".format(s1, s2, fuzz.ratio(s1, s2)))
print("distance {} -> {}: {}".format(s2, s1, fuzz.ratio(s2, s1)))
gives
/usr/lib/python3.5/site-packages/fuzzywuzzy/fuzz.py:32: UserWarning: Using slow pure-python SequenceMatcher. Install python-Levenshtein to remove this warning
warnings.warn('Using slow pure-python SequenceMatcher. Install python-Levenshtein to remove this warning')
distance tehchqt -> tehctht: 86
distance tehctht -> tehchqt: 71
I haven't tested it with the levenshtein
package.
With python-Levenshtein it appears to be commutative (at least for this example).
>>> from fuzzywuzzy.fuzz import ratio
>>> s1 = 'tehchqt'
>>> s2 = 'tehctht'
>>> ratio(s1, s2)
86
>>> ratio(s2, s1)
86
But I can confirm that it is not commutative without python-Levenshtein.
>>> from fuzzywuzzy.fuzz import ratio
<<WARNINGS>>
>>> s1 = 'tehchqt'
>>> s2 = 'tehctht'
>>> ratio(s1, s2)
86
>>> ratio(s2, s1)
71
I think the issue here is using difflib which doesn't actually give true edit distances - from the docs
This does not yield minimal edit sequences, but does tend to yield matches that “look right” to people.
If you put these examples through the SequenceMatcher in difflib you will get the same (but unrounded) results.
difflib sounds as a likely cause indeed. A relatively simple fix for this case could be to try both directions and take the min or max of them. That would fix the reported discrepancy, although it makes the problem of determining what is actually measured more complicated, in particular if difflib result and python-Levenshtein result are not the same.
In the mean time, we decided that normalized equality notion did not work for our project, and instead opted for true edit distance. As such we are not waiting for a fix of this issue.
Probably no good way to fix this if you are using difflib, and documenting that it isn't commutative is our best shot.
How can I force fuzz.partial_ratio allow only 1st argument as substring and not the other way round.
For example following example will be giving misleading matches: fuzz.partial_ratio("An Act of Wisdom Play", "Play") => 100%
I'm not implementing that in fuzzywuzzy, but you are free to do so in a wrapper function by just checking if one string is in another. Note that you'll definitely incur a performance penalty.
Will you able to share some example for this? Coz substring needs to be fuzzy again.
On 5 Jan 2018 9:02 pm, "Jose Diaz-Gonzalez" [email protected] wrote:
I'm not implementing that in fuzzywuzzy, but you are free to do so in a wrapper function by just checking if one string is in another. Note that you'll definitely incur a performance penalty.
— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/seatgeek/fuzzywuzzy/issues/173#issuecomment-355583080, or mute the thread https://github.com/notifications/unsubscribe-auth/ALn7Gf0Nb08CX020dZ9NSshq6lyQNLuRks5tHkCNgaJpZM4OuZD8 .