fuzzywuzzy Is it using Levenshtein distance or the Ratcliff/Obershelp pattern-matching algorithm?

Is it using Levenshtein distance or the Ratcliff/Obershelp pattern-matching algorithm?

Open pras135 opened this issue 6 years ago • 4 comments

trafficstars

As per the documentation of the library, it is mentioned that it uses Levenshtein distance for computing the differences between sequences. But upon close inspection, I find that it actually uses the SequenceMatcher function from the difflib library. And this function, as per documentation uses the Ratcliff/Obershelp pattern-matching algorithm.

As per the definitions, Levenshtein distance is the minimum number of edits needed to transform one string into the other, and Ratcliff/Obershelp pattern-matching algorithm computes the doubled number of matching characters divided by the total number of characters in the two strings. A close related post comparing both.

And when I run an example, I get the same result for SequenceMatcher and ratio function in fuzzywuzzy.

from difflib import SequenceMatcher
from fuzzywuzzy import fuzz
s = SequenceMatcher(None, "abcd", "bcde")
s.ratio()
# 0.75
fuzz.ratio("abcd", "bcde")
# 75

If I compute the Levenshtein distance manually between the two strings, I guess it will be just 2. In this case, how does it become that it uses Levenshtein distance as it is mentioned in the documentation?

Dec 30 '18 18:12 pras135

@acslater00 Can you answer this?

Dec 30 '18 23:12 josegonzalez

if python-Levenshtein is installed, it uses '(lensum - ldist)/(lensum))'

where ldist is calculated with replacement cost of 2.

https://github.com/ztane/python-Levenshtein/blob/master/Levenshtein/_levenshtein.c#L760

Jan 01 '19 14:01 nol13

So to summarize:

The default behavior doesn't compute levenshtein distance but instead computes Ratcliff/Obershelp match ratio, which is similar but not identical to jaccard similarity.
If python-levenshtein is installed, the "levenshtein ratio" is calculated.

This behavior doesn't appear to be documented anywhere beyond the abstract warning in the readme that installing python-levenshtein "may result in differing results for certain cases."

The second line of the readme asserts:

It uses Levenshtein Distance to calculate the differences between sequences in a simple-to-use package

Considering the default behavior never computes levenshtein, this behavior and the calculations used for computing different similarity functions should be more clearly documented.

May 26 '20 17:05 dmarx

Well seems like a good PR for someone to file to the readme to clarify what it does.

May 28 '20 20:05 josegonzalez

fuzzywuzzy fuzzywuzzy copied to clipboard

Is it using Levenshtein distance or the Ratcliff/Obershelp pattern-matching algorithm?

fuzzywuzzy
fuzzywuzzy copied to clipboard