fuzzywuzzy icon indicating copy to clipboard operation
fuzzywuzzy copied to clipboard

Is it using Levenshtein distance or the Ratcliff/Obershelp pattern-matching algorithm?

Open pras135 opened this issue 6 years ago • 4 comments
trafficstars

As per the documentation of the library, it is mentioned that it uses Levenshtein distance for computing the differences between sequences. But upon close inspection, I find that it actually uses the SequenceMatcher function from the difflib library. And this function, as per documentation uses the Ratcliff/Obershelp pattern-matching algorithm.

As per the definitions, Levenshtein distance is the minimum number of edits needed to transform one string into the other, and Ratcliff/Obershelp pattern-matching algorithm computes the doubled number of matching characters divided by the total number of characters in the two strings. A close related post comparing both.

And when I run an example, I get the same result for SequenceMatcher and ratio function in fuzzywuzzy.

from difflib import SequenceMatcher
from fuzzywuzzy import fuzz
s = SequenceMatcher(None, "abcd", "bcde")
s.ratio()
# 0.75
fuzz.ratio("abcd", "bcde")
# 75

If I compute the Levenshtein distance manually between the two strings, I guess it will be just 2. In this case, how does it become that it uses Levenshtein distance as it is mentioned in the documentation?

pras135 avatar Dec 30 '18 18:12 pras135

@acslater00 Can you answer this?

josegonzalez avatar Dec 30 '18 23:12 josegonzalez

if python-Levenshtein is installed, it uses '(lensum - ldist)/(lensum))'

where ldist is calculated with replacement cost of 2.

https://github.com/ztane/python-Levenshtein/blob/master/Levenshtein/_levenshtein.c#L760

nol13 avatar Jan 01 '19 14:01 nol13

So to summarize:

  1. The default behavior doesn't compute levenshtein distance but instead computes Ratcliff/Obershelp match ratio, which is similar but not identical to jaccard similarity.

  2. If python-levenshtein is installed, the "levenshtein ratio" is calculated.

This behavior doesn't appear to be documented anywhere beyond the abstract warning in the readme that installing python-levenshtein "may result in differing results for certain cases."

The second line of the readme asserts:

It uses Levenshtein Distance to calculate the differences between sequences in a simple-to-use package

Considering the default behavior never computes levenshtein, this behavior and the calculations used for computing different similarity functions should be more clearly documented.

dmarx avatar May 26 '20 17:05 dmarx

Well seems like a good PR for someone to file to the readme to clarify what it does.

josegonzalez avatar May 28 '20 20:05 josegonzalez