fuzzywuzzy
fuzzywuzzy copied to clipboard
Is it using Levenshtein distance or the Ratcliff/Obershelp pattern-matching algorithm?
As per the documentation of the library, it is mentioned that it uses Levenshtein distance for computing the differences between sequences. But upon close inspection, I find that it actually uses the SequenceMatcher function from the difflib library. And this function, as per documentation uses the Ratcliff/Obershelp pattern-matching algorithm.
As per the definitions, Levenshtein distance is the minimum number of edits needed to transform one string into the other, and Ratcliff/Obershelp pattern-matching algorithm computes the doubled number of matching characters divided by the total number of characters in the two strings. A close related post comparing both.
And when I run an example, I get the same result for SequenceMatcher and ratio function in fuzzywuzzy.
from difflib import SequenceMatcher
from fuzzywuzzy import fuzz
s = SequenceMatcher(None, "abcd", "bcde")
s.ratio()
# 0.75
fuzz.ratio("abcd", "bcde")
# 75
If I compute the Levenshtein distance manually between the two strings, I guess it will be just 2. In this case, how does it become that it uses Levenshtein distance as it is mentioned in the documentation?
@acslater00 Can you answer this?
if python-Levenshtein is installed, it uses '(lensum - ldist)/(lensum))'
where ldist is calculated with replacement cost of 2.
https://github.com/ztane/python-Levenshtein/blob/master/Levenshtein/_levenshtein.c#L760
So to summarize:
-
The default behavior doesn't compute levenshtein distance but instead computes Ratcliff/Obershelp match ratio, which is similar but not identical to jaccard similarity.
-
If python-levenshtein is installed, the "levenshtein ratio" is calculated.
This behavior doesn't appear to be documented anywhere beyond the abstract warning in the readme that installing python-levenshtein "may result in differing results for certain cases."
The second line of the readme asserts:
It uses Levenshtein Distance to calculate the differences between sequences in a simple-to-use package
Considering the default behavior never computes levenshtein, this behavior and the calculations used for computing different similarity functions should be more clearly documented.
Well seems like a good PR for someone to file to the readme to clarify what it does.