editdistance
Accurate benchmarks?
Hey, I ran a quick benchmark of my own, using:
- macOS 10.14.5 (Mojave)
- Python 3.7.3
- python-Levenshtein 0.12.0 (PyPI)
- editdistance 0.5.3 (PyPI)
In my tests python-Levenshtein is about 10x faster. Perhaps it's the macOS binaries? Or maybe your tests are outdated?
```python
import editdistance
from Levenshtein._levenshtein import *
from timeit import timeit

# a few string pairs of different lengths and similarity
s00 = "akjsdjkahsdhjkashd"
s01 = "akjsdjkahsdhj"
s10 = 'xyzz'
s11 = 'xab'
s20 = 'aaaaaaaaaaaaaaaaaaaaaaa'
s21 = 'i'

def a1():
    editdistance.eval(s00, s01)
    editdistance.eval(s10, s11)
    editdistance.eval(s20, s21)

def a2():
    distance(s00, s01)
    distance(s10, s11)
    distance(s20, s21)

print("editdistance")
print(timeit(a1, number=100000))
print("\npython-Levenshtein")
print(timeit(a2, number=100000))
```
Prints:

```
editdistance
0.330241583

python-Levenshtein
0.03681695899999998
```
I just compared those two in a real-life application and editdistance is about 30% faster.
At least in my benchmarks this depends largely on the length of the input strings. Here is a comparison of different libraries across different string lengths. Both edlib and editdistance appear to have a lot of overhead for short strings.
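A length sweep like that can be reproduced with a small self-contained script. The sketch below times a plain Wagner-Fischer dynamic-programming implementation (used as a stand-in so it runs without third-party packages); to benchmark the actual libraries, swap `dp_distance` for `editdistance.eval`, `Levenshtein.distance`, etc.:

```python
import random
import string
from timeit import timeit

def dp_distance(a, b):
    # textbook O(len(a) * len(b)) Wagner-Fischer dynamic programming
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

random.seed(0)
for n in (4, 16, 64, 128):
    a = "".join(random.choices(string.ascii_lowercase, k=n))
    b = "".join(random.choices(string.ascii_lowercase, k=n))
    t = timeit(lambda: dp_distance(a, b), number=50)
    print(f"len={n:4d}  {t:.4f}s")
```

For the pure-Python stand-in the time grows roughly quadratically with `n`; the compiled libraries start from a much lower per-call constant, which is exactly where the short-string overhead differences show up.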
Only python-Levenshtein uses a quadratic-time implementation, while all the others use the Myers/Hyyrö bit-parallel implementation.
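For reference, the core of Myers' bit-parallel algorithm fits in a few lines. This is an illustrative pure-Python sketch for patterns that fit in a single bit-vector word; the real libraries implement it in C/C++ and handle longer patterns by blocking:

```python
def myers_distance(a, b):
    """Levenshtein distance via Myers' bit-parallel algorithm.

    Illustrative sketch only: Python's big integers stand in for
    machine words, so the word-size limit is handled with an explicit
    mask rather than by blocking as the compiled libraries do.
    """
    m = len(a)
    if m == 0:
        return len(b)
    # peq[c]: bitmask of positions where character c occurs in a
    peq = {}
    for i, c in enumerate(a):
        peq[c] = peq.get(c, 0) | (1 << i)
    mask = (1 << m) - 1
    high = 1 << (m - 1)
    vp, vn, score = mask, 0, m          # vertical +1 / -1 delta vectors
    for c in b:
        eq = peq.get(c, 0)
        xv = eq | vn
        xh = (((eq & vp) + vp) ^ vp) | eq
        hp = (vn | ~(xh | vp)) & mask   # horizontal +1 deltas
        hn = vp & xh                    # horizontal -1 deltas
        if hp & high:                   # track score in the last row
            score += 1
        if hn & high:
            score -= 1
        hp = ((hp << 1) | 1) & mask
        hn = (hn << 1) & mask
        vp = (hn | ~(xv | hp)) & mask
        vn = hp & xv
    return score

print(myers_distance("kitten", "sitting"))  # 3
```

Each text character costs a constant number of word operations regardless of the pattern length (up to the word size), which is why the bit-parallel libraries pull ahead on longer strings.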
@maxbachmann Great chart. It shows the choice of implementation really depends on the application.
- One application would be finding the closest single words, e.g., in spelling correction.
- The other would be measuring edit distance for long string sequences, e.g., comparing the layout of documents as in "Comparison and Classification of Documents Based on Layout Similarity". I regret I didn't do an evaluation like yours back when I was implementing document clustering...
For the latter, rapidfuzz seems like a good choice.