editdistance icon indicating copy to clipboard operation
editdistance copied to clipboard

Accurate benchmarks?

Open ekreutz opened this issue 5 years ago • 3 comments

Hey, I ran a quick benchmark of my own on, using:

  • macOS 10.14.5 (Mojave)
  • Python 3.7.3
  • python-Levenshtein 0.12.0 (pypi)
  • editdistance 0.5.3 (pypi)

In my tests python-Levenshtein is about 10x faster. Perhaps it's the macOS binaries? Or maybe your tests are outdated?

import editdistance
from Levenshtein._levenshtein import *
from timeit import timeit

s00 = "akjsdjkahsdhjkashd"
s01 = "akjsdjkahsdhj"
s10 = 'xyzz'
s11 = 'xab'
s20 = 'aaaaaaaaaaaaaaaaaaaaaaa'
s21 = 'i'

def a1():
    editdistance.eval(s00, s01)
    editdistance.eval(s10, s11)
    editdistance.eval(s20, s21)
def a2():
    distance(s00, s01)
    distance(s10, s11)
    distance(s20, s21)

print("editdistance")
print(timeit(a1, number=100000))

print("\npython-Levenshtein")
print(timeit(a2, number=100000))

Prints:

editdistance
0.330241583

python-Levenshtein
0.03681695899999998

ekreutz avatar Jun 19 '19 09:06 ekreutz

I just compared those two in a real-life application and editdistance is about 30% faster.

desialex avatar Jul 21 '19 23:07 desialex

At least in my benchmarks this is largely dependent on the length of the input strings. Here is a comparision for different libraries using different string lengths. Both edlib and editdistance appear to have a lot of overhead for short strings.

Only python-Levenshtein uses a quadratic time implementation, while all others use Myers/Hyyrös bitparallel implementation.

maxbachmann avatar Apr 11 '21 12:04 maxbachmann

@maxbachmann Great chart. It shows the choice of implementation really depends on the application.

  • One application would be finding the closest single words, e.g., in spelling correction.
  • The other one would be measuring edit distance for long string sequences, e.g., in comparing the layout of documents as in "Comparison and Classification of Documents Based on Layout Similarity". I regret I haven't done a similar evaluation as yours at the time I was implementing document clustering...

For the latter rapidfuzz seems like a good choice.

dzieciou avatar Jul 25 '21 13:07 dzieciou