python-Levenshtein icon indicating copy to clipboard operation
python-Levenshtein copied to clipboard

Levenshtein distance very slow on large files

Open Jesse-V opened this issue 10 years ago • 5 comments

When calculating the distance between two files that are 1.5 MB in length, the distance call takes a little over an hour on my i7. I'm not familiar with the runtime analysis on the algorithm your using, but it seems to take an inordinate amount of time. Is there anything that can be done to speed it up?

Jesse-V avatar Feb 24 '15 18:02 Jesse-V

Any time I try to process a file over 300k it crashes. (Python 2.7, Windows 7.) Removing python-Levenshtein causes it to take much longer (I'm using fuzzywuzzy) but it doesn't crash. If I get a clean and decent repro with data I can share I'll open a separate issue for it.

markdall avatar Oct 28 '16 00:10 markdall

I do not know what I would expect the performance to be on files as large as 1.5 MB, but the running time for the algorithm as implemented is O(n1 * n2) where nj (j=1,2) is the length of the infix of string j which doesn't include the longest common pre- or postfix.

rljacobson avatar Mar 29 '19 18:03 rljacobson

Data points:

I have updated to Python 3.7. I'm not seeing crashes any longer. However, running the same script on the same inputs is much slower and yields different numbers when python-Levenshtein is installed, specifically using fuzzywuzzy's partial_ratio.

This is on Windows 10 (latest updates) Intel Xeon E5-2630 v3 @2.4GHz (it's single-threaded). It's taking forever to finish with the library installed so I will cut it short and just paste it as-is: https://imgur.com/0unNTWv

It's interesting how different the ratio values are, in addition to how much longer it takes to execute. https://imgur.com/j628B98 <-- For reference each file in the chart here and above are being compared against a snippet of a log file which is 174,548 bytes in size. I'm actually running the entire thing against a few different log file snippets but it's taking too long and I was eager to post this.

Edit: It finished overnight: https://imgur.com/a/Vrp6KQy - wow.

markdall avatar Apr 02 '19 22:04 markdall

@markdall I can think of three possible explanations:

  • There's just a bug. This is boring, but possible.
  • The algorithm is using an early bailout mechanism by default which results in an inaccurate result being reported. Note that when they disagree, the non-python-levenshtein result is always much smaller, with faster computation. That adds a lot of credence to this theory.
  • They are using slightly different algorithms. One might be Levenshtein and the other Levenshtein-Damerau. There are other variants. This explanation seems unlikely considering how dramatic the differences are.

rljacobson avatar Apr 29 '19 12:04 rljacobson

@rljacobson Thanks!

markdall avatar May 02 '19 21:05 markdall