python-Levenshtein
Levenshtein distance very slow on large files
When calculating the distance between two files that are 1.5 MB in length, the distance call takes a little over an hour on my i7. I'm not familiar with the runtime analysis of the algorithm you're using, but it seems to take an inordinate amount of time. Is there anything that can be done to speed it up?
Any time I try to process a file over 300 KB, it crashes. (Python 2.7, Windows 7.) Removing python-Levenshtein makes it take much longer (I'm using fuzzywuzzy), but it doesn't crash. If I get a clean, decent repro with data I can share, I'll open a separate issue for it.
I do not know what I would expect the performance to be on files as large as 1.5 MB, but the running time for the algorithm as implemented is O(n1 * n2), where nj (j = 1, 2) is the length of the infix of string j that remains after stripping the longest common prefix and suffix.
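To make the O(n1 * n2) behavior concrete, here is a minimal pure-Python sketch of that strategy: trim the longest common prefix and suffix, then run the classic Wagner-Fischer dynamic program on the remaining infixes. This is an illustration of the complexity, not python-Levenshtein's actual C implementation, and the function name is hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    # Strip the longest common prefix; it cannot affect the distance.
    while a and b and a[0] == b[0]:
        a, b = a[1:], b[1:]
    # Strip the longest common suffix for the same reason.
    while a and b and a[-1] == b[-1]:
        a, b = a[:-1], b[:-1]
    # If one infix is empty, the distance is the other's length.
    if not a:
        return len(b)
    if not b:
        return len(a)
    # Wagner-Fischer DP over the remaining infixes: O(len(a) * len(b))
    # time, O(len(b)) space by keeping only two rows.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]
```

For two 1.5 MB files with little shared prefix/suffix, the inner loop runs on the order of 1.5e6 * 1.5e6 cell updates, which is why the quadratic cost dominates at that size.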
Data points:
I have updated to Python 3.7. I'm not seeing crashes any longer. However, running the same script on the same inputs is much slower and yields different numbers when python-Levenshtein is installed, specifically using fuzzywuzzy's partial_ratio.
This is on Windows 10 (latest updates) Intel Xeon E5-2630 v3 @2.4GHz (it's single-threaded). It's taking forever to finish with the library installed so I will cut it short and just paste it as-is: https://imgur.com/0unNTWv
It's interesting how different the ratio values are, in addition to how much longer it takes to execute. https://imgur.com/j628B98 <-- For reference, each file in this chart and the one above is being compared against a snippet of a log file that is 174,548 bytes in size. I'm actually running the entire thing against a few different log file snippets, but it's taking too long and I was eager to post this.
Edit: It finished overnight: https://imgur.com/a/Vrp6KQy - wow.
@markdall I can think of three possible explanations:
- There's just a bug. This is boring, but possible.
- The algorithm is using an early-bailout mechanism by default, which reports an inaccurate result. Note that when they disagree, the non-python-Levenshtein result is always much smaller and computed faster. That adds a lot of credence to this theory.
- They are using slightly different algorithms. One might be Levenshtein and the other Damerau-Levenshtein. There are other variants. This explanation seems unlikely considering how dramatic the differences are.
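The early-bailout theory above can be sketched as follows. This is a hypothetical illustration of the mechanism, not code from either library: once every cell in a DP row exceeds a cutoff, the true distance can never drop back below it (row minima are non-decreasing), so the computation can stop early and return a capped value instead of the exact distance.

```python
def bounded_levenshtein(a: str, b: str, max_dist: int) -> int:
    """Return the exact distance if it is <= max_dist, else max_dist + 1."""
    # The length difference is a lower bound on the distance.
    if abs(len(a) - len(b)) > max_dist:
        return max_dist + 1
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        # Bail out: every remaining path already exceeds the cutoff,
        # so the exact answer is unrecoverable from here on.
        if min(curr) > max_dist:
            return max_dist + 1
        prev = curr
    return prev[-1]
```

A ratio built on top of a capped distance like this would come out different from one built on the exact distance, which would fit the pattern of the disagreeing numbers in the charts, if this theory is correct.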
@rljacobson Thanks!