RapidFuzz
RapidFuzz copied to clipboard
implement banded version of Levenshtein distance
A banded version of the Levenshtein distance algorithm should be implemented as described in https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.142.1245&rep=rep1&type=pdf.
This would reduce the runtime of string_metric.levenshtein
from O(N/64 * M)
to O(score_cutoff/64 * M
which could significantly improve the performance on long sequences. This could be used the reduce the memory usage of string_metric.levenshtein_editops
as well, since it is only require to store the relevant band of the Levenshtein matrix instead of storing the whole matrix. E.g. for the case described in https://github.com/qurator-spk/dinglehopper/issues/62 this could reduce the memory usage from around 110k**2 * 3bit = 4.2GB
to 110k * 4k * 3bit = 160MB
.
@mikegerber as a first step I updated the algorithm to use the alignment algorithm by Heikki Hirrö (A Note on Bit-Parallel Alignment Computation.
This improves memory usage by 33% from len(s1) * len(s2) * 3 bit
to len(s1) * len(s2) * 2 bit
. In addition this makes the algorithm around 10-20% faster.
/usr/bin/time -f '\t%E real,\t%U user,\t%S sys,\t%M mmem' dinglehopper gt.txt frak2021_0.905_1587027_9141630.txt
0:22.97 real, 16.55 user, 7.33 sys, 5013484 mmem
improved to
/usr/bin/time -f '\t%E real,\t%U user,\t%S sys,\t%M mmem' dinglehopper gt.txt frak2021_0.905_1587027_9141630.txt
0:20.78 real, 14.23 user, 7.30 sys, 3353820 mmem