RapidFuzz icon indicating copy to clipboard operation
RapidFuzz copied to clipboard

implement banded version of Levenshtein distance

Open maxbachmann opened this issue 2 years ago • 1 comments

A banded version of the Levenshtein distance algorithm should be implemented as described in https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.142.1245&rep=rep1&type=pdf. This would reduce the runtime of string_metric.levenshtein from O(N/64 * M) to O(score_cutoff/64 * M which could significantly improve the performance on long sequences. This could be used the reduce the memory usage of string_metric.levenshtein_editops as well, since it is only require to store the relevant band of the Levenshtein matrix instead of storing the whole matrix. E.g. for the case described in https://github.com/qurator-spk/dinglehopper/issues/62 this could reduce the memory usage from around 110k**2 * 3bit = 4.2GB to 110k * 4k * 3bit = 160MB.

maxbachmann avatar Jan 05 '22 12:01 maxbachmann

@mikegerber as a first step I updated the algorithm to use the alignment algorithm by Heikki Hirrö (A Note on Bit-Parallel Alignment Computation.

This improves memory usage by 33% from len(s1) * len(s2) * 3 bit to len(s1) * len(s2) * 2 bit. In addition this makes the algorithm around 10-20% faster.

/usr/bin/time -f '\t%E real,\t%U user,\t%S sys,\t%M mmem' dinglehopper gt.txt frak2021_0.905_1587027_9141630.txt
	0:22.97 real,	16.55 user,	7.33 sys,	5013484 mmem

improved to

/usr/bin/time -f '\t%E real,\t%U user,\t%S sys,\t%M mmem' dinglehopper gt.txt frak2021_0.905_1587027_9141630.txt
	0:20.78 real,	14.23 user,	7.30 sys,	3353820 mmem

maxbachmann avatar Jan 09 '22 19:01 maxbachmann