polyleven
Optimize Myers1999 for large dictionary filtering
This is a sketch of an improvement I'm considering incorporating into polyleven.
How is polyleven used in real life?
- The typical use case for polyleven is a "finding a needle in a haystack"-type task.
- e.g. "I want to find DNA sequences similar to this input in a database of 1 billion entries."
- So the task is essentially a filtering/search problem.
What is the optimization idea?
- As described in the original paper, Myers1999 requires precomputing a bit-pattern table.
- This precomputation is actually very expensive -- I estimate it takes at least 30% of the total computation budget.
- The viable optimization path here is to cache the bit-pattern table.
What does the new interface look like?
- I think a `map` function fits very well here:

```python
>>> polyleven.map("abcde", ["abc", "acde", ...])
```
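To make the caching idea concrete, here is a stdlib-only Python sketch of what such a `map` could do internally: build the Myers1999 bit-pattern (PEQ) table once per query and reuse it for every candidate. The names `build_peq`, `myers1999`, and `map_query` are hypothetical, and the sketch assumes `1 <= len(query) <= 64` so the pattern fits in a single machine word.

```python
def build_peq(query):
    """Bit-pattern table: for each character, a bitmask of its positions in query."""
    peq = {}
    for i, ch in enumerate(query):
        peq[ch] = peq.get(ch, 0) | (1 << i)
    return peq

def myers1999(peq, m, text):
    """Bit-parallel Levenshtein distance (Myers 1999) for 1 <= m <= 64."""
    mask = (1 << m) - 1        # keep Python's unbounded ints to m bits
    high = 1 << (m - 1)        # bit tracking the bottom row of the DP matrix
    vp, vn, dist = mask, 0, m  # vertical positive/negative deltas, current score
    for ch in text:
        eq = peq.get(ch, 0)
        xv = eq | vn
        xh = ((((eq & vp) + vp) ^ vp) | eq) & mask
        ph = (vn | ~(xh | vp)) & mask
        mh = vp & xh
        if ph & high:          # horizontal delta +1 in the last row
            dist += 1
        elif mh & high:        # horizontal delta -1 in the last row
            dist -= 1
        ph = ((ph << 1) | 1) & mask
        mh = (mh << 1) & mask
        vp = (mh | ~(xv | ph)) & mask
        vn = ph & xv
    return dist

def map_query(query, candidates):
    """Hypothetical polyleven.map(): the PEQ table is built once, not per pair."""
    peq = build_peq(query)
    m = len(query)
    return [myers1999(peq, m, c) for c in candidates]
```

For example, `map_query("abcde", ["abc", "acde"])` returns `[2, 1]`.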
How much improvement will we see?
- The best case is that we'll see 30-40% speed-up with this optimization.
- IOW, polyleven will be able to process ~7 million entries per second on a consumer-grade CPU (such as an AMD Ryzen 7).
Here is a proof-of-concept of this idea:
- https://ceptord.net/20220515-dna100m.html
In short, I achieved a throughput of 14.2 million entries per second on my desktop machine with a Ryzen 7 4700GE.
Now I need to find a way to port this improvement to polyleven...
As a reference, here are the improvements I implemented in rapidfuzz:
1. I implemented a more specialized hashmap: https://github.com/maxbachmann/rapidfuzz-cpp/blob/main/rapidfuzz/details/PatternMatchVector.hpp
   This implementation has the following advantages over the one in polyleven:
   a) for extended ASCII, which is likely a large part of the input, lookup comes down to a direct array access;
   b) for long ASCII texts this allows sequential memory access when implementing the blockwise Levenshtein distance as described in 2);
   c) for non-ASCII text I use a more advanced hash collision strategy, which is not going to fall over on input like `"abcdefg" + chr(ord("a") + 128)`.
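The hybrid storage idea can be illustrated with a short Python sketch (illustrative only; the real `PatternMatchVector.hpp` is C++ and works on 64-bit blocks; the class name `PatternTable` is made up here). A flat 256-entry array serves extended-ASCII code points with a direct index, and a dict handles everything else:

```python
class PatternTable:
    """Hybrid bit-pattern storage: array for code points < 256, dict fallback."""

    def __init__(self, query):
        self.ascii = [0] * 256  # direct-indexed: no hashing for extended ASCII
        self.other = {}         # fallback for the long tail of Unicode
        for i, ch in enumerate(query):
            code = ord(ch)
            if code < 256:
                self.ascii[code] |= 1 << i
            else:
                self.other[code] = self.other.get(code, 0) | (1 << i)

    def get(self, ch):
        code = ord(ch)
        if code < 256:
            return self.ascii[code]
        return self.other.get(code, 0)
```

Problematic inputs like `"abcdefg" + chr(ord("a") + 128)` stay entirely in the array path here, since `ord("a") + 128 = 225 < 256`.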
2. I have a blockwise implementation inverted relative to yours. You appear to iterate:

   ```
   for block in blocks:
       for ch in s2:
   ```

   while I use:

   ```
   for ch in s2:
       for block in blocks:
   ```

   As described in 1), for ASCII this comes down to sequential memory accesses:

   ```
   for ch in s2:
       uint64_t* block_ptr = PM.get(0, ch)
       for block in blocks:
           ...
           block_ptr++
   ```
The improvements from 1) and 2) lead to a significant performance difference between rapidfuzz and polyleven.
3. I provide APIs similar to the ones you describe here: `rapidfuzz.process.extract` / `rapidfuzz.process.extractOne` / `rapidfuzz.process.cdist`. In Python these have the large advantage that it is no longer required to call into Python all the time, which saves a large percentage of the runtime for short strings.
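To illustrate the interface shape only (not the speedup -- a pure-Python loop is exactly what the native batch APIs avoid), here is a stdlib-only sketch of an `extractOne`-style helper. `levenshtein` and `extract_one` are hypothetical stand-ins; the distance is a plain dynamic-programming implementation:

```python
def levenshtein(a, b):
    """Plain DP Levenshtein distance (stand-in for a fast native implementation)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            # deletion, insertion, substitution/match
            cur.append(min(prev[j] + 1, cur[-1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def extract_one(query, choices):
    """Return (best_choice, distance, index) for the closest entry in choices."""
    idx, choice = min(enumerate(choices), key=lambda ic: levenshtein(query, ic[1]))
    return choice, levenshtein(query, choice), idx
```

For example, `extract_one("abcde", ["abc", "acde"])` returns `("acde", 1, 1)`. The point of a batch API is that this whole loop, including the scoring calls, runs on the native side.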
4. As you noted, this allows you to store the precomputed bit-pattern table. For short strings this matters a lot, since it can take upwards of 50% of the computation time. For longer strings, however, it becomes irrelevant.
For short texts (10 characters) the improvements from 3) and 4) lead to a performance improvement of up to 10x, depending on the similarity metric used.
5. You proposed SIMD for large texts in #7. However, SIMD is especially helpful for short strings. E.g. AVX2 allows you to calculate the similarity for 32 eight-character texts in parallel. I did not have the time to implement this for Levenshtein yet, but I implemented it for the Indel distance (insertions and deletions only). Here is the PR with performance comparisons for C++: https://github.com/maxbachmann/rapidfuzz-cpp/pull/81. E.g. for eight-character texts this has the following performance on a laptop CPU from 2017 (i7-8550U):
   - directly calculating similarity: 27 × 10^6 text comparisons per second
   - caching the bitmap: 77 × 10^6 text comparisons per second
   - caching the bitmap + SSE2: 1.25 × 10^9 text comparisons per second
   - caching the bitmap + AVX2: 2.8 × 10^9 text comparisons per second
I hope this helps :) In addition, in case you do not have an issue with C++, we might want to combine forces on this, since we have pretty similar implementations and I have already implemented a few of the features you would like to incorporate.