strobealign
Try other syncmer hash functions
Strobealign currently uses xxh64 to hash syncmers.
We could switch to a different, faster hash function as long as accuracy is at least as good as now.
These are some options for hash functions. The first four are those listed on https://github.com/ksahlin/strobemers/tree/main/randstrobe_implementations:
When looking at published benchmark results, we need to keep in mind that our use case is hashing a single 64-bit value. Benchmarking results obtained on larger amounts of data may not be representative.
A first glance at the "small data" benchmark by the xxHash authors makes it appear as if FNV-1a ("fnv64") were the slowest, but it is actually quite competitive when applied to 8 input bytes. The wyhash benchmark shows wyhash ("wyhash_final3") as the fastest among those tested, but I don’t think it shows how it would perform on a single 64-bit input number.
There is the great smhasher hash benchmark page. They list xxh3 and wyhash in their summary among the fastest hash functions without quality problems. But they also say:
> When used in a hash table the instruction cache will usually beat the CPU and throughput measured here. In my tests the smallest FNV1A beats the fastest crc32_hw1 with Perl 5 hash tables. Even if those worse hash functions will lead to more collisions, the overall speed advantage and inline-ability beats the slightly worse quality.
That’s why I listed FNV-1a above. Its implementation is also super simple, just ~6 lines of code compared to the 5000-line xxhash.h file.
fxhash is a faster variant of FNV-1a that operates on 8 bytes at a time (instead of 1 byte). It is used in the rust compiler. Also, if applied to a single 64 bit value, it is equivalent to an integer multiplication by a special constant, so in our case, its implementation would be even simpler than FNV-1a.
I will measure how runtime changes when switching to the "no hash" variant. That gives us an upper bound for the highest achievable speedup.
Sounds great.
> I will measure how runtime changes when switching to the "no hash" variant. That gives us an upper bound for the highest achievable speedup.
Yes, I like this benchmark.
The potential is ... limited:
variant | runtime (s) |
---|---|
no hash | 19.20 |
fxhash | 19.30 |
xxh64 | 19.30 |
xxh3 | 19.30 |
wyhash | 19.35 |
FNV-1a | 19.45 |
Let’s see how the accuracy changes. It would be great if xxh could be replaced with, for example, fxhash because we would then not have to include a 5000-line header file anymore.
Wow, okay. My benchmarks where I got a 2.5x speedup (from what I recall) with wyhash over xxh3 were probably hashing a `struct { uint64_t high; uint64_t low; } int128;` (corresponding to a concatenation of two strobes, which gives better randomness but is much slower). Something with the compilation/system might have affected it too (I used `-O3 -mavx2`).
Yes, fxhash sounds interesting!
To clarify: The test above includes the full strobealign run, so the numbers could be a bit biased if anything done after the hashing also changes (I should have used the `dumpstrobes` command for benchmarking). But accuracy changes only a little, so I think the results are valid.
Interesting that accuracy changes (although expected). The question is how much. It would also be dangerous to overfit to non-significant results on particular genomes.
Here are accuracy measurements for a couple of datasets. (Dmel is Drosophila melanogaster)
hash function | Dmel 2x50 | Dmel 2x100 | Dmel 2x300 | CHM13 2x50 | CHM13 2x100 | CHM13 2x200 | CHM13 2x300 | rye 2x150 |
---|---|---|---|---|---|---|---|---|
no hash | 90.081 | 92.377 | 95.359 | 90.492 | 93.198 | 94.386 | 95.571 | 90.156 |
fnv1a | 90.077 | 92.379 | 95.369 | 90.458 | 93.230 | 94.412 | 95.640 | 90.208 |
fxhash | 90.086 | 92.406 | 95.346 | 90.472 | 93.191 | 94.374 | 95.570 | 90.191 |
xxh3 | 90.090 | 92.357 | 95.358 | 90.447 | 93.213 | 94.414 | 95.645 | 90.198 |
wyhash | 90.084 | 92.398 | 95.368 | 90.465 | 93.222 | 94.421 | 95.642 | 90.2113 |
xxh64 (main) | 90.064 | 92.381 | 95.366 | 90.465 | 93.227 | 94.415 | 95.626 | (not finished) |
The differences are small, but it’s interesting to see that our current method is never the best one.
Given these results, I’d tend to just leave this as it is for the moment and focus on other things.
Nice benchmark!
To complicate things a little: mapping accuracy (with extension alignment) might not give the full picture of hash function efficiency, because a poor hash function may introduce repetitive seeds and thus the need to enter, e.g., rescue mode (affecting runtime). I think using `-x` would be a bit closer to the true performance of only the seed and hash function (although rescue mode exists also at the 'mapping-only' level). I have, however, no reason to suspect any method is particularly poor.
And I agree with focusing on other things.
Since there are so many fallback mechanisms in strobealign (rescue at mapping level and rescue at align level using mate), wouldn't the universally best way to measure hash performance be to simply compute the number of unique seeds in the reference index?
(Not saying we should do this experiment now though. Just leaving the comment here for future.)
Since I still have the log files, this took only a couple of minutes. These are the "Total distinct strobemers" numbers from the logs (did you mean this?).
hash function | Dmel 2x50 | Dmel 2x100 | Dmel 2x300 | CHM13 2x50 | CHM13 2x100 | CHM13 2x200 | CHM13 2x300 | rye 2x150 |
---|---|---|---|---|---|---|---|---|
nohash | 22750563 | 22971849 | 23293697 | 454572197 | 475234271 | 493298260 | 497403512 | 622910601 |
fnv1a | 22795679 | 22989336 | 23320598 | 463666577 | 479837017 | 497138562 | 500692385 | 642900512 |
fxhash | 22754003 | 22968128 | 23291264 | 454916301 | 475312819 | 493645197 | 497529305 | 623290884 |
xxh3 | 22798572 | 22996317 | 23324759 | 464126703 | 480573575 | 497846756 | 501118001 | 646396969 |
wyhash | 22798170 | 22995566 | 23324883 | 464163332 | 480586570 | 497778591 | 501159065 | 646327459 |
main | 22797906 | 22995999 | 23324540 | 464170595 | 480551909 | 497767127 | 501102347 | 646351161 |
Yes, that is what I meant. I think this may be spot on.
Naturally, nohash is worst, which matches my intuition. Highly conserved regions (with some but few mutations) are more likely to get the same seed, but with a good hash function increasing the distance between close-by strings, we are more likely to get new seed combinations (in a vague sense, 'precision'). Of course, collisions between close-by strings (which is what randstrobes emulate) are also a feature. So it is difficult to assess this in full, but I still think these two statements hold:

1. An 'as uniform as possible' hash function is good.
2. Uniformity can be estimated by the number of distinct seeds: more distinct seeds -> higher uniformity.

If 1 and 2 hold, it looks like xxh3, wyhash, and xxh64 are the top ones.
Coming back to this: currently, we don't hash the s-mer within the k-mer that decides which syncmer to sample. We skip this for speed; simply hashing the 'final' syncmer is much faster than hashing every single s-mer we iterate through.
However, hashing the s-mers results in more drastic changes in the number of unique seeds. This does not come as a surprise: our analysis above showed that no-hash is worse in uniqueness than any other hash function.
I did a small test where I logged the number of unique seeds on Dmel when constructing the s-mers with no-hash (currently used) vs. XXH64. The other seed settings were: k: 20, s: 16, w_min: 4, w_max: 9, and maximum seed length: 275 (note, these do not correspond to any of the default strobealign settings; however, I tried some different parametrizations too and the effect is still there).
In summary, there is an ~8% increase in the number of distinct seeds for this parameter setting (detailed results below). Since this is Dmel, I assume the number of distinct seeds will increase even more on rye and other repetitive genomes.
The trade-off is speed. Replacing no-hash with XXH64 makes indexing about 20% slower (a rough estimate on Dmel). But using XXH64 on all s-mers may be overkill. I think that a simple rolling hash function (https://github.com/bcgsc/ntHash or similar) may be nearly as fast as no-hash while still improving the randomness of s-mers.
The number of distinct seeds may not always translate to better mapping results because of downstream fallback mechanisms (e.g., rescue or trying exact alignment to more candidate sites) in strobealign. However, it should, in principle, lead to better mapping. Also, not improving the number of unique seeds because, e.g., rescue is activated less often, is an indirect hack and a poor argument for keeping unnecessary repetitive seeds.
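For illustration, here is a minimal sketch of open-syncmer selection with hashed s-mer comparison. The names and the fxhash-style stand-in hash are my own; strobealign's actual implementation differs:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Stand-in s-mer hash; the no-hash variant would return x unchanged.
static inline uint64_t smer_hash(uint64_t x) {
    return x * 0x517cc1b727220a95ULL;  // fxhash-style multiply
}

// A k-mer spanning the s-mer codes [i, i + k - s] is an open syncmer when
// the minimal hashed s-mer sits at a fixed offset t within the k-mer
// (t = 0 at one end, or a middle position). `smers` holds the integer
// codes of consecutive s-mers along the sequence.
bool is_open_syncmer(const std::vector<uint64_t>& smers,
                     size_t i, int k, int s, int t) {
    int w = k - s + 1;  // number of s-mers inside the k-mer
    uint64_t best = smer_hash(smers[i]);
    int best_pos = 0;
    for (int j = 1; j < w; ++j) {
        uint64_t h = smer_hash(smers[i + j]);
        if (h < best) { best = h; best_pos = j; }
    }
    return best_pos == t;
}
```

With no-hash, the minimum is taken over the raw s-mer codes, so near-identical regions tend to pick the same positions; a randomizing `smer_hash` is what makes the sampling closer to uniform.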
WITH NO-HASH
Total time indexing: 4.99 s
Unique strobemers: 23201999
Total strobemers count: 26458559
Total strobemers occur once: 22442702
Fraction Unique: 0.97
Total strobemers highly abundant > 100: 1700
Total strobemers mid abundance (between 2-100): 757597
Total distinct strobemers stored: 23201999
Ratio distinct to highly abundant: 13648
Ratio distinct to non distinct: 30
WITH XXH64
Total time indexing: 6.21 s
Unique strobemers: 25025203
Total strobemers count: 28498618
Total strobemers occur once: 24212064
Fraction Unique: 0.97
Total strobemers highly abundant > 100: 1813
Total strobemers mid abundance (between 2-100): 811326
Total distinct strobemers stored: 25025203
Ratio distinct to highly abundant: 13803
Ratio distinct to non distinct: 30
Correction to my previous post: What I observed above was that (i) the number of sampled seeds goes up by ~8%, and (ii) the uniqueness improves slightly (the ratio of distinct to highly abundant seeds goes up from 13648 to 13803).
Now, the theoretical expected density for open syncmers is 1/(k-s+1), which becomes 0.2 for k=20, s=16. The Dmel genome I use is about 146Mb of sequence. From the above numbers I get a density of 0.195 (28.5M/146M) for XXH64 and 0.18 (26.5M/146M) for NO-HASH. So this could suggest that a more random hash function gives more random sampling, which in turn places the sampled density closer to the expected density.
Since I discovered `XXH_INLINE_ALL` (PR #302), I wanted to see whether it can mitigate some of the lost speed when using XXH64 on the s-mers. This seems to be the case: the slowdown is reduced from 17% to 12%. I measured this on Drosophila.
variant | runtime | relative
---|---|---
no-hash | 9.8 s | 100%
XXH64 | 11.5 s | 117%
XXH64 with `XXH_INLINE_ALL` | 10.9 s | 112%
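For context, `XXH_INLINE_ALL` is part of xxHash's public API: defining it before including the header makes all functions static (and thus inlinable), so a small fixed-size call can be inlined at the call site instead of going through a function call per s-mer:

```cpp
// Must be defined before the include; turns the xxHash API into static
// inline functions so calls like XXH64(&x, sizeof(x), 0) can be inlined.
#define XXH_INLINE_ALL
#include "xxhash.h"
```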
I also looked briefly at how accuracy changes, and here is how it looks:
dataset | main | hashed syncmer |
---|---|---|
drosophila-50 | 94.4233 | 94.8962 (+0.4729) |
drosophila-100 | 99.4656 | 99.5262 (+0.0606) |
drosophila-200 | 99.7226 | 99.7272 (+0.0046) |
drosophila-300 | 99.6863 | 99.6865 (+0.0002) |
maize-50 | 95.3226 | 95.8508 (+0.5282) |
maize-100 | 99.4867 | 99.5278 (+0.0411) |
maize-200 | 99.7433 | 99.7461 (+0.0028) |
maize-300 | 99.6957 | 99.6956 (-0.0001) |
CHM13-50 | 94.2214 | 94.9256 (+0.7042) |
CHM13-100 | 99.4495 | 99.5134 (+0.0639) |
CHM13-200 | 99.732 | 99.7406 (+0.0086) |
CHM13-300 | 99.693 | 99.6932 (+0.0002) |
So read length 50 benefits most.
A 12% increase in seeding time in exchange for a 0.5-0.7 percentage point increase in accuracy (maybe even more on rye) - seems like a very beneficial tradeoff?
Do you have any idea of how it affects total runtime? I am thinking that slightly more seeds will also affect memory and runtime in other parts of strobealign.
Regardless, this finally provides 'unbiased' sampling of syncmers, which is good also from a theory perspective and for future work.