strobealign icon indicating copy to clipboard operation
strobealign copied to clipboard

Try other syncmer hash functions

Open marcelm opened this issue 2 years ago • 14 comments

Strobealign currently uses xxh64 to hash syncmers.

We could switch to a different, faster hash function as long as accuracy is at least as good as now.

These are some options for hash functions. The first four are those listed on https://github.com/ksahlin/strobemers/tree/main/randstrobe_implementations:

  1. No hash (2-bit encoded nucleotides)
  2. Thomas Wang hash
  3. xxhash
  4. wyhash
  5. FNV-1a
  6. fxhash
  7. xxh3

When looking at published benchmark results, we need to keep in mind that our use case is hashing a single 64-bit value. Benchmarking results obtained on larger amounts of data may not be representative.

A first glance at the "small data" benchmark by the xxhash authors makes it appear as if FNV-1a ("fnv64") was the slowest, but it is actually quite ok when applied to 8 input bytes. The wyhash benchmark shows wyhash ("wyhash_final3") as being the fastest among the tested ones, but I think it doesn’t show how it would perform on a single 64 bit input number.

There is the great smhasher hash benchmark page. They list xxh3 and wyhash in their summary among the fastest hash functions without quality problems. But they also say:

When used in a hash table the instruction cache will usually beat the CPU and throughput measured here. In my tests the smallest FNV1A beats the fastest crc32_hw1 with Perl 5 hash tables. Even if those worse hash functions will lead to more collisions, the overall speed advantage and inline-ability beats the slightly worse quality.

That’s why I listed FNV-1a above. Its implementation is also super simple, just ~6 lines of code compared to the 5000-line xxhash.h file.

fxhash is a faster variant of FNV-1a that operates on 8 bytes at a time (instead of 1 byte). It is used in the rust compiler. Also, if applied to a single 64 bit value, it is equivalent to an integer multiplication by a special constant, so in our case, its implementation would be even simpler than FNV-1a.

I will measure how runtime changes when switching to the "no hash" variant. That gives us an upper bound for the highest achievable speedup.

marcelm avatar Jan 30 '23 13:01 marcelm

Sounds great.

I will measure how runtime changes when switching to the "no hash" variant. That gives us an upper bound for the highest achievable speedup.

Yes, I like this benchmark.

ksahlin avatar Jan 30 '23 14:01 ksahlin

The potential is ... limited:

variant runtime (s)
no hash 19.20
fxhash 19.30
xxh64 19.30
xxh3 19.30
wyhash 19.35
FNV-1a 19.45

Let’s see how the accuracy changes. It would be great if xxh could be replaced with, for example, fxhash because we would then not have to include a 5000-line header file anymore.

marcelm avatar Jan 30 '23 23:01 marcelm

Wow okay.. My benchmarks where I got 2.5x speedup (from what I recall) with wyhash over xxh3 were probably hashing a struct { uint64_t high; uint64_t low; } int128; (corresponding to a concatenation of two strobes which gives better randomness but much slower). Something with compilation/system might have affected it too (I used -O3 -mavx2).

Yes, fxhash sounds interesting!

ksahlin avatar Jan 31 '23 05:01 ksahlin

To clarify: The test above includes the full strobealign run, so the numbers could be a bit biased if anything done after the hashing also changes (I should have used the dumpstrobes command for benchmarking). But accuracy changes only a little, so I think the results are valid.

marcelm avatar Jan 31 '23 09:01 marcelm

Interesting that accuracy changes (although expected). The question is how much. It would also be dangerous to overfit (non-significant results) for particular genomes.

ksahlin avatar Jan 31 '23 10:01 ksahlin

Here are accuracy measurements for a couple of datasets. (Dmel is Drosophila melanogaster)

hash function Dmel 2x50 Dmel 2x100 Dmel 2x300 CHM13 2x50 CHM13 2x100 CHM13 2x200 CHM13 2x300 rye 2x150
no hash 90.081 92.377 95.359 90.492 93.198 94.386 95.571 90.156
fnv1a 90.077 92.379 95.369 90.458 93.230 94.412 95.640 90.208
fxhash 90.086 92.406 95.346 90.472 93.191 94.374 95.570 90.191
xxh3 90.090 92.357 95.358 90.447 93.213 94.414 95.645 90.198
wyhash 90.084 92.398 95.368 90.465 93.222 94.421 95.642 90.2113
xxh64 (main) 90.064 92.381 95.366 90.465 93.227 94.415 95.626 (not finished)

The differences are small, but it’s interesting to see that our current method is never the best one.

Given these results, I’d tend to just leave this as it is for the moment and focus on other things.

marcelm avatar Jan 31 '23 15:01 marcelm

Nice benchmark!

To be a little complicated: mapping accuracy (if extension alignment) might not give the full picture of hash function efficiency, because a poor hash function may introduce repetitive seeds and thus the need to enter, e.g., rescue mode (affecting runtime). I think using -x would be a bit closer to true performance of only the seed and hash function (although rescue mode exist also on 'mapping-only' level). I have however no reason to suspect any method being particularly poor.

And I agree with focusing on other things.

ksahlin avatar Jan 31 '23 16:01 ksahlin

Since there are so many fallback mechanisms in strobealign (rescue at mapping level and rescue at align level using mate), wouldn't the universally best way to measure hash performance be to simply compute the number of unique seeds in the reference index?

(Not saying we should do this experiment now though. Just leaving the comment here for future.)

ksahlin avatar Jan 31 '23 18:01 ksahlin

Since I still have the log files, this took only a couple of minutes. These are the "Total distinct strobemers" numbers from the logs (did you mean this?).

hash function Dmel 2x50 Dmel 2x100 Dmel 2x300 CHM13 2x50 CHM13 2x100 CHM13 2x200 CHM13 2x300 rye 2x150
nohash 22750563 22971849 23293697 454572197 475234271 493298260 497403512 622910601
fnv1a 22795679 22989336 23320598 463666577 479837017 497138562 500692385 642900512
fxhash 22754003 22968128 23291264 454916301 475312819 493645197 497529305 623290884
xxh3 22798572 22996317 23324759 464126703 480573575 497846756 501118001 646396969
wyhash 22798170 22995566 23324883 464163332 480586570 497778591 501159065 646327459
main 22797906 22995999 23324540 464170595 480551909 497767127 501102347 646351161

marcelm avatar Jan 31 '23 20:01 marcelm

Yes, that is what I meant. I think this may be spot on.

Naturally nohash is worst, which is matching my intuition. Highly conserved regions (with some but few mutations) more likely to get the same seed, but with a good hash function increasing the distance for close-by strings we are more likely to get new seed combinations (in a vague sense 'precision'). Of course, collisions between close-by strings (which is what randstrobes emulates) is also a feature. So it is difficult to asses this in full, but I still think these two statements hold:

  1. An 'as uniform as possible' hash function is good
  2. Uniformity can be estimated by number of distinct seeds with more distinct seeds -> higher uniformity.

If 1 and 2 hold, it looks like xxh3, wyhash, and xxh64 are the top ones.

ksahlin avatar Jan 31 '23 21:01 ksahlin

Coming back to this, currently, we don't hash the s-mer within the k-mer that decides the syncmer to sample. We skip this because of speed. Simply hashing the 'final' syncmer is much faster than hashing every single s-mer we iterate through.

However, hashing the s-mer results in more drastic changes in the number of unique seeds. This does not come as a surprise. Our analysis above showed that no-hash is worse in uniqueness than any other hash function.

I did a small test where I logged the number of unique seeds on Dmel when constructing the s-mers with no-hash (currently used) vs. XXH64. The other seed settings were: k: 20, s: 16, w_min: 4, w_max: 9, and Maximum seed length: 275 (note, they do not correspond to any of the default strobalign settings, however, I tried some different parametrizations too and the effect is still there.)

In summary, there is an ~8% increase in the number of distinct seeds for this parameter setting (detailed results below). Since this is Dmel, I assume the number of distinct seeds will increase even more on rye and other repetitive genomes.

The trade-off is speed. Replacing no-hash with XXH64 makes indexing about 20% slower (a rough estimate on Dmel). But using XXH64 on all s-mers may be overkill. I think that a simple rolling hash function (https://github.com/bcgsc/ntHash or similar) may be nearly as fast as no-hash while still improving the randomness of s-mers.

The number of distinct seeds may not always translate to better mapping results because of downstream fallback mechanisms (e.g., rescue or trying exact alignment to more candidate sites) in strobealign. However, it should, in principle, lead to better mapping. Also, not improving the number of unique seeds because, e.g., rescue is activated less often, is an indirect hack and a poor argument for keeping unnecessary repetitive seeds.

WITH NO-HASH

Total time indexing: 4.99 s
Unique strobemers: 23201999
Total strobemers count: 26458559
Total strobemers occur once: 22442702
Fraction Unique: 0.97
Total strobemers highly abundant > 100: 1700
Total strobemers mid abundance (between 2-100): 757597
Total distinct strobemers stored: 23201999
Ratio distinct to highly abundant: 13648
Ratio distinct to non distinct: 30

WITH XXH64

Total time indexing: 6.21 s
Unique strobemers: 25025203
Total strobemers count: 28498618
Total strobemers occur once: 24212064
Fraction Unique: 0.97
Total strobemers highly abundant > 100: 1813
Total strobemers mid abundance (between 2-100): 811326
Total distinct strobemers stored: 25025203
Ratio distinct to highly abundant: 13803
Ratio distinct to non distinct: 30

ksahlin avatar May 22 '23 15:05 ksahlin

Correction to my previous post: What I observed above was that

(i) the number of sampled seeds goes up by ~8% (ii) The uniqueness improves slightly (Ratio distinct to highly abundant goes up to 13803 from 13648)

Now, the theoretical expected density for open syncmers is 1/(k-s+1) which becomes 0.2 for k=20, s=16. The Dmel genome I use is about 146Mb sequence. From above numbers I get a density of 0.195 (28.5M/146M) for XXH64 and 0.18 for NO-HASH (26.5M/146M). So it could suggest that a more random hash function gives more more random sampling which in turn places the sampled density closer to the expected density.

ksahlin avatar May 22 '23 23:05 ksahlin

Since I discovered XXH_INLINE_ALL (PR #302), I wanted to see whether it can mitigate some of the lost speed when using XXH64 on the s-mer. This seems to be the case: The slowdown is reduced from 17% to 12%. I measured this on Drosophila.

variant runtime
no-hash 9.8 s 100%
XXH64 11.5 s 117%
XXH64 with XXH_INLINE_ALL 10.9 s 112%

I also looked briefly at how accuracy changes, and here is how it looks:

dataset main hashed syncmer
drosophila-50 94.4233 94.8962 (+0.4729)
drosophila-100 99.4656 99.5262 (+0.0606)
drosophila-200 99.7226 99.7272 (+0.0046)
drosophila-300 99.6863 99.6865 (+0.0002)
maize-50 95.3226 95.8508 (+0.5282)
maize-100 99.4867 99.5278 (+0.0411)
maize-200 99.7433 99.7461 (+0.0028)
maize-300 99.6957 99.6956 (-0.0001)
CHM13-50 94.2214 94.9256 (+0.7042)
CHM13-100 99.4495 99.5134 (+0.0639)
CHM13-200 99.732 99.7406 (+0.0086)
CHM13-300 99.693 99.6932 (+0.0002)

So read length 50 benefits most.

marcelm avatar Jun 15 '23 07:06 marcelm

12% increase in seeding speed to 0.5-0.7% increase in accuracy (maybe even more on rye) - seems like a very beneficial tradeoff?

Do you have any idea on how it affects total runtime? I am thinking that slightly more seeds will also affect memory and runtime in other parts of strobealign.

Regardless, this finally provides 'unbiased' sampling of syncmers, which is good also from a theory perspective and future work.

ksahlin avatar Jun 15 '23 09:06 ksahlin