strobealign
Try other syncmer hash functions
Strobealign currently uses xxh64 to hash syncmers.
We could switch to a different, faster hash function as long as accuracy is at least as good as now.
These are some options for hash functions. The first four are those listed on https://github.com/ksahlin/strobemers/tree/main/randstrobe_implementations:
When looking at published benchmark results, we need to keep in mind that our use case is hashing a single 64-bit value. Benchmarking results obtained on larger amounts of data may not be representative.
A first glance at the "small data" benchmark by the xxHash authors makes it appear as if FNV-1a ("fnv64") were the slowest, but it is actually quite competitive when applied to 8 input bytes. The wyhash benchmark shows wyhash ("wyhash_final3") as the fastest among those tested, but I don’t think it shows how it would perform on a single 64-bit input number.
There is the great smhasher hash benchmark page. They list xxh3 and wyhash in their summary among the fastest hash functions without quality problems. But they also say:
> When used in a hash table the instruction cache will usually beat the CPU and throughput measured here. In my tests the smallest FNV1A beats the fastest crc32_hw1 with Perl 5 hash tables. Even if those worse hash functions will lead to more collisions, the overall speed advantage and inline-ability beats the slightly worse quality.
That’s why I listed FNV-1a above. Its implementation is also super simple, just ~6 lines of code compared to the 5000-line xxhash.h file.
fxhash is a faster variant of FNV-1a that operates on 8 bytes at a time (instead of 1 byte). It is used in the rust compiler. Also, if applied to a single 64 bit value, it is equivalent to an integer multiplication by a special constant, so in our case, its implementation would be even simpler than FNV-1a.
I will measure how runtime changes when switching to the "no hash" variant. That gives us an upper bound for the highest achievable speedup.
Sounds great.
> I will measure how runtime changes when switching to the "no hash" variant. That gives us an upper bound for the highest achievable speedup.
Yes, I like this benchmark.
The potential is ... limited:
variant | runtime (s) |
---|---|
no hash | 19.20 |
fxhash | 19.30 |
xxh64 | 19.30 |
xxh3 | 19.30 |
wyhash | 19.35 |
FNV-1a | 19.45 |
Let’s see how the accuracy changes. It would be great if xxh could be replaced with, for example, fxhash because we would then not have to include a 5000-line header file anymore.
Wow, okay. My benchmarks where I got a 2.5x speedup (from what I recall) with wyhash over xxh3 were probably hashing a `struct { uint64_t high; uint64_t low; } int128;` (corresponding to a concatenation of two strobes, which gives better randomness but is much slower). Something with the compilation/system might have affected it too (I used `-O3 -mavx2`).
Yes, fxhash sounds interesting!
To clarify: The test above includes the full strobealign run, so the numbers could be a bit biased if anything done after the hashing also changes (I should have used the `dumpstrobes` command for benchmarking). But accuracy changes only a little, so I think the results are valid.
Interesting that accuracy changes (although expected). The question is how much. It would also be dangerous to overfit to non-significant results on particular genomes.
Here are accuracy measurements for a couple of datasets. (Dmel is Drosophila melanogaster)
hash function | Dmel 2x50 | Dmel 2x100 | Dmel 2x300 | CHM13 2x50 | CHM13 2x100 | CHM13 2x200 | CHM13 2x300 | rye 2x150 |
---|---|---|---|---|---|---|---|---|
no hash | 90.081 | 92.377 | 95.359 | 90.492 | 93.198 | 94.386 | 95.571 | 90.156 |
fnv1a | 90.077 | 92.379 | 95.369 | 90.458 | 93.230 | 94.412 | 95.640 | 90.208 |
fxhash | 90.086 | 92.406 | 95.346 | 90.472 | 93.191 | 94.374 | 95.570 | 90.191 |
xxh3 | 90.090 | 92.357 | 95.358 | 90.447 | 93.213 | 94.414 | 95.645 | 90.198 |
wyhash | 90.084 | 92.398 | 95.368 | 90.465 | 93.222 | 94.421 | 95.642 | 90.2113 |
xxh64 (main) | 90.064 | 92.381 | 95.366 | 90.465 | 93.227 | 94.415 | 95.626 | (not finished) |
The differences are small, but it’s interesting to see that our current method is never the best one.
Given these results, I’d tend to just leave this as it is for the moment and focus on other things.
Nice benchmark!
To complicate things a little: mapping accuracy (with extension alignment) might not give the full picture of hash function efficiency, because a poor hash function may introduce repetitive seeds and thus the need to enter, e.g., rescue mode (affecting runtime). I think using `-x` would be a bit closer to the true performance of only the seed and hash function (although rescue mode exists also at the 'mapping-only' level). I have, however, no reason to suspect any method is particularly poor.
And I agree with focusing on other things.
Since there are so many fallback mechanisms in strobealign (rescue at mapping level and rescue at align level using mate), wouldn't the universally best way to measure hash performance be to simply compute the number of unique seeds in the reference index?
(Not saying we should do this experiment now though. Just leaving the comment here for future.)
Since I still have the log files, this took only a couple of minutes. These are the "Total distinct strobemers" numbers from the logs (did you mean this?).
hash function | Dmel 2x50 | Dmel 2x100 | Dmel 2x300 | CHM13 2x50 | CHM13 2x100 | CHM13 2x200 | CHM13 2x300 | rye 2x150 |
---|---|---|---|---|---|---|---|---|
nohash | 22750563 | 22971849 | 23293697 | 454572197 | 475234271 | 493298260 | 497403512 | 622910601 |
fnv1a | 22795679 | 22989336 | 23320598 | 463666577 | 479837017 | 497138562 | 500692385 | 642900512 |
fxhash | 22754003 | 22968128 | 23291264 | 454916301 | 475312819 | 493645197 | 497529305 | 623290884 |
xxh3 | 22798572 | 22996317 | 23324759 | 464126703 | 480573575 | 497846756 | 501118001 | 646396969 |
wyhash | 22798170 | 22995566 | 23324883 | 464163332 | 480586570 | 497778591 | 501159065 | 646327459 |
main | 22797906 | 22995999 | 23324540 | 464170595 | 480551909 | 497767127 | 501102347 | 646351161 |
Yes, that is what I meant. I think this may be spot on.
Naturally, nohash is worst, which matches my intuition. Highly conserved regions (with some but few mutations) are more likely to get the same seed, but with a good hash function increasing the distance between close-by strings, we are more likely to get new seed combinations (in a vague sense, 'precision'). Of course, collisions between close-by strings (which is what randstrobes emulate) are also a feature. So it is difficult to assess this in full, but I still think these two statements hold:

1. An 'as uniform as possible' hash function is good.
2. Uniformity can be estimated by the number of distinct seeds: more distinct seeds -> higher uniformity.

If 1 and 2 hold, it looks like xxh3, wyhash, and xxh64 are the top ones.
Coming back to this: currently, we don't hash the s-mer within the k-mer that decides which syncmer to sample. We skip this for speed; simply hashing the 'final' syncmer is much faster than hashing every single s-mer we iterate through.
However, hashing the s-mers results in more drastic changes in the number of unique seeds. This does not come as a surprise: our analysis above showed that no-hash is worse in uniqueness than any other hash function.
I did a small test where I logged the number of unique seeds on Dmel when constructing the s-mers with no-hash (currently used) vs. XXH64. The other seed settings were: k: 20, s: 16, w_min: 4, w_max: 9, and maximum seed length: 275 (note, these do not correspond to any of the default strobealign settings; however, I tried some different parametrizations too and the effect is still there).
In summary, there is an ~8% increase in the number of distinct seeds for this parameter setting (detailed results below). Since this is Dmel, I assume the number of distinct seeds will increase even more on rye and other repetitive genomes.
The trade-off is speed. Replacing no-hash with XXH64 makes indexing about 20% slower (a rough estimate on Dmel). But using XXH64 on all s-mers may be overkill. I think that a simple rolling hash function (https://github.com/bcgsc/ntHash or similar) may be nearly as fast as no-hash while still improving the randomness of s-mers.
The number of distinct seeds may not always translate to better mapping results because of downstream fallback mechanisms (e.g., rescue or trying exact alignment to more candidate sites) in strobealign. However, it should, in principle, lead to better mapping. Also, not improving the number of unique seeds because, e.g., rescue is activated less often, is an indirect hack and a poor argument for keeping unnecessary repetitive seeds.
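For illustration, here is a minimal sketch of open-syncmer selection with hashed s-mer comparison. The names and the fxhash-style stand-in hash are my own; strobealign's actual implementation differs:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Stand-in s-mer hash; the no-hash variant would return x unchanged.
static inline uint64_t smer_hash(uint64_t x) {
    return x * 0x517cc1b727220a95ULL;  // fxhash-style multiply
}

// A k-mer spanning the s-mer codes [i, i + k - s] is an open syncmer when
// the minimal hashed s-mer sits at a fixed offset t within the k-mer
// (t = 0 at one end, or a middle position). `smers` holds the integer
// codes of consecutive s-mers along the sequence.
bool is_open_syncmer(const std::vector<uint64_t>& smers,
                     size_t i, int k, int s, int t) {
    int w = k - s + 1;  // number of s-mers inside the k-mer
    uint64_t best = smer_hash(smers[i]);
    int best_pos = 0;
    for (int j = 1; j < w; ++j) {
        uint64_t h = smer_hash(smers[i + j]);
        if (h < best) { best = h; best_pos = j; }
    }
    return best_pos == t;
}
```

With no-hash, the minimum is taken over the raw s-mer codes, so near-identical regions tend to pick the same positions; a randomizing `smer_hash` is what makes the sampling closer to uniform.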
WITH NO-HASH
Total time indexing: 4.99 s
Unique strobemers: 23201999
Total strobemers count: 26458559
Total strobemers occur once: 22442702
Fraction Unique: 0.97
Total strobemers highly abundant > 100: 1700
Total strobemers mid abundance (between 2-100): 757597
Total distinct strobemers stored: 23201999
Ratio distinct to highly abundant: 13648
Ratio distinct to non distinct: 30
WITH XXH64
Total time indexing: 6.21 s
Unique strobemers: 25025203
Total strobemers count: 28498618
Total strobemers occur once: 24212064
Fraction Unique: 0.97
Total strobemers highly abundant > 100: 1813
Total strobemers mid abundance (between 2-100): 811326
Total distinct strobemers stored: 25025203
Ratio distinct to highly abundant: 13803
Ratio distinct to non distinct: 30
Correction to my previous post: What I observed above was that (i) the number of sampled seeds goes up by ~8%, and (ii) the uniqueness improves slightly (the ratio of distinct to highly abundant seeds goes up from 13648 to 13803).
Now, the theoretical expected density for open syncmers is 1/(k-s+1), which becomes 0.2 for k=20, s=16. The Dmel genome I use is about 146Mb of sequence. From the above numbers I get a density of 0.195 (28.5M/146M) for XXH64 and 0.18 (26.5M/146M) for NO-HASH. So this could suggest that a more random hash function gives more random sampling, which in turn places the sampled density closer to the expected density.
Since I discovered `XXH_INLINE_ALL` (PR #302), I wanted to see whether it can mitigate some of the lost speed when using XXH64 on the s-mers. This seems to be the case: the slowdown is reduced from 17% to 12%. I measured this on Drosophila.
variant | runtime | relative
---|---|---
no-hash | 9.8 s | 100%
XXH64 | 11.5 s | 117%
XXH64 with `XXH_INLINE_ALL` | 10.9 s | 112%
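For context, `XXH_INLINE_ALL` is part of xxHash's public API: defining it before including the header makes all functions static (and thus inlinable), so a small fixed-size call can be inlined at the call site instead of going through a function call per s-mer:

```cpp
// Must be defined before the include; turns the xxHash API into static
// inline functions so calls like XXH64(&x, sizeof(x), 0) can be inlined.
#define XXH_INLINE_ALL
#include "xxhash.h"
```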
I also looked briefly at how accuracy changes, and here is how it looks:
dataset | main | hashed syncmer |
---|---|---|
drosophila-50 | 94.4233 | 94.8962 (+0.4729) |
drosophila-100 | 99.4656 | 99.5262 (+0.0606) |
drosophila-200 | 99.7226 | 99.7272 (+0.0046) |
drosophila-300 | 99.6863 | 99.6865 (+0.0002) |
maize-50 | 95.3226 | 95.8508 (+0.5282) |
maize-100 | 99.4867 | 99.5278 (+0.0411) |
maize-200 | 99.7433 | 99.7461 (+0.0028) |
maize-300 | 99.6957 | 99.6956 (-0.0001) |
CHM13-50 | 94.2214 | 94.9256 (+0.7042) |
CHM13-100 | 99.4495 | 99.5134 (+0.0639) |
CHM13-200 | 99.732 | 99.7406 (+0.0086) |
CHM13-300 | 99.693 | 99.6932 (+0.0002) |
So read length 50 benefits most.
A 12% increase in seeding time in exchange for a 0.5-0.7 percentage point increase in accuracy (maybe even more on rye) - seems like a very beneficial tradeoff?
Do you have any idea of how it affects total runtime? I am thinking that slightly more seeds will also affect memory and runtime in other parts of strobealign.
Regardless, this finally provides 'unbiased' sampling of syncmers, which is good also from a theory perspective and for future work.