strobealign icon indicating copy to clipboard operation
strobealign copied to clipboard

The index: 32bit hashes instead of 64bit

Open ksahlin opened this issue 2 years ago • 4 comments

Note to developer:

Using 32-bit hashes may reduce both vector and hash table size. For human, it is expected to reduce peak memory from 32Gb (64bit) down to around 20-24Gb

At the final hash value computation h1/2+h2/2 when the seed has been decided, simply hash this 64 bit down to 32 bit with wyhash or XXH32.

This will result in more collisions on large genomes though as there are only 4.2billion unique slots. For human, we store about 450M unique values, but for larger genomes this may be a problem (can this be solved with 32/64 bit template?).

I see quite a reduction in both space and alignment time if we change to 32bit hash value (on a small dataset!).

ksahlin avatar Jun 01 '22 10:06 ksahlin

Turning the hash type into a template parameter would indeed be the way to go.

One question: What happens at the moment when there’s a hash collision? At which time is that resolved?

marcelm avatar Dec 19 '22 11:12 marcelm

As for true randstrobe hash collisions where the underlying strobemers are different, they will be noticed at the NAM stage. Specifically, when parsing out the leftmost and rightmost position of a strobe in the NAM. We do a sanity check that these strobes match the reference exactly. If they don't, it is either of the following:

  1. false symmetrical matches. The paper discusses this as a feature. It is basically that (s1, s2) will get the same hash value as (s2,s1). This is neccessary when masking seeds, and also turns out to be help a tad with sensitivity. See section "Modifications to strobemers" in paper.
  2. A 'true' hash collision from completely different strobes, i.e. (s1+s2)=(s3+s4).
  3. If not 1 or 2, it is some other bug..

In all the above cases, the procedure is to check for case 1 by reversing the NAM and check if it fits the reference, if it does, it was because of false symmetrical matches. This is taken care of in function reverse_nam_if_needed. If this is not the case, give up.

ksahlin avatar Dec 19 '22 12:12 ksahlin

If you mean hash collisions in the actual hash table implementation, I don't know how they are resolved under the hood.

For the ransdtrobe hash collisions, here are some example stats for mapping 10M PE reads to rye, where Did not fit strobe start site are logging these likely hash collisions.

Total mapping sites tried: 100327591
Did not fit strobe start site: 27528

ksahlin avatar Dec 19 '22 12:12 ksahlin

Thanks, right, there was something about slightly better sensitivity, I remember now. I did mean the randstrobe collisions.

The mechanisms for resolving collisions in the hash table are what makes hash table implementations different. A hash table as we use it is just an array (vector). With linear probing, if the hash function h says to use slot h(key) and that one is already filled, slots h(key)+m, h(key)+2m etc. are tried (for some constant m) until an empty slot is found. This has some disadvantages (for example, things tend to cluster and then for each newly insertion that hits a cluster, even more probing is necessary). Some of this can be avoided with quadratic probing, where the probe sequence is h(key), h(key)+1²m, h(key)+2²m, h(key)+3²m etc.

Robin hood hashing is a different method that can move things around on collisions. As I understood it, a cost is associated with each item describing how much probing is necessary to retrieve it. If a new entry needs to be inserted that collides with existing ones, the items are shuffled around, modifying their costs so that one "takes from the rich and gives to the poor" (hence Robin Hood).

marcelm avatar Dec 19 '22 13:12 marcelm