Handling collisions

Open kloetzl opened this issue 4 years ago • 0 comments

This might be more of an inquiry than an issue. I am currently interested in how two sketches are compared. Judging from the code, the hashes are compared, not the underlying strings. I find that curious, as collisions are unlikely but not impossible. There is also the following comment in the schema:

the k-mers that correspond to the hashes in 'hashes' (in the same order), used mainly for confirming the hash function and not necessarily valid for Jaccard estimates due to potential hash collisions.
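To put a rough number on "unlikely but not impossible", here is a minimal sketch of the standard birthday-bound approximation. It assumes an idealized, uniformly distributed hash; the function name `collision_probability` and the example hash widths are illustrative assumptions, not something taken from the Mash code.

```python
import math


def collision_probability(n_kmers: int, hash_bits: int) -> float:
    """Approximate chance that at least two of n distinct k-mers share a hash.

    Birthday-bound approximation for an idealized uniform hash:
    p ~= 1 - exp(-n * (n - 1) / 2^(bits + 1)).
    """
    pairs = n_kmers * (n_kmers - 1) / 2
    # expm1 keeps precision when the probability is very small.
    return -math.expm1(-pairs / 2.0 ** hash_bits)


if __name__ == "__main__":
    # Hypothetical sizes: one million distinct k-mers, two hash widths.
    for bits in (32, 64):
        print(bits, collision_probability(10**6, bits))
```

Under that assumption, a collision somewhere in the full k-mer set depends heavily on the hash width, although only collisions that land inside the retained sketch can affect the estimate.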

I would have guessed that Mash uses MinHash to pick a random but deterministic, representative sample of k-mers and then compares the actual strings. However, that is not how it is implemented. There must be something I am missing.
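To make the two alternatives concrete, here is a minimal sketch of a bottom-s MinHash comparison that can count matches either by hash alone or by hash plus the underlying k-mer string. Everything in it is an assumption for illustration: `blake2b` stands in for whatever hash Mash actually uses, and `sketch`, `jaccard`, and `check_strings` are hypothetical names, not Mash's API.

```python
import hashlib


def default_hash(kmer: str) -> int:
    """64-bit stand-in hash (assumption: not the hash function Mash uses)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def sketch(seq: str, k: int, s: int, hash_fn=default_hash) -> dict:
    """Return the s smallest hashes of seq's k-mers, mapped to their k-mer."""
    table = {}
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        # On an in-sequence collision, keep the first k-mer seen.
        table.setdefault(hash_fn(kmer), kmer)
    return {h: table[h] for h in sorted(table)[:s]}


def jaccard(a: dict, b: dict, s: int, check_strings: bool = False) -> float:
    """Estimate Jaccard from the s smallest hashes of the merged sketches."""
    merged = sorted(set(a) | set(b))[:s]
    shared = 0
    for h in merged:
        if h in a and h in b:
            # Optionally require the k-mer strings to agree as well.
            if not check_strings or a[h] == b[h]:
                shared += 1
    return shared / len(merged) if merged else 0.0


if __name__ == "__main__":
    # A deliberately degenerate hash forces a collision between two
    # different k-mers, so the two comparison modes disagree.
    constant_hash = lambda kmer: 0
    x = sketch("AAAAAAAA", 4, 5, constant_hash)
    y = sketch("CCCCCCCC", 4, 5, constant_hash)
    print(jaccard(x, y, 5), jaccard(x, y, 5, check_strings=True))
```

With a well-behaved 64-bit hash the two modes should agree almost always; the string check only changes the result when two different k-mers collide inside the sketches.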

I hope that someone can clear up my confusion.

Also, happy new year! :four_leaf_clover: :fireworks:

kloetzl, Dec 31 '19 11:12