xxhash
xxhash copied to clipboard
Library creates conflicts for very short strings
I just ran into an issue with some example data where I got like 64 conflicting entries in a sample of about 810.000. This seems to be way too much for what I expected so my questions:
- Is there a minimum string size for this hash function?
- Is there a general problem with the algorithm for very short strings?
Here one of my examples with conflicting identical hashes for different strings:
const hash1 = xxHash32('10-10020567-549308', 0).toString(16) // --> b25eede2
const hash2 = xxHash32('20-20384387-548563', 0).toString(16) // --> b25eede2
@Trinidus,
That is to be expected given that the hash is only 32 bits.
Here are a few discussions:
- https://github.com/Cyan4973/xxHash/issues/165
- https://xxhash.com/
- https://stackoverflow.com/questions/45788721/where-can-i-find-xxhash64-and-md5-collision-probability-statistics
Given that xxHash32 is only 32bit, the probability becomes something like:
p(k) = 1 - exp(-k(k-1) / (2 * 2^32))
From the graph, anything over 100k values will have more than a 50% chance of a collision.