bounter icon indicating copy to clipboard operation
bounter copied to clipboard

Why not use xx hash?

Open beviah opened this issue 3 years ago • 7 comments
trafficstars

https://github.com/Cyan4973/xxHash

XXH128 seems to be superior in every way to MurmurHash3_x86_32 currently used.. https://github.com/RaRe-Technologies/bounter/blob/8c83e957b072a50be37221dcec1177fbe0128faf/cbounter/cms_common.c

I am also a bit surprised by the perfect F1 scores given the high collision probability of a 32 bit hash. https://preshing.com/20110504/hash-collision-probabilities/

beviah avatar Feb 23 '22 15:02 beviah

I vaguely remember discussion around alternative hashing schemes in the beginning. I don't remember the details but my guess is murmur won because of availability, portability & familiarity.

Which doesn't mean we couldn't revisit that decision :) Needs a strong contributor though, not only to implement the change but also test it, make sure packaging & distribution continues to work, etc. Are you up for it? What motivated your investigation here?

piskvorky avatar Feb 23 '22 15:02 piskvorky

I don't know C, so would be hard for me to implement it.

I need approximate counters for truly large number of events, billions.. which yields 32bit hashes unusable..

beviah avatar Feb 23 '22 15:02 beviah

Billions doesn't strike me as super large. What is the "business cost" of a collisions – how much do collisions / approximate counts matter?

piskvorky avatar Feb 23 '22 16:02 piskvorky

But it seems that even hundreds of thousands represent an issue for 32bit hash: https://preshing.com/images/small-probabilities.png

There is not much cost to collision as long as the counter does not get saturated and start colliding all the time.. as I assume would happen based on the table from above link.

beviah avatar Feb 23 '22 16:02 beviah

Maybe I wasn't clear enough, I don't mean billions of events, but billions of unique events :) Hundreds of millions at least..

beviah avatar Feb 23 '22 16:02 beviah

As for the motivation, wanted to compare performance with this library: https://github.com/mikrosk/py-probstructs

beviah avatar Feb 23 '22 16:02 beviah

I may have interpreted that table in wrong way... seems in real time I will be getting collisions every few seconds which is not terrible.. and that this rate does not go up... which is counterintuitive actually..

Just did a calculation based on https://en.wikipedia.org/wiki/Birthday_problem#Number_of_people_with_a_shared_birthday

with 100 million unique events there will be 2.3 million collisions. 2.3%

that is still too much for my use case..

n.b. there should be more than 10k collisions in wikipedia dataset.. so not sure how you get 100% F1 score.

beviah avatar Feb 23 '22 20:02 beviah