bounter
bounter copied to clipboard
Why not use xx hash?
https://github.com/Cyan4973/xxHash
XXH128 seems to be superior in every way to MurmurHash3_x86_32 currently used.. https://github.com/RaRe-Technologies/bounter/blob/8c83e957b072a50be37221dcec1177fbe0128faf/cbounter/cms_common.c
I am also a bit surprised by the perfect F1 scores given the high collision probability of a 32 bit hash. https://preshing.com/20110504/hash-collision-probabilities/
I vaguely remember discussion around alternative hashing schemes in the beginning. I don't remember the details but my guess is murmur won because of availability, portability & familiarity.
Which doesn't mean we couldn't revisit that decision :) Needs a strong contributor though, not only to implement the change but also test it, make sure packaging & distribution continues to work, etc. Are you up for it? What motivated your investigation here?
I don't know C, so would be hard for me to implement it.
I need approximate counters for truly large number of events, billions.. which yields 32bit hashes unusable..
Billions doesn't strike me as super large. What is the "business cost" of a collisions – how much do collisions / approximate counts matter?
But it seems that even hundreds of thousands represent an issue for 32bit hash: https://preshing.com/images/small-probabilities.png
There is not much cost to collision as long as the counter does not get saturated and start colliding all the time.. as I assume would happen based on the table from above link.
Maybe I wasn't clear enough, I don't mean billions of events, but billions of unique events :) Hundreds of millions at least..
As for the motivation, wanted to compare performance with this library: https://github.com/mikrosk/py-probstructs
I may have interpreted that table in wrong way... seems in real time I will be getting collisions every few seconds which is not terrible.. and that this rate does not go up... which is counterintuitive actually..
Just did a calculation based on https://en.wikipedia.org/wiki/Birthday_problem#Number_of_people_with_a_shared_birthday
with 100 million unique events there will be 2.3 million collisions. 2.3%
that is still too much for my use case..
n.b. there should be more than 10k collisions in wikipedia dataset.. so not sure how you get 100% F1 score.