dablooms
dablooms copied to clipboard
Force counts_per_func to nearest prime > counts_per_func ?
In the paper referenced for the bloom filter hashing algorithm used (https://github.com/bitly/dablooms/blob/master/src/dablooms.c#L185) (Kirsch, Mitzenmacher [2006]) - the recommended algorithm is given by:
gi(x)=h1(x)+ih2(x)(mod p)
From my reading it seems that their theoretical estimate for collision probabilities will only hold provided that counts_per_func is prime. It is also notable that 64 bits of entropy from MurmurHash3_x64_128 is discarded without being used. I'm not sure how significant the difference is on average, but it may make sense to either: a) force counts_per_func to the nearest prime greater than counts_per_func or b) use a triple hashing algorithm taking advantage of the unused entropy provided by murmur hash