Advice on choosing bloom filter parameters
I have 863,049,256 keys whose format is 34 character alphanumeric:
Z6DauUBtzw8p77MLxy7VWYCKN92JfKiCK
AJe8BJYnJHp9DDUrvGgFwmn5oBjhUSgowr
114VNKGsr9M4ogvwjz6ESNUqYdroGyht7r
etc..
Do you have any recommendations on bloom filter parameters? In particular these:
/* k-mer size */
const unsigned k = //?;
/* number of Bloom filter hash functions */
const unsigned numHashes = //?;
/* size of Bloom filter (in bits) */
const unsigned size = //?;
It takes a long time for me to initialize the filter, so exerpimenting with different configs takes time.
Curious if you have any insight into a good starting configuration.
Oh wait. Does this data structure only work with DNA? haha
The input k-mers are hashed with ntHash. So, they should be nucleotide sequences, which are strings of A, C, G, T, and U.
The Bloom filter can technically be used by itself as BloomFilter.hpp doesn't actually include anything from ntHash. The only weird thing is that we store the k-mer size when the datastructure is serialized but it can technically just be set to anything and not used. The BloomFilter.hpp by itself only uses hash values directly when inserting/querying (doesn't come with its own hash function).
To parameterize the bloom filter I would use this function after you pick a false positive rate (FPR) to get the optimal number of hashes (optimal relative to memory usage). https://github.com/bcgsc/btl_bloomfilter/blob/0f19567306806bc3890e31f3f3d01285b9b4d158/BloomFilter.hpp#L419 And this constructor once you get the recommend number of hash functions. https://github.com/bcgsc/btl_bloomfilter/blob/0f19567306806bc3890e31f3f3d01285b9b4d158/BloomFilter.hpp#L83