libfilter
libfilter copied to clipboard
split block bloom filters implementation
While i am going through some amazing work you are doing , wanted to get your thoughts on some implementation details or caveats
IIUC SBBF is same as in https://github.com/apache/kudu/blob/master/src/kudu/util/block_bloom_filter.h ? Are there any implementation difference or caveats in your particular approach (any significant perf difference) ? Bit confused around terminology here between block
and split block
so wanted to understand better if your implementation is something additional on top of classical blocked bloom filter.
I found your repo through : https://arxiv.org/pdf/2101.01719.pdf , it says for the split block bloom fpp are somewhat fixed due to pre-determined number of hash functions used . In your repo example is see double fpp = 0.0065;
, is this because we are using a fixed number of hashes and bits per value (pre-configured at the given fpp level) . Am i able to tune this for lower fpp by using more hash functions and may be more bits per value ? What is the current default bits per value set in current SBBF implementation ? which parameter is that ?
In the apache kudu project which uses block bloom filter , here : https://github.com/apache/kudu/blob/master/src/kudu/util/block_bloom_filter.h#L66 it says we are able to get 0.1% fpp with 15 bits per value , is it possible to achieve this in your split block BF implementation as well , or are we stuck to 0.4% - 19% range only ,
thanks for bearing with my naive questions as i am new into this area and still trying to figure out best filter for my case where i have strict latency requirement and need to insert ~10-100 million ids , these ids are not uuid but sequential range [0-100M] , so was trying to use roaring bitmaps before but now looking into this probabilistic filters since constructing filters is a potential bottleneck for me .
There is no difference between the SBBF in this repo and Kudu's; I'm the author of both.
You can tune the fpp by setting any space you like; there is no default space bits per value, though there is a fixed hash function parameter of 8 per value, which cannot be tuned. You can use libfilter_block_bytes_needed
or MinSpaceNeeded
or BytesNeeded
if you want to find out how much space is needed for a certain number of values and certain fpp. If you know how much space you need, you can construct a filter with CreateWithBytes
or libfilter_block_init
.