A little confused about the bloom filter
In the code of bloom.cc, I found that the following code to create bloom filter:
// Compute bloom filter size (in both bits and bytes)
size_t bits = n * bits_per_key_;
// For small n, we can see a very high false positive rate. Fix it
// by enforcing a minimum bloom filter length.
if (bits < 64) bits = 64;
size_t bytes = (bits + 7) / 8;
bits = bytes * 8;
const size_t init_size = dst->size();
dst->resize(init_size + bytes, 0);
dst->push_back(static_cast<char>(k_)); // Remember # of probes in filter
This means that the bloom filter should be at least has a size of 9.
But in the later code in bool KeyMayMatch(const Slice& key, const Slice& bloom_filter) const override, I found the following code:
const size_t len = bloom_filter.size();
if (len < 2) return false;
I guess that these are used to test the completeness of the bloom filter. But I wonder that should the length be 9? Or maybe I mistakenly understood them?
I am eager for an answer! And I am appreciate if you can help me! Thanks!
@mumupika I think that snippet is independent of the "enforcing a minimum bloom filter length" part.
if (len < 2) return false;
is based on the observation from the function void CreateFilter(const Slice* keys, int n, std::string* dst) tha a valid resulting bloom_filter string will be at least 2 bytes. (1 byte minimum for some data, 1 byte for the # of probes in k_). So if len < 2 it must mean it's an empty bloom filter, so we can just return false.
But what you proposed, "bloom filter should be at least has a size of 9" looks correct to me in current version. I guess the leveldb folks might design with backward-compatibility in mind. Say what if in previous version there is no such "bumping up to at least 9 bytes" code snippet there, and some old leveldb is created with some bloom filter with just len = 3 or 6 or 7, etc. It still needs to run correctly.
Hello @YukunJ @mumupika, would love to work on this issue.
From my understanding, the CreateFilter method sets a minimum bloom filter size of 64 bits (8 bytes), plus 1 byte to store the number of probes (k_), making the minimum filter size 9 bytes.
However, the KeyMayMatch() function returns false if len < 2. I believe this behavior is meant to support backward compatibility, possibly for filters that were smaller than 9 bytes in earlier versions.
I plan to either
-
Clearly document this behavior in code comments, or
-
Add proper handling and tests for filters with lengths less than 9 bytes, if needed.
Please let me know if it’s okay for me to go ahead with this!