leveldb icon indicating copy to clipboard operation
leveldb copied to clipboard

A little confused about the bloom filter

Open mumupika opened this issue 10 months ago • 2 comments

In the code of bloom.cc, I found that the following code to create bloom filter:

    // Compute bloom filter size (in both bits and bytes)
    size_t bits = n * bits_per_key_;

    // For small n, we can see a very high false positive rate.  Fix it
    // by enforcing a minimum bloom filter length.
    if (bits < 64) bits = 64;

    size_t bytes = (bits + 7) / 8;
    bits = bytes * 8;
    const size_t init_size = dst->size();
    dst->resize(init_size + bytes, 0);
    dst->push_back(static_cast<char>(k_));  // Remember # of probes in filter

This means that the bloom filter should be at least has a size of 9. But in the later code in bool KeyMayMatch(const Slice& key, const Slice& bloom_filter) const override, I found the following code:

const size_t len = bloom_filter.size();
if (len < 2) return false;

I guess that these are used to test the completeness of the bloom filter. But I wonder that should the length be 9? Or maybe I mistakenly understood them?

I am eager for an answer! And I am appreciate if you can help me! Thanks!

mumupika avatar Mar 11 '25 13:03 mumupika

@mumupika I think that snippet is independent of the "enforcing a minimum bloom filter length" part.

if (len < 2) return false;

is based on the observation from the function void CreateFilter(const Slice* keys, int n, std::string* dst) tha a valid resulting bloom_filter string will be at least 2 bytes. (1 byte minimum for some data, 1 byte for the # of probes in k_). So if len < 2 it must mean it's an empty bloom filter, so we can just return false.

But what you proposed, "bloom filter should be at least has a size of 9" looks correct to me in current version. I guess the leveldb folks might design with backward-compatibility in mind. Say what if in previous version there is no such "bumping up to at least 9 bytes" code snippet there, and some old leveldb is created with some bloom filter with just len = 3 or 6 or 7, etc. It still needs to run correctly.

YukunJ avatar Apr 05 '25 17:04 YukunJ

Hello @YukunJ @mumupika, would love to work on this issue.

From my understanding, the CreateFilter method sets a minimum bloom filter size of 64 bits (8 bytes), plus 1 byte to store the number of probes (k_), making the minimum filter size 9 bytes.

However, the KeyMayMatch() function returns false if len < 2. I believe this behavior is meant to support backward compatibility, possibly for filters that were smaller than 9 bytes in earlier versions.

I plan to either

  • Clearly document this behavior in code comments, or

  • Add proper handling and tests for filters with lengths less than 9 bytes, if needed.

Please let me know if it’s okay for me to go ahead with this!

Gyan-max avatar Apr 07 '25 07:04 Gyan-max