rocksdb Java keyExists() has an incorrect behaviour in most cases

The method keyExists() returns correct result only when data exist in a block cache. It seems that key_exists_helper at rocksjni.cc has a bug due to a side-effect of c++ impl of KeyMayExist which changes the state of read_options at https://github.com/facebook/rocksdb/blob/b26b395e0a15255d322be08110db551976188745/db/db_impl/db_impl.cc#L3765

So consequent call of db->Get(read_opts, ...) looks up only through a block cache tier too:

  const bool may_exist =
      db->KeyMayExist(read_opts, cf_handle, key_slice, &value, &value_found); // <--- here in read_opts.read_tier become kBlockCacheTier

  if (may_exist) {
    ROCKSDB_NAMESPACE::Status s;
    {
      ROCKSDB_NAMESPACE::PinnableSlice pinnable_val;
      s = db->Get(read_opts, cf_handle, key_slice, &pinnable_val);  // <---- here we reuse read_opts with read_tier=kBlockCacheTier
    }

UPDATE: probably my assumption on the root cause is wrong. It seems that keyMayExist() just does not work as expected (and advertised in docs):

... That is to say that this method is probabilistic and may return false positives, but never a false negative.

Actually it returns false negative all the time :( it only returns positive result if the data by key is in a data block cache.

Expected behavior

keyExists() should not produce false negative cases

Actual behavior

keyExists() returns false-negative result if data by a requested key is not in a block cache.

Steps to reproduce the behavior

            db.put("someKey".getBytes(), "someValue".getBytes());
            // ...
            // clear a block cache by restarting the app or any other way
            // ...
            assertTrue(db.keyExists("someKey".getBytes()));

Aug 08 '24 09:08 alim-akbashev

Hello @alim-akbashev and thanks for reaching our community,

I double check code and I didn't find any misbehavior. I also wrote small test to verify your concerns.

The roptions.read_tier = kBlockCacheTier; is not problem. As you can see, on line above we do a copy of ReadOptions. Even the read options passed to the function are marked as cons (const ReadOptions& read_options), so it's impossible to change them.(Unless you do some dark C++ magic).

keyMayExist - it's littble bit confusing, let explain in table :

Return value	Key in database
`true`	Yes or No
`false`	Not

Originally this function used BloomFilter, but today it may be replaced with RibbonFilter or for small databases it may use cache.

Radek

Aug 10 '24 11:08 rhubner

hi @rhubner , thanks for your reply

Yea, seems my first assumption on the reason of the behaviour was wrong.

But i still have concerns regarding the current behaviour.

keyMayExist - it's littble bit confusing, let explain in table :

Return value Key in database true Yes or No false Not

If it would be so, that would not be an issue for me. But unfortunately it does not behave so.

I double check code and I didn't find any misbehavior. I also wrote small test to verify your concerns.

I believe, the test will fail if we add re-opening of a db between put and keyExists/keyMayExist (to clear all kind of possible caches). At least exactly such behaviour i do see using a java code - keyExists/keyMayExist returns false when the key does exist in database. But if I call get(key) first, keyExists/keyMayExist starts returning a correct result true for this key - my guess is that happens because the data by key (or just an index block?) is in a block cache at that moment.

Aug 10 '24 12:08 alim-akbashev

Hi @alim-akbashev ,

believe, the test will fail if we add re-opening of a db between put and keyExists/keyMayExist

That's what I did

Feel free to update the test. Maybe I missed something. I also try to compact DB or flush WAL, but it didn't have any effect on the test.

Radek

Aug 10 '24 12:08 rhubner

yea, i'm so sorry, @rhubner , my bad i didn't double-check the steps-to-reproduce.

it turned out, that to reproduce 1. it's required to fill a db with much more data 2. it's required to restart the app, just closing and opening the db again does not shows the issue. i'm still not sure why it happens and why these conditions are required, this time i won't make any assumptions :)

anyways, i've created a demo projects which 100% reproduces the issue (at least in my environment) https://github.com/alim-akbashev/keyExistsBugDemo

it consist of two mini-apps:

FillDbApp creates a db at /tmp/rocksdb-test, puts a test key + 5M other random keys
TryTheKeyApp calls keyExists() on a test key

hope it helps in investigation.

Aug 10 '24 14:08 alim-akbashev

Hello @alim-akbashev ,

thanks for the example. I can replicate your behavior on my machine.

I think it's the key_maye_exist what return wrong result. But this happen only for TtlDB. For normal RocksDB I'm getting correct results. Maybe different implementations? Let me investigate more.

Radek

Aug 11 '24 10:08 rhubner

Hello @alim-akbashev ,

I did some debbuging over weeked and today and move little bit further. I will be talking primary about DBWithTTLImpl::KeyMayExist

The first call db_->KeyMayExist return true for key which is in DB, but then the second if statement fail. See bellow the output of two calls :

ret : 1
value_found : 1
value.size : 0
SanityCheckTimestamp : Corruption: Error: value's length less than timestamp's
StripTS(value) : Corruption: Bad timestamp in key-value

What's strange, that value_found is 1(true) but there are no data in value. I still don't know why this is happening, but I will continue investigating.

Radek

Aug 12 '24 13:08 rhubner

hello @rhubner , thanks for your efforts.

here's a quick update from my side - I've updated the demo app, now it contains only settings which really affect keyExists/keyMayExist behaviour + now it is a single app: https://github.com/alim-akbashev/keyExistsBugDemo/blob/main/src/main/java/akbashev/rocksdb/MinimalStepsToReproduceApp.java :

the only option which makes keyExists fail is .setCacheIndexAndFilterBlocks(true)
it does not depend on amount of data in a table - now the demo app reproduces the issue with a single record in a db. Millions records were required to trigger a compaction, now it's triggered by the code instead.
checking key existence with another process is not required as well - re-opening with a fresh instance of LRUCache is enough

So now I more or less certain that the behaviour depends on the index and if a db is configured to place indices in a block cache instead of heap (am I right that by default a bloom filter and an index are at heap and they are preloaded during opening a db?), keyMayExist does not trigger loading the corresponding index to a cache as get() does. Probably it's understandable since keyMayExists intended to be a lightweight alternative for get() with zero IO. If so, maybe it's rather an undocumented specific limitation than a bug...

Thanks, Alim

Aug 12 '24 13:08 alim-akbashev

Hello @alim-akbashev ,

I think I found it. In DBImpl::KeyMayExist we always set *value_found = true;.

Then at the end, you can see that the method return true, even when it's status is IsIncomplete.

I think the method should have something like this at the end :

 if(value_found != nullptr && s.IsIncomplete()) {
    *value_found = false;
  }

or line *value_found = true; should be removed.

@pdillinger What do you think about it ? Unfortunately I'm not very familiar with this part of RocksDB codebase. After the proposed change, ttl_test and bloom_test can run without problem.

Radek

cc: @adamretter

Aug 12 '24 14:08 rhubner