
Unable to get all records while traversing database

Open hemin1020 opened this issue 2 years ago • 6 comments

Note: Please use Issues only for bug reports. For questions, discussions, feature requests, etc. post to dev group: https://groups.google.com/forum/#!forum/rocksdb or https://www.facebook.com/groups/rocksdb.dev

Expected behavior

We have written millions of records into the database, and these records are prefixed with 'xxx' or 'yyy', such as: 'xxx0', 'xxx1', 'yyy0', 'yyy1', etc. In addition, we have customized the comparator to ensure that the records prefixed with 'xxx' are ranked before 'yyy'. In the case of the same prefix, the smaller number is ranked first.

When we iterate in the following way, we should be able to get all the records prefixed with 'xxx':

rocksdb::Iterator* iter = db->NewIterator(rocksdb::ReadOptions());
for (iter->Seek("xxx"); iter->Valid() && iter->key().starts_with("xxx"); iter->Next()) {
    // print key and value
}

Actual behavior

In fact, the records between 'xxx100' and 'xxx1001' are missing from the result of the traversal. But if we specify access to 'xxx105' in the following way, we can get the corresponding value:

std::string value;
rocksdb::Status s = db->Get(rocksdb::ReadOptions(), "xxx105", &value);

Steps to reproduce the behavior

  1. Define the comparator and write prefixed data; we wrote about four million records.
  2. Traverse the database by seeking to the prefix.

The above anomalies are occasional.

In addition, is there an easy way to replace the comparator in the ldb tool? Currently, I cannot view the database we wrote using ldb. The error message I get is as follows: Failed: Invalid argument: leveldb.BytewiseComparator: does not match existing comparator

hemin1020 avatar Aug 09 '22 08:08 hemin1020

What version of RocksDB are you using?

When you say the anomalies are "occasional", do you mean that only certain keys are missing or that sometimes a given iterator will return the keys and another one might not? If the latter, is compaction/flush happening when the problem is observed?

Are the keys that are missing in the middle of a range (e.g missing xxx103 but say xxx102 and xxx104) or at the beginning or end?

mrambacher avatar Aug 09 '22 13:08 mrambacher


The version we are using is 6.19.

By "occasional" we mean that after we found the traversed data was partially missing, we repeatedly tried to reproduce the problem with the following steps:

  1. Open a brand new database.
  2. Write about 100,000 records to it, of course, these records are also prefixed with 'xxx' and 'yyy'.
  3. Traverse the database by prefix 'xxx'.

Everything works fine when we iterate through the new database. But when it comes to the problematic database, we always lose data between xxx100 and xxx1001. This range has never changed.

These missing keys are in the middle.

hemin1020 avatar Aug 10 '22 07:08 hemin1020

Everything works fine when we iterate through the new database. But when it comes to the problematic database, we always lose data between xxx100 and xxx1001.

Does the issue persist after re-opening the problematic db?

In addition, is there an easy way to replace the comparator in the ldb tool. Currently, I cannot directly view the database written by myself using ldb. The error message I get is as follows: Failed: Invalid argument: leveldb.BytewiseComparator: does not match existing comparator

Since the user-defined comparator is code, there's no easy way to provide it on the command line (you would have to modify the ldb code). But you should still be able to dump the contents of each SST file with ldb dump or sst_dump.
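For reference, both tools can read individual files by path, bypassing the DB-wide comparator check. A sketch of the invocations (the file paths here are hypothetical placeholders):

```shell
# Scan a single SST file directly; sst_dump does not open the whole DB,
# so the comparator mismatch error does not apply.
sst_dump --file=/path/to/db/000123.sst --command=scan --output_hex

# ldb can also dump an individual SST file by path:
ldb dump --path=/path/to/db/000123.sst

# ...and a WAL file:
ldb dump_wal --walfile=/path/to/db/000124.log --header
```

Note that sst_dump scans keys in the order they are stored in the file, so the output reflects the custom comparator's ordering, not bytewise ordering.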

jay-zhuang avatar Aug 11 '22 02:08 jay-zhuang

Does the issue persist after re-opening the problematic db?

Yes, even if we reopen the database for traversal, there is still no data between xxx100 and xxx1001.

Since the user-defined comparator is code, there's no easy way to provide it on the command line (you would have to modify the ldb code). But you should still be able to dump the contents of each SST file with ldb dump or sst_dump.

We used ldb dump and sst_dump to scan the *.log and *.sst files, and found all the keys.

hemin1020 avatar Aug 11 '22 06:08 hemin1020

The attachment contains our data and a test tool we wrote ourselves; see README.md for usage. db_related.zip

hemin1020 avatar Aug 15 '22 02:08 hemin1020

@jay-zhuang @ajkr Is there any way to investigate this issue? We are willing to provide more detailed information.

hemin1020 avatar Aug 19 '22 07:08 hemin1020

@hemin1020 : regarding this custom Comparator code you shared...

  int Compare(const rocksdb::Slice& a, const rocksdb::Slice& b) const override {
    // "extentX" < "streamX"
    if (a[0] > b[0]) {
      return -1;
    } else if (a[0] == b[0]) {
      // "extentX" < "extentY"
      // "streamX" < "streamY"
      rocksdb::Slice aa(a);
      rocksdb::Slice bb(b);
      aa.remove_prefix(6);
      bb.remove_prefix(6);
      size_t left = atol(aa.data());
      size_t right = atol(bb.data());
      if (left == right) {
        return 0;
      } else if (left < right) {
        return -1;
      }
      return 1;
    }
    return 1;
  }

There may be a few issues with this code. First of all, atol expects a null-terminated string, but a rocksdb::Slice is not necessarily null-terminated (in the majority of cases it won't be). So atol may actually read past the bounds of the Slice, causing undefined behavior.

Second, there is an unconditional aa.remove_prefix(6) in the comparator. This is fine if all keys are at least 6 bytes long. However, if you pass xxx as the start key for a Seek operation, it is just 3 bytes, and remove_prefix(6) then causes undefined behavior. I can't say for sure that undefined behavior is happening in your particular case, but it may be worth asserting in the comparator that the input Slices a and b are at least 6 bytes long. It would also be useful to add some protection against atol reading past the bounds.

Finally, atol produces a long, which may or may not be large enough to hold the numeric value of the digits in your keys.
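To illustrate, a sketch of comparator logic that addresses these three points. It uses std::string_view in place of rocksdb::Slice to stay self-contained, and assumes keys are a 6-byte alphabetic prefix followed by decimal digits (as in the code above); it is not the actual fix, just one way to avoid the undefined behavior:

```cpp
#include <algorithm>
#include <charconv>
#include <cstdint>
#include <string_view>

// Compares keys of the assumed form "<6-byte prefix><decimal digits>",
// e.g. "extent123" or "stream7". Returns <0, 0, or >0.
int CompareKeys(std::string_view a, std::string_view b) {
  constexpr size_t kPrefixLen = 6;
  // Guard against short keys (e.g. a bare seek target) instead of
  // unconditionally removing a 6-byte prefix.
  std::string_view ap = a.substr(0, std::min(a.size(), kPrefixLen));
  std::string_view bp = b.substr(0, std::min(b.size(), kPrefixLen));
  if (int c = ap.compare(bp); c != 0) {
    return c < 0 ? -1 : 1;
  }
  std::string_view as =
      a.size() > kPrefixLen ? a.substr(kPrefixLen) : std::string_view{};
  std::string_view bs =
      b.size() > kPrefixLen ? b.substr(kPrefixLen) : std::string_view{};
  // std::from_chars parses strictly within [first, last), so unlike atol on
  // a non-null-terminated buffer it cannot read past the end of the slice.
  // uint64_t avoids the platform-dependent width of long.
  uint64_t left = 0, right = 0;
  std::from_chars(as.data(), as.data() + as.size(), left);
  std::from_chars(bs.data(), bs.data() + bs.size(), right);
  if (left == right) return 0;
  return left < right ? -1 : 1;
}
```

With this shape, numeric ordering within a prefix still holds ("extent100" sorts before "extent1001"), and a 3-byte seek target like "xxx" is handled without reading out of bounds.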

When in doubt, I also recommend building with ASan/UBSan to check whether they report any such issues. Hope this helps.
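For RocksDB's own Makefile build, the sanitizer builds can be enabled via environment flags; a sketch (build targets are the standard ones, adjust to your setup):

```shell
# AddressSanitizer build and tests
COMPILE_WITH_ASAN=1 make check -j8

# UndefinedBehaviorSanitizer build and tests
COMPILE_WITH_UBSAN=1 make check -j8
```

Running your own reproduction tool against a RocksDB library built this way should surface any out-of-bounds reads in the comparator.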

jsteemann avatar Sep 27 '22 17:09 jsteemann