rocksdb icon indicating copy to clipboard operation
rocksdb copied to clipboard

IngestExternalFile() breaks GetUpdatesSince() due to "gap in sequence numbers"

Open jsteemann opened this issue 3 years ago • 13 comments

When taking a database snapshot and then ingesting .sst files with the SstFileWriter/IngestExternalFile() call, RocksDB seems to bump the WAL sequence number because of the newly ingested files. This has the consequence that tailing the WAL file via GetUpdatesSince() can stop working because there can be gaps in the WAL sequence numbers.

The problem can be reproduced with the following sequence of actions: 0. Grab the initial sequence number

  1. Put a key into the database. Then calling GetUpdatesSince() with the initial sequence number works fine.
  2. Take a snapshot
  3. Ingest an external file. This bumps the sequence number up.
  4. Release the snapshot
  5. Put another key into the database.
  6. Try calling GetUpdatesSince() again for the initial sequence number. This now returns "Corruption: Gap in sequence numbers".

It seems to work fine when not taking a snapshot, but I am not 100% clear if there aren't more cases in which an external file's ingestion bumps up sequence numbers. The code in db_impl.cc to bump the sequence numbers is:

      int consumed_seqno_count =
          ingestion_jobs[0].ConsumedSequenceNumbersCount();
      for (size_t i = 1; i != num_cfs; ++i) {
        consumed_seqno_count =
            std::max(consumed_seqno_count,
                     ingestion_jobs[i].ConsumedSequenceNumbersCount());
      }    
      if (consumed_seqno_count > 0) { 
        const SequenceNumber last_seqno = versions_->LastSequence();
        versions_->SetLastAllocatedSequence(last_seqno + consumed_seqno_count);
        versions_->SetLastPublishedSequence(last_seqno + consumed_seqno_count);
        versions_->SetLastSequence(last_seqno + consumed_seqno_count);
      }

Now that external file ingestion completely destroys WAL tailing via GetUpdatesSince(), I am wondering if I am just missing something obvious, if this is just unsupported or major bug.

Expected behavior

Ingesting external files should not render WAL tailing via GetUpdatesSince() unusable.

Actual behavior

Ingesting external files renders WAL tailing via GetUpdatesSince() unusable, at least when a snapshot has been taken. Probably also under additional conditions.

Steps to reproduce the behavior

Apply the following diff to db/import_column_family_test.cc and run the test. The test will fail with the "gap in sequence numbers" error when acquiring the snapshot. When commenting out the snapshot lines from the test, the test will work.

diff --git a/db/import_column_family_test.cc b/db/import_column_family_test.cc
index e459e935e..1e0e53281 100644
--- a/db/import_column_family_test.cc
+++ b/db/import_column_family_test.cc
@@ -69,6 +69,102 @@ class ImportColumnFamilyTest : public DBTestBase {
   ExportImportFilesMetaData* metadata_ptr_;
 };
 
+TEST_F(ImportColumnFamilyTest, ImportSSTFileWriterFilesAndThenCallGetUpdatesSince) {
+  Options options = CurrentOptions();
+  CreateAndReopenWithCF({"koko"}, options);
+
+  struct WALReplayer : public WriteBatch::Handler {
+    size_t count = 0;
+
+    Status PutCF(uint32_t /*column_family_id*/, const rocksdb::Slice& /*key*/,
+                 const rocksdb::Slice& /*value*/) override {
+      ++count;
+      return Status::OK();
+    }
+
+    size_t StealCount() {
+      size_t result = count;
+      count = 0;
+      return result;
+    }
+  };
+   
+  auto consume_iterator = [](TransactionLogIterator& iter) -> size_t {
+    WALReplayer replayer;
+
+    EXPECT_OK(iter.status());
+    size_t count = 0;
+    // iterator will fail in here because there are gaps in sequence numbers
+    while (iter.Valid()) {
+      BatchResult batch = iter.GetBatch();
+      Status s = batch.writeBatchPtr->Iterate(&replayer);
+      count += replayer.StealCount();
+      iter.Next();
+    }
+    EXPECT_OK(iter.status());
+    return count;
+  };
+
+  // track sequence number before any data has been added
+  SequenceNumber initial_seqno = db_->GetLatestSequenceNumber();
+
+  // write one key
+  ASSERT_OK(
+      db_->Put(WriteOptions(), handles_[1], "initial key", "initial value"));
+
+  std::unique_ptr<TransactionLogIterator> iter;
+
+  // scan from initial sequence number after inserting first key
+  {
+    Status s = db_->GetUpdatesSince(initial_seqno, &iter);
+    ASSERT_OK(s);
+    ASSERT_EQ(1, consume_iterator(*iter));
+  }
+
+  // not creating this snapshot will make the test succeed!
+  const Snapshot* snapshot = db_->GetSnapshot();
+
+  {
+    // now ingest data using SstFileWriter
+    SstFileWriter sfw_cf(EnvOptions(), options, handles_[1]);
+
+    // cf.sst
+    const std::string cf_sst_name = "cf.sst";
+    const std::string cf_sst = sst_files_dir_ + cf_sst_name;
+    ASSERT_OK(sfw_cf.Open(cf_sst));
+    ASSERT_OK(sfw_cf.Put("ingested key 1", "ingested value 1"));
+    ASSERT_OK(sfw_cf.Put("ingested key 2", "ingested value 2"));
+    ASSERT_OK(sfw_cf.Finish());
+
+    std::vector<std::string> files;
+    files.push_back(cf_sst);
+ 
+    IngestExternalFileOptions ingestion_options;
+    Status s = db_->IngestExternalFile(handles_[1], files, ingestion_options);
+    ASSERT_OK(s);
+  }
+
+  db_->ReleaseSnapshot(snapshot);
+ 
+  // scan again from initial sequence number
+  {
+    Status s = db_->GetUpdatesSince(initial_seqno, &iter);
+    ASSERT_OK(s);
+    ASSERT_EQ(1, consume_iterator(*iter));
+  }
+  
+  // write one more key
+  ASSERT_OK(
+      db_->Put(WriteOptions(), handles_[1], "another key", "another value"));
+  
+  // scan again from initial sequence number
+  {
+    Status s = db_->GetUpdatesSince(initial_seqno, &iter);
+    ASSERT_OK(s);
+    ASSERT_EQ(2, consume_iterator(*iter));
+  }
+}
+
 TEST_F(ImportColumnFamilyTest, ImportSSTFileWriterFiles) {
   Options options = CurrentOptions();
   CreateAndReopenWithCF({"koko"}, options);

jsteemann avatar May 17 '22 11:05 jsteemann

Btw, using ingest_options.snapshot_consistency = false; will probably fix this one example test, but still leads to other "gap in sequence numbers" errors in more complex use cases.

jsteemann avatar May 17 '22 13:05 jsteemann

The bool strict argument in TransactionLogIteratorImpl::SeekToStartSequence(..., bool strict) does not make much sense to me. It's pretty hard to use because it's difficult for the application to pick a sequence number, pass it to GetUpdatesSince() and expect it to be exactly the starting sequence number of a write batch in the WAL. The reason is that RocksDB can merge write batches from multiple threads writing to the db. Also remember that, for write-committed transactions, a write batch can cover a range of sequence numbers.

Back to the issue here, if snapshot_consistency is true, I think we need to increment the sequence number. It is TransactionLogIterator that needs to be adapted. I can understand the original motivation for checking contiguous sequence numbers, but the assumption has changed since then. We need to update TransactionLogIterator. @siying can correct me if I missing anything.

riversand963 avatar May 18 '22 16:05 riversand963

Assigning myself to follow-up.

riversand963 avatar May 18 '22 16:05 riversand963

@riversand963 : can you already estimate if the changes to TransactionLogIterator can be made in the near/mid-term future, and when? This would greatly help because this issue keeps us from using IngestExternalFile(). Thanks!

jsteemann avatar May 31 '22 14:05 jsteemann

@jsteemann We have not finished planning yet. I may not have bandwidth for this until July, but will create a task to track its progress.

riversand963 avatar May 31 '22 20:05 riversand963

@riversand963 : ok, any progress on this issue is highly appreciated from our side! Thanks!

jsteemann avatar Jun 01 '22 14:06 jsteemann

@riversand963 : wondering if a fix for this can be made for one of the upcoming releases 7.6 or 7.7. It would be super helpful because it would enable us to make use of the SstFileWriter. This is currently not an option because the combination of SstFileWriter with PessimisticTransactionDB and GetUpdatesSince() does not seem to work. Thanks!

jsteemann avatar Aug 04 '22 08:08 jsteemann

@riversand963 : any idea if/when this issue can be investigated or fixed? Thanks!

jsteemann avatar Aug 16 '22 16:08 jsteemann

@riversand963: Is there any progress on this?

neunhoef avatar Sep 12 '22 09:09 neunhoef

This issue is still blocking the adoption of SstFileWriter-enabled code, which would unlock large performance benefits for us if using the SstFileWriter and GetUpdatesSince were made compatible. Thanks a lot for considering this!

jsteemann avatar Nov 03 '22 15:11 jsteemann

Hi @riversand963 or anyone else. We would still like to use the SstFileWriter but can't due to this issue. Do you think it can be addressed any time soon? Thanks!

jsteemann avatar Apr 12 '23 16:04 jsteemann

Hi everyone, an update on if/when this can be fixed would be highly appreciated. Thanks!

jsteemann avatar Aug 17 '23 10:08 jsteemann