rocksdb/rocksdb
IngestExternalFile() breaks GetUpdatesSince() due to "gap in sequence numbers"
When taking a database snapshot and then ingesting .sst files with the SstFileWriter/IngestExternalFile() call, RocksDB seems to bump the WAL sequence number because of the newly ingested files.
This has the consequence that tailing the WAL file via GetUpdatesSince() can stop working because there can be gaps in the WAL sequence numbers.
The problem can be reproduced with the following sequence of actions:

0. Grab the initial sequence number.
1. Put a key into the database. Calling GetUpdatesSince() with the initial sequence number then works fine.
2. Take a snapshot.
3. Ingest an external file. This bumps the sequence number up.
4. Release the snapshot.
5. Put another key into the database.
6. Call GetUpdatesSince() again with the initial sequence number. This now returns "Corruption: Gap in sequence numbers".
It seems to work fine when not taking a snapshot, but I am not 100% clear if there aren't more cases in which an external file's ingestion bumps up sequence numbers. The code in db_impl.cc to bump the sequence numbers is:
int consumed_seqno_count =
    ingestion_jobs[0].ConsumedSequenceNumbersCount();
for (size_t i = 1; i != num_cfs; ++i) {
  consumed_seqno_count =
      std::max(consumed_seqno_count,
               ingestion_jobs[i].ConsumedSequenceNumbersCount());
}
if (consumed_seqno_count > 0) {
  const SequenceNumber last_seqno = versions_->LastSequence();
  versions_->SetLastAllocatedSequence(last_seqno + consumed_seqno_count);
  versions_->SetLastPublishedSequence(last_seqno + consumed_seqno_count);
  versions_->SetLastSequence(last_seqno + consumed_seqno_count);
}
Given that external file ingestion completely breaks WAL tailing via GetUpdatesSince(), I am wondering whether I am just missing something obvious, whether this is simply unsupported, or whether it is a major bug.
Expected behavior
Ingesting external files should not render WAL tailing via GetUpdatesSince() unusable.
Actual behavior
Ingesting external files renders WAL tailing via GetUpdatesSince() unusable, at least when a snapshot has been taken, and probably under additional conditions as well.
Steps to reproduce the behavior
Apply the following diff to db/import_column_family_test.cc and run the test.
The test fails with the "gap in sequence numbers" error when the snapshot has been acquired. When the snapshot lines are commented out, the test passes.
diff --git a/db/import_column_family_test.cc b/db/import_column_family_test.cc
index e459e935e..1e0e53281 100644
--- a/db/import_column_family_test.cc
+++ b/db/import_column_family_test.cc
@@ -69,6 +69,102 @@ class ImportColumnFamilyTest : public DBTestBase {
ExportImportFilesMetaData* metadata_ptr_;
};
+TEST_F(ImportColumnFamilyTest, ImportSSTFileWriterFilesAndThenCallGetUpdatesSince) {
+ Options options = CurrentOptions();
+ CreateAndReopenWithCF({"koko"}, options);
+
+ struct WALReplayer : public WriteBatch::Handler {
+ size_t count = 0;
+
+ Status PutCF(uint32_t /*column_family_id*/, const rocksdb::Slice& /*key*/,
+ const rocksdb::Slice& /*value*/) override {
+ ++count;
+ return Status::OK();
+ }
+
+ size_t StealCount() {
+ size_t result = count;
+ count = 0;
+ return result;
+ }
+ };
+
+ auto consume_iterator = [](TransactionLogIterator& iter) -> size_t {
+ WALReplayer replayer;
+
+ EXPECT_OK(iter.status());
+ size_t count = 0;
+ // iterator will fail in here because there are gaps in sequence numbers
+ while (iter.Valid()) {
+ BatchResult batch = iter.GetBatch();
+ Status s = batch.writeBatchPtr->Iterate(&replayer);
+ count += replayer.StealCount();
+ iter.Next();
+ }
+ EXPECT_OK(iter.status());
+ return count;
+ };
+
+ // track sequence number before any data has been added
+ SequenceNumber initial_seqno = db_->GetLatestSequenceNumber();
+
+ // write one key
+ ASSERT_OK(
+ db_->Put(WriteOptions(), handles_[1], "initial key", "initial value"));
+
+ std::unique_ptr<TransactionLogIterator> iter;
+
+ // scan from initial sequence number after inserting first key
+ {
+ Status s = db_->GetUpdatesSince(initial_seqno, &iter);
+ ASSERT_OK(s);
+ ASSERT_EQ(1, consume_iterator(*iter));
+ }
+
+ // not creating this snapshot will make the test succeed!
+ const Snapshot* snapshot = db_->GetSnapshot();
+
+ {
+ // now ingest data using SstFileWriter
+ SstFileWriter sfw_cf(EnvOptions(), options, handles_[1]);
+
+ // cf.sst
+ const std::string cf_sst_name = "cf.sst";
+ const std::string cf_sst = sst_files_dir_ + cf_sst_name;
+ ASSERT_OK(sfw_cf.Open(cf_sst));
+ ASSERT_OK(sfw_cf.Put("ingested key 1", "ingested value 1"));
+ ASSERT_OK(sfw_cf.Put("ingested key 2", "ingested value 2"));
+ ASSERT_OK(sfw_cf.Finish());
+
+ std::vector<std::string> files;
+ files.push_back(cf_sst);
+
+ IngestExternalFileOptions ingestion_options;
+ Status s = db_->IngestExternalFile(handles_[1], files, ingestion_options);
+ ASSERT_OK(s);
+ }
+
+ db_->ReleaseSnapshot(snapshot);
+
+ // scan again from initial sequence number
+ {
+ Status s = db_->GetUpdatesSince(initial_seqno, &iter);
+ ASSERT_OK(s);
+ ASSERT_EQ(1, consume_iterator(*iter));
+ }
+
+ // write one more key
+ ASSERT_OK(
+ db_->Put(WriteOptions(), handles_[1], "another key", "another value"));
+
+ // scan again from initial sequence number
+ {
+ Status s = db_->GetUpdatesSince(initial_seqno, &iter);
+ ASSERT_OK(s);
+ ASSERT_EQ(2, consume_iterator(*iter));
+ }
+}
+
TEST_F(ImportColumnFamilyTest, ImportSSTFileWriterFiles) {
Options options = CurrentOptions();
CreateAndReopenWithCF({"koko"}, options);
Btw, using ingestion_options.snapshot_consistency = false; will probably fix this one example test, but it still leads to other "gap in sequence numbers" errors in more complex use cases.
The bool strict argument in TransactionLogIteratorImpl::SeekToStartSequence(..., bool strict) does not make much sense to me. It is hard to use because the application cannot reliably pick a sequence number, pass it to GetUpdatesSince(), and expect it to be exactly the starting sequence number of a write batch in the WAL: RocksDB can merge write batches from multiple threads writing to the db. Also remember that, for write-committed transactions, a write batch can cover a range of sequence numbers.
Back to the issue here: if snapshot_consistency is true, I think we do need to increment the sequence number, so it is TransactionLogIterator that needs to be adapted. I can understand the original motivation for checking for contiguous sequence numbers, but the assumptions have changed since then. @siying can correct me if I'm missing anything.
Assigning myself to follow-up.
@riversand963 : can you already estimate if the changes to TransactionLogIterator can be made in the near/mid-term future, and when? This would greatly help because this issue keeps us from using IngestExternalFile().
Thanks!
@jsteemann We have not finished planning yet. I may not have bandwidth for this until July, but will create a task to track its progress.
@riversand963 : ok, any progress on this issue is highly appreciated from our side! Thanks!
@riversand963 : wondering if a fix for this can be made for one of the upcoming releases 7.6 or 7.7. It would be super helpful because it would enable us to make use of the SstFileWriter. This is currently not an option because the combination of SstFileWriter with PessimisticTransactionDB and GetUpdatesSince() does not seem to work. Thanks!
@riversand963 : any idea if/when this issue can be investigated or fixed? Thanks!
@riversand963: Is there any progress on this?
This issue still blocks our adoption of SstFileWriter-based code; making SstFileWriter and GetUpdatesSince compatible would unlock large performance benefits for us. Thanks a lot for considering this!
Hi @riversand963 or anyone else. We would still like to use the SstFileWriter but can't due to this issue. Do you think it can be addressed any time soon? Thanks!
Hi everyone, an update on if/when this can be fixed would be highly appreciated. Thanks!