Support ingesting SST files generated by a live DB
Summary: ... to enable use cases like using RocksDB to merge-sort data for ingestion. A new file ingestion option, IngestExternalFileOptions::from_live_db, is introduced to allow users to ingest SST files generated by live DBs instead of SstFileWriter. For now this only works if the SST files being ingested have zero as their largest sequence number.
Since the feature is not forward-compatible, the option is marked as experimental for now, in case we want a more explicit opt-in, e.g., introducing MANIFEST versioning and requiring a new MANIFEST version to be set.
Main changes needed to enable this:
- ignore CF id mismatch during ingestion
- ignore the missing external file version table property
- add a new field ignore_seqno_in_file to FileMetaData (persisted in MANIFEST) to let the table reader know to use the largest sequence number as the global sequence number when reading the file.
Rest of the change is mostly new unit tests and stress test changes.
A previous attempt is in #5602.
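For illustration, a minimal usage sketch of the workflow described above. Paths and setup are illustrative, error handling is omitted, and from_live_db is the EXPERIMENTAL option introduced in this PR:

```cpp
#include <cassert>
#include <string>
#include <vector>

#include "rocksdb/db.h"
#include "rocksdb/options.h"

using namespace ROCKSDB_NAMESPACE;

int main() {
  Options options;
  options.create_if_missing = true;

  // Open a temporary DB used only to merge-sort the data into SST files.
  DB* tmp_db = nullptr;
  assert(DB::Open(options, "/tmp/ingest_tmp_db", &tmp_db).ok());
  // ... write the data to tmp_db ...

  // Compact everything to the bottommost level so the surviving keys end up
  // with sequence number 0, which this feature currently requires.
  assert(tmp_db->CompactRange(CompactRangeOptions(), nullptr, nullptr).ok());

  // Collect the temporary DB's live SST files (name starts with '/').
  std::vector<LiveFileMetaData> files;
  tmp_db->GetLiveFilesMetaData(&files);
  std::vector<std::string> paths;
  for (const auto& f : files) {
    paths.push_back(f.db_path + f.name);
  }
  // In practice one may want to close tmp_db first so its files are immutable.

  // Ingest them into the target DB with the new opt-in flag.
  DB* db = nullptr;
  assert(DB::Open(options, "/tmp/target_db", &db).ok());
  IngestExternalFileOptions ifo;
  ifo.from_live_db = true;  // EXPERIMENTAL, per the summary above
  assert(db->IngestExternalFile(paths, ifo).ok());

  delete tmp_db;
  delete db;
  return 0;
}
```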
Test plan:
- new unit tests
- stress test: add a variation of TestIngestExternalFile(), enabled with ingest_external_file_one_in, that creates the file in a temporary DB.
For now this only works if the SST files being ingested have zero as their largest sequence number ... add a new field ignore_seqno_in_file to FileMetaData (and persisted in MANIFEST) to let table reader know to use largest sequence number as the global sequence number for reading the file.
Sorry I'm confused. Won't the sequence number be zero regardless of whether we get it from the internal key footer, or from interpreting the largest_seqno as a global sequence number? Is this field meant to prepare for some future scenario where the internal key footer might not have the right sequence number?
Sorry, by largest_seqno, I meant the field FileDescriptor::largest_seqno that is persisted in the MANIFEST. This is the global sequence number that's assigned when the file is ingested: https://github.com/facebook/rocksdb/blob/de6d0e5ec3ff81afb36b396d6b286582471ed210/db/external_sst_file_ingestion_job.cc#L471 and can be non-zero.
Got it, thanks. I thought we used to have a trick where setting smallest_seqno == largest_seqno == global_seqno would tell RocksDB to apply that seqno to all keys. It seems ok because in the rare case where it happens naturally (e.g., a file with one key), applying largest_seqno as a global seqno is not harmful. Do you think it works here?
No, wait, that's worse for downgrade compatibility, as it'd silently not interpret the largest_seqno as a global seqno.
Something better for downgrade compatibility could be restricting this feature to ingesting with seqno zero.
That makes sense. I guess then we won't need the ignore_seqno_in_file flag since all keys already have seqno 0.
BTW, Zippy has some plans to use bulk loading to ingest files from another DB into a column family to restore that column family. https://docs.google.com/document/d/1T7RyAD31zaDP-cPXqW7jKpThjyOI8JtbO05CDCNafxM/edit#heading=h.mygkiilv6ocy
One way that I can think of for them to do this is to do multiple bulk loads, one for each level of the original column family. For this to work, I think we need to support arbitrary sequence numbers from the live DB's files.
Does CreateColumnFamilyWithImport() work for that use case?
This seems like the closest thing that's already available for their in-place backup use case. I think they just need to create a new column family by importing the backed-up files to restore, then drop the old column family altogether. I will look into it more and discuss with them.
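For reference, a rough sketch of that restore flow using the existing export/import APIs; names and the export directory are illustrative:

```cpp
#include <string>

#include "rocksdb/db.h"
#include "rocksdb/metadata.h"
#include "rocksdb/options.h"
#include "rocksdb/utilities/checkpoint.h"

using namespace ROCKSDB_NAMESPACE;

// Export the backed-up column family's files, then import them as a new
// column family; the old column family can be dropped afterwards.
Status RestoreColumnFamily(DB* db, ColumnFamilyHandle* backup_cf,
                           ColumnFamilyHandle** restored_cf) {
  Checkpoint* checkpoint = nullptr;
  Status s = Checkpoint::Create(db, &checkpoint);
  if (!s.ok()) return s;

  ExportImportFilesMetaData* metadata = nullptr;
  s = checkpoint->ExportColumnFamily(backup_cf, "/tmp/cf_export", &metadata);
  delete checkpoint;
  if (!s.ok()) return s;

  s = db->CreateColumnFamilyWithImport(ColumnFamilyOptions(), "restored_cf",
                                       ImportColumnFamilyOptions(), *metadata,
                                       restored_cf);
  delete metadata;
  return s;
}
```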
@cbi42 merged this pull request in facebook/rocksdb@4384dd5eeec033926c02d57b4bc3996f388927d1.
Hi @cbi42,
This feature looks quite interesting and promising in terms of performance optimization. I hope you don't mind if I have a couple of questions:
Is this feature already fully implemented in the pull request or is it just a prototype? It seems to "allow" users to ingest live SST files, but when I attempted to do so, the keys still could not be read in the new instance/column family, because a normal live SST usually has key sequence numbers larger than 0.
Additionally, I tried to implement this on my own but still encountered issues reading the ingested keys, even after unifying the SST file version and key sequence number. Could there be any restrictions on the reader side, such as a difference in session ID, that might be preventing the keys from being read?
Thank you for your time and assistance!
Hi @wwl2755.
Is this feature already fully implemented in the pull request or is it just a prototype?
This is implemented. It's in "EXPERIMENTAL" state, but should be usable.
when I attempted to do so, the keys still could not be read in the new instance/column family
This is not expected. The unit tests have several example usages. Did the ingestion return OK status?
Could there be any restrictions on the reader side, such as a difference in session ID, that might be preventing the keys from being read?
I don't think there are such restrictions. The main challenge in the read path is to use a global sequence number for reading these ingested files, even though keys in such files do not have this sequence number. Requiring all keys to have sequence number = 0 makes the read path (like seeking into data blocks) easier. Also we have this assumption where all keys have unique user_key@seqno. So ingested files should not have duplicated keys since they will be assigned the same sequence number.
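For example, a small sketch of a pre-ingestion step that satisfies this requirement, assuming a handle source_db to the DB that produced the files and no live snapshots:

```cpp
#include "rocksdb/db.h"
#include "rocksdb/options.h"

using namespace ROCKSDB_NAMESPACE;

// Force a compaction to the bottommost level; absent snapshots, surviving
// keys get their sequence numbers zeroed out and older versions of the same
// user key are dropped, so the resulting files satisfy both conditions.
Status ZeroOutSeqnos(DB* source_db) {
  CompactRangeOptions cro;
  cro.bottommost_level_compaction = BottommostLevelCompaction::kForce;
  return source_db->CompactRange(cro, /*begin=*/nullptr, /*end=*/nullptr);
}
```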
Hi @cbi42
Thanks for the information. I tried compactRange() before ingestion and it worked.
Also we have this assumption where all keys have unique user_key@seqno. So ingested files should not have duplicated keys since they will be assigned the same sequence number.
I see. That's why there is a compactRange() in the test case: it removes duplicate versions and sets all seqnos to 0 by compacting keys to the bottommost level.
Thanks again for your time and assistance!