Support ingesting SST files generated by a live DB
Summary: ... to enable use cases like using RocksDB to merge-sort data for ingestion. A new file ingestion option, IngestExternalFileOptions::from_live_db, is introduced to allow users to ingest SST files generated by live DBs instead of SstFileWriter. For now this only works if the SST files being ingested have zero as their largest sequence number.
Since the feature is not forward-compatible, the option is marked as experimental for now, in case we want a more explicit opt-in, e.g., introducing MANIFEST versioning and requiring a new MANIFEST version to be set.
Main changes needed to enable this:
- ignore CF id mismatch during ingestion
- ignore the missing external file version table property
- add a new field ignore_seqno_in_file to FileMetaData (persisted in MANIFEST) to let the table reader know to use the largest sequence number as the global sequence number when reading the file.
Rest of the change is mostly new unit tests and stress test changes.
A previous attempt is in #5602.
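For illustration, a minimal usage sketch of the workflow described above. Paths and setup are illustrative, error handling is omitted, and from_live_db is the EXPERIMENTAL option introduced in this PR:

```cpp
#include <cassert>
#include <string>
#include <vector>

#include "rocksdb/db.h"
#include "rocksdb/options.h"

using namespace ROCKSDB_NAMESPACE;

int main() {
  Options options;
  options.create_if_missing = true;

  // Open a temporary DB used only to merge-sort the data into SST files.
  DB* tmp_db = nullptr;
  assert(DB::Open(options, "/tmp/ingest_tmp_db", &tmp_db).ok());
  // ... write the data to tmp_db ...

  // Compact everything to the bottommost level so the surviving keys end up
  // with sequence number 0, which this feature currently requires.
  assert(tmp_db->CompactRange(CompactRangeOptions(), nullptr, nullptr).ok());

  // Collect the temporary DB's live SST files (name starts with '/').
  std::vector<LiveFileMetaData> files;
  tmp_db->GetLiveFilesMetaData(&files);
  std::vector<std::string> paths;
  for (const auto& f : files) {
    paths.push_back(f.db_path + f.name);
  }
  // In practice one may want to close tmp_db first so its files are immutable.

  // Ingest them into the target DB with the new opt-in flag.
  DB* db = nullptr;
  assert(DB::Open(options, "/tmp/target_db", &db).ok());
  IngestExternalFileOptions ifo;
  ifo.from_live_db = true;  // EXPERIMENTAL, per the summary above
  assert(db->IngestExternalFile(paths, ifo).ok());

  delete tmp_db;
  delete db;
  return 0;
}
```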
Test plan:
- new unit tests
- stress test: add a variation of TestIngestExternalFile(), enabled with ingest_external_file_one_in, that creates the file in a temporary DB.
For now this only works if the SST files being ingested have zero as their largest sequence number ... add a new field ignore_seqno_in_file to FileMetaData (and persisted in MANIFEST) to let table reader know to use largest sequence number as the global sequence number for reading the file.
Sorry I'm confused. Won't the sequence number be zero regardless of whether we get it from the internal key footer, or from interpreting the largest_seqno as a global sequence number? Is this field meant to prepare for some future scenario where the internal key footer might not have the right sequence number?
Sorry, by largest_seqno, I meant the field FileDescriptor::largest_seqno that is persisted in the MANIFEST. This is the global sequence number that's assigned when the file is ingested: https://github.com/facebook/rocksdb/blob/de6d0e5ec3ff81afb36b396d6b286582471ed210/db/external_sst_file_ingestion_job.cc#L471 and can be non-zero.
Got it, thanks. I thought we used to have a trick where setting smallest_seqno == largest_seqno == global_seqno would tell RocksDB to apply that seqno to all keys. It seems ok because in the rare case where it happens naturally (e.g., a file with one key), applying largest_seqno as a global seqno is not harmful. Do you think it works here?
No, wait, that's worse for downgrade compatibility, as it'd silently not interpret the largest_seqno as a global seqno.
Something better for downgrade compatibility could be restricting this feature to ingesting with seqno zero.
That makes sense. I guess then we won't need the ignore_seqno_in_file flag since all keys already have seqno 0.
BTW, Zippy has some plans to use bulk loading to ingest files from another DB into a column family to restore that column family. https://docs.google.com/document/d/1T7RyAD31zaDP-cPXqW7jKpThjyOI8JtbO05CDCNafxM/edit#heading=h.mygkiilv6ocy
One way that I can think of for them to do this is to do multiple bulk loads, one for each level of the original column family. For this to work, I think we need to support arbitrary sequence numbers from the live DB's files.
Does CreateColumnFamilyWithImport() work for that use case?
This seems like the closest thing that's already available for their in-place backup use case. I think they just need to create a new column family by importing the backed-up files to restore, then drop the old column family altogether. I will look into it more and discuss with them.
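For reference, a rough sketch of that restore flow using the existing export/import APIs; names and the export directory are illustrative:

```cpp
#include <string>

#include "rocksdb/db.h"
#include "rocksdb/metadata.h"
#include "rocksdb/options.h"
#include "rocksdb/utilities/checkpoint.h"

using namespace ROCKSDB_NAMESPACE;

// Export the backed-up column family's files, then import them as a new
// column family; the old column family can be dropped afterwards.
Status RestoreColumnFamily(DB* db, ColumnFamilyHandle* backup_cf,
                           ColumnFamilyHandle** restored_cf) {
  Checkpoint* checkpoint = nullptr;
  Status s = Checkpoint::Create(db, &checkpoint);
  if (!s.ok()) return s;

  ExportImportFilesMetaData* metadata = nullptr;
  s = checkpoint->ExportColumnFamily(backup_cf, "/tmp/cf_export", &metadata);
  delete checkpoint;
  if (!s.ok()) return s;

  s = db->CreateColumnFamilyWithImport(ColumnFamilyOptions(), "restored_cf",
                                       ImportColumnFamilyOptions(), *metadata,
                                       restored_cf);
  delete metadata;
  return s;
}
```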
@cbi42 merged this pull request in facebook/rocksdb@4384dd5eeec033926c02d57b4bc3996f388927d1.
Hi @cbi42,
This feature looks quite interesting and promising in terms of performance optimization. I hope you don't mind if I have a couple of questions:
Is this feature already fully implemented in the pull request or is it just a prototype? It seems to "allow" users to ingest live SST files, but when I attempted to do so, the keys still could not be read in the new instance/column family, because a normal live SST usually has key sequence numbers larger than 0.
Additionally, I tried to implement this on my own but still encountered issues reading the ingested keys, even after unifying the SST file version and key sequence number. Could there be any restrictions on the reader side, such as a difference in session ID, that might be preventing the keys from being read?
Thank you for your time and assistance!
Hi @wwl2755.
Is this feature already fully implemented in the pull request or is it just a prototype?
This is implemented. It's in "EXPERIMENTAL" state, but should be usable.
when I attempted to do so, the keys still could not be read in the new instance/column family
This is not expected. The unit tests have several example usages. Did the ingestion return OK status?
Could there be any restrictions on the reader side, such as a difference in session ID, that might be preventing the keys from being read?
I don't think there are such restrictions. The main challenge in the read path is to use a global sequence number for reading these ingested files, even though keys in such files do not have this sequence number. Requiring all keys to have sequence number = 0 makes the read path (like seeking into data blocks) easier. Also we have this assumption where all keys have unique user_key@seqno. So ingested files should not have duplicated keys since they will be assigned the same sequence number.
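For example, a small sketch of a pre-ingestion step that satisfies this requirement, assuming a handle source_db to the DB that produced the files and no live snapshots:

```cpp
#include "rocksdb/db.h"
#include "rocksdb/options.h"

using namespace ROCKSDB_NAMESPACE;

// Force a compaction to the bottommost level; absent snapshots, surviving
// keys get their sequence numbers zeroed out and older versions of the same
// user key are dropped, so the resulting files satisfy both conditions.
Status ZeroOutSeqnos(DB* source_db) {
  CompactRangeOptions cro;
  cro.bottommost_level_compaction = BottommostLevelCompaction::kForce;
  return source_db->CompactRange(cro, /*begin=*/nullptr, /*end=*/nullptr);
}
```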
Hi @cbi42
Thanks for the information. I tried compactRange() before ingestion and it worked.
Also we have this assumption where all keys have unique user_key@seqno. So ingested files should not have duplicated keys since they will be assigned the same sequence number.
I see. That's why there is a compactRange() in the test case: it removes duplicate versions and sets all seqnos to 0 by compacting keys to the bottommost level.
Thanks again for your time and assistance!