Raw SST File Iterator & Reader

Open swamirishi opened this issue 1 year ago • 10 comments

Raw SST file reader for reading tombstone entries from an SST file, along with the sequence number and type of each entry, in RocksDB.

swamirishi avatar Feb 21 '24 21:02 swamirishi

@swamirishi Thank you for this work! Just curious, is the motivation for this class to be able to read the sequence number and type of each entry from a table file? In that case, RocksDB's SstFileDumper and SstFileReader can be augmented to surface these two items too. For example, in SstFileDumper: https://github.com/facebook/rocksdb/blob/cb4f4381f6d3f00b81e693f602839536261ee5f6/table/sst_file_dumper.cc#L500

ikey is a parsed internal key, so you can get the sequence number and type with ikey.sequence and ikey.type. You could also add a command-line option to the sst_dump tool to conditionally print these two fields.

jowlyzhang avatar Feb 22 '24 18:02 jowlyzhang
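
For readers unfamiliar with the internal key layout mentioned above, here is a minimal, self-contained sketch of how a sequence number and type can be recovered from an internal key. It assumes the usual on-disk layout (user key followed by an 8-byte little-endian trailer holding `(sequence << 8) | type`). The names `ParsedKey`, `ParseKey`, `MakeKey` and `IsTombstone` are hypothetical stand-ins, not RocksDB's own `ParsedInternalKey` / `ParseInternalKey`, and only a subset of value types is listed.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <string_view>

// Subset of the type bytes stored in the low byte of the trailer.
enum class ValueType : uint8_t {
  kDeletion = 0x0,
  kValue = 0x1,
  kMerge = 0x2,
  kSingleDeletion = 0x7,
};

// Hypothetical stand-in for a parsed internal key.
struct ParsedKey {
  std::string_view user_key;
  uint64_t sequence;
  uint8_t type;  // raw type byte
};

// Splits an internal key into user key, sequence number and type.
// Returns std::nullopt if the key is too short to hold the 8-byte trailer.
std::optional<ParsedKey> ParseKey(std::string_view internal_key) {
  constexpr std::size_t kTrailer = 8;
  if (internal_key.size() < kTrailer) return std::nullopt;
  const std::size_t n = internal_key.size() - kTrailer;
  uint64_t packed = 0;
  for (std::size_t i = 0; i < kTrailer; ++i) {
    // Little-endian decode of the trailer.
    packed |= static_cast<uint64_t>(
                  static_cast<unsigned char>(internal_key[n + i]))
              << (8 * i);
  }
  return ParsedKey{internal_key.substr(0, n), packed >> 8,
                   static_cast<uint8_t>(packed & 0xff)};
}

// Builds an internal key in the same layout (useful for experiments).
std::string MakeKey(std::string_view user_key, uint64_t seq, ValueType t) {
  const uint64_t packed = (seq << 8) | static_cast<uint8_t>(t);
  std::string out(user_key);
  for (int i = 0; i < 8; ++i) {
    out.push_back(static_cast<char>((packed >> (8 * i)) & 0xff));
  }
  return out;
}

// True for point tombstones (Delete and SingleDelete).
bool IsTombstone(const ParsedKey& k) {
  return k.type == static_cast<uint8_t>(ValueType::kDeletion) ||
         k.type == static_cast<uint8_t>(ValueType::kSingleDeletion);
}
```

Building a key with `MakeKey("foo", 42, ValueType::kDeletion)` and passing it to `ParseKey` should give back the user key, the sequence number, and a type byte that `IsTombstone` recognizes.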

@jowlyzhang In Ozone we use RocksDB as the metadata store. We implemented snapshots in Ozone on top of the RocksDB checkpoint functionality. To perform efficient snapshot diffs, we currently need all the tombstone entries written to the SST files so that we can figure out which keys have changed over the course of multiple checkpoints. At the moment we patch the RocksDB code to build this particular tool, and we wrote our own JNI layer to access the tombstone entries and their sequence numbers. It would be really useful to have this functionality in the SST file reader. Does this seem like a valid use case and a reasonable feature request? I can work on augmenting the SST file reader to do this by adding a flag to the read options. You can take a look at this PR to get a better understanding: https://github.com/apache/ozone/pull/6182

swamirishi avatar Feb 22 '24 19:02 swamirishi

Currently the db_iter skips non-user keys (https://github.com/facebook/rocksdb/blob/003197f0050b8ef3d52d2c291401991a562c773c/db/db_iter.cc#L289), and the SST file reader is tightly coupled to it.

swamirishi avatar Feb 22 '24 19:02 swamirishi

The SST file reader should be returning a table iterator that iterates the table, not a DBIter: https://github.com/facebook/rocksdb/blob/cb4f4381f6d3f00b81e693f602839536261ee5f6/table/sst_file_reader.cc#L90

For a block-based table, this would be a BlockBasedTableIterator: https://github.com/facebook/rocksdb/blob/cb4f4381f6d3f00b81e693f602839536261ee5f6/table/block_based/block_based_table_iterator.h#L24

This iterator iterates the whole table file, so tombstones are surfaced too.

jowlyzhang avatar Feb 22 '24 20:02 jowlyzhang
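
To make the distinction between the two iterator kinds concrete, here is a toy model, not RocksDB code. It assumes entries are stored sorted by user key ascending and sequence number descending, and it ignores snapshots, merges, and range deletions. The types `Entry` and `EntryKind` and the functions `UserVisibleView` and `Tombstones` are hypothetical names introduced only for illustration.

```cpp
#include <cstdint>
#include <string>
#include <vector>

enum class EntryKind { kPut, kDelete };

// One record as it sits in a table file (toy model).
struct Entry {
  std::string user_key;
  uint64_t sequence;
  EntryKind kind;
  std::string value;
};

// What a DBIter-style view exposes: only the newest version of each user
// key, and nothing at all if that newest version is a tombstone.
std::vector<Entry> UserVisibleView(const std::vector<Entry>& raw) {
  std::vector<Entry> out;
  const std::string* prev_key = nullptr;
  for (const Entry& e : raw) {
    if (prev_key != nullptr && *prev_key == e.user_key) {
      continue;  // older version of a key we already handled
    }
    prev_key = &e.user_key;
    if (e.kind == EntryKind::kDelete) {
      continue;  // newest version is a tombstone: key is hidden
    }
    out.push_back(e);
  }
  return out;
}

// What a raw table iteration can additionally recover: the tombstones
// themselves, with their sequence numbers.
std::vector<Entry> Tombstones(const std::vector<Entry>& raw) {
  std::vector<Entry> out;
  for (const Entry& e : raw) {
    if (e.kind == EntryKind::kDelete) out.push_back(e);
  }
  return out;
}
```

In this model a deleted key disappears entirely from the user-visible view, which is why a snapshot-diff tool that needs the tombstones has to iterate the raw entries instead.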

https://github.com/facebook/rocksdb/blob/cb4f4381f6d3f00b81e693f602839536261ee5f6/table/sst_file_reader.cc#L94 makes it a db_iter

swamirishi avatar Feb 22 '24 20:02 swamirishi

I see, so you need an iterator to programmatically iterate the raw SST file to get the tombstones, which means you want to define a public iterator class. Would a separate tool like sst_dump work for your flow? This tool can be augmented to print out the type and sequence number, but you would need to parse its output.

We have a feature that allows users to collect table properties, and it sounds like it could work for your use case. You can use it to programmatically collect the tombstones from each SST file. You can define a TablePropertiesCollectorFactory: https://github.com/facebook/rocksdb/blob/cb4f4381f6d3f00b81e693f602839536261ee5f6/include/rocksdb/table_properties.h#L153

This factory is responsible for creating a TablePropertiesCollector https://github.com/facebook/rocksdb/blob/cb4f4381f6d3f00b81e693f602839536261ee5f6/include/rocksdb/table_properties.h#L94

When each SST file is being created, this TablePropertiesCollector::AddUserKey interface will be invoked with the corresponding user key, entry type, sequence number, etc. https://github.com/facebook/rocksdb/blob/cb4f4381f6d3f00b81e693f602839536261ee5f6/include/rocksdb/table_properties.h#L110-L112

You can define your own TablePropertiesCollector to collect these tombstone entries. When the table has been built, RocksDB calls the TablePropertiesCollector::Finish interface to get the user-defined table properties and persists them in the corresponding SST file (GetReadableProperties returns the human-readable form used for logging). You may want to piggyback on this to save the whole info directly in the SST file, or you can save the info somewhere else and persist only an index in the corresponding SST file that lets you look it up.

Later on, when this SST file needs to be processed, you can use the SstFileReader::GetTableProperties API to get this user-defined property and collect the tombstone info.

jowlyzhang avatar Feb 22 '24 21:02 jowlyzhang
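
The collector approach described above could look roughly like the sketch below. It is deliberately self-contained: `EntryType`, `SequenceNumber` and `UserCollectedProperties` are local stand-ins that mimic the corresponding RocksDB types, the class does not derive from the real `TablePropertiesCollector`, and status return values are omitted. The method shapes follow `AddUserKey` and `Finish` as I recall them from `table_properties.h`, so treat them as an assumption and check the header. The class name `TombstoneCollector` and the property name `ozone.tombstone_count` are hypothetical.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Local stand-ins so the sketch compiles without RocksDB headers.
enum class EntryType {
  kEntryPut,
  kEntryDelete,
  kEntrySingleDelete,
  kEntryMerge,
  kEntryOther,
};
using SequenceNumber = uint64_t;
using UserCollectedProperties = std::map<std::string, std::string>;

struct Tombstone {
  std::string user_key;
  SequenceNumber seq;
  EntryType type;
};

// Hypothetical collector that remembers every point tombstone it is shown
// while a table file is being built.
class TombstoneCollector {
 public:
  // Called once per entry added to the table; non-tombstones are ignored.
  void AddUserKey(const std::string& key, const std::string& /*value*/,
                  EntryType type, SequenceNumber seq,
                  uint64_t /*file_size*/) {
    if (type == EntryType::kEntryDelete ||
        type == EntryType::kEntrySingleDelete) {
      tombstones_.push_back({key, seq, type});
    }
  }

  // Called when the table is complete; emits a user-defined property.
  void Finish(UserCollectedProperties* properties) const {
    (*properties)["ozone.tombstone_count"] =
        std::to_string(tombstones_.size());
  }

  const std::vector<Tombstone>& tombstones() const { return tombstones_; }

 private:
  std::vector<Tombstone> tombstones_;
};
```

A real implementation would also need a factory that creates one collector per table file, and would have to decide whether the collected entries go into the properties themselves or into a separate location.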

We actually wanted a JNI layer that would let us iterate through the SST file and get the key, value, and sequence number for each record, including the tombstone entries. We are not particularly looking to get the table properties.

swamirishi avatar Feb 22 '24 23:02 swamirishi

The function TablePropertiesCollector::AddUserKey will be invoked for each record, including tombstone entries, with its key, value, and sequence number when an SST file is created. So you are effectively iterating through the whole SST file. The only issue is that, within this function, you do not know which SST file it is for. If you just need "all the tombstone entries that are written to the sst files to figure out the keys that have changed over the course of multiple checkpoints", you can collect this info and persist it in a file, say "0011.tombstone_entries.txt", and persist a user-defined table property such as "tombstone_file_location:0011.tombstone_entries.txt" in the SST file. Later on, when you need to apply the checkpoint-related logic, you can read this user-defined table property and then read the corresponding txt file for all the information you need. How you create, parse, and iterate this file is completely outside of RocksDB, and it wouldn't need any RocksDB interface change.

jowlyzhang avatar Feb 23 '24 00:02 jowlyzhang
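
Since the sidecar file format is left entirely to the application, the following is only one possible sketch of writing and reading such a file. The format (one record per line: sequence number, a space, then the key as `0x`-prefixed hex so that arbitrary bytes and empty keys survive) and all names (`TombstoneRecord`, `WriteSidecar`, `ReadSidecar`, `ToHex`, `FromHex`) are assumptions made for illustration; nothing here comes from RocksDB or Ozone.

```cpp
#include <cstddef>
#include <cstdint>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

struct TombstoneRecord {
  std::string user_key;
  uint64_t sequence;
};

// Encodes arbitrary bytes as "0x" followed by lowercase hex digits.
std::string ToHex(const std::string& s) {
  static const char kDigits[] = "0123456789abcdef";
  std::string out = "0x";
  for (unsigned char c : s) {
    out.push_back(kDigits[c >> 4]);
    out.push_back(kDigits[c & 0xf]);
  }
  return out;
}

int HexVal(char c) { return c <= '9' ? c - '0' : c - 'a' + 10; }

// Inverse of ToHex. Assumes well-formed input produced by ToHex.
std::string FromHex(const std::string& h) {
  std::string out;
  for (std::size_t i = 2; i + 1 < h.size(); i += 2) {
    out.push_back(static_cast<char>((HexVal(h[i]) << 4) | HexVal(h[i + 1])));
  }
  return out;
}

// Writes one "<sequence> <hex key>" line per tombstone.
void WriteSidecar(std::ostream& out,
                  const std::vector<TombstoneRecord>& records) {
  for (const TombstoneRecord& r : records) {
    out << r.sequence << ' ' << ToHex(r.user_key) << '\n';
  }
}

// Reads records back until the stream is exhausted or malformed.
std::vector<TombstoneRecord> ReadSidecar(std::istream& in) {
  std::vector<TombstoneRecord> records;
  uint64_t seq = 0;
  std::string hex;
  while (in >> seq >> hex) {
    records.push_back({FromHex(hex), seq});
  }
  return records;
}
```

With a `std::ofstream` and `std::ifstream` in place of the string stream used for testing, the same two functions would cover the "0011.tombstone_entries.txt" file described above.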

RocksDB has some implementations that can be leveraged to create such a raw table iterator. Some of the changes in this PR are not required; I have prototyped a simpler change based on these implementations to achieve a similar effect. It's at: https://github.com/facebook/rocksdb/compare/main...jowlyzhang:rocksdb:internal_key_interator

Let me know if this looks OK to you, and I can work on checking it in.

jowlyzhang avatar Feb 23 '24 18:02 jowlyzhang

Yup, this change definitely serves my purpose.

swamirishi avatar Feb 24 '24 01:02 swamirishi

Hello @swamirishi, I have put together a PR to support a raw table iterator in https://github.com/facebook/rocksdb/pull/12385

Please feel free to comment if any functionality is missing. This PR doesn't include the JNI wrapper yet. I will add that once we have settled that the functionality added there is sufficient.

jowlyzhang avatar Feb 26 '24 22:02 jowlyzhang

Hello @swamirishi, since we have merged https://github.com/facebook/rocksdb/pull/12385, shall we close this PR?

jowlyzhang avatar Apr 03 '24 16:04 jowlyzhang

Yeah, this can be closed in favour of #12385.

swamirishi avatar Apr 05 '24 00:04 swamirishi