Raw SST File Iterator & reader
Raw SST file reader for reading tombstone entries from an SST file, along with the sequence number and type of each entry in RocksDB.
@swamirishi Thank you for this work! Just curious, is the motivation for this class to be able to read the sequence number and type of each entry from a table file? In that case, RocksDB's SstFileDumper and SstFileReader can be augmented to surface these two items too. For example, in SstFileDumper: https://github.com/facebook/rocksdb/blob/cb4f4381f6d3f00b81e693f602839536261ee5f6/table/sst_file_dumper.cc#L500
ikey is a parsed internal key, so you can get the sequence number and type with ikey.sequence and ikey.type. We could also add a command-line option to the sst_dump tool to conditionally print these two things out.
@jowlyzhang In Ozone we use RocksDB as the metadata store. We implemented snapshots in Ozone on top of the RocksDB checkpoint functionality. To perform efficient snapshot diffs, we currently need all the tombstone entries written to the SST files so we can figure out which keys have changed over the course of multiple checkpoints. Right now we patch the RocksDB code to build this particular tool, and we wrote our own JNI layer to access the tombstone entries and the sequence number. It would be really great to have this functionality on the SST file reader. Does this seem like a valid use case and ask from a feature perspective? I can work on augmenting the SST file reader to do this by adding a flag on the read options. You can take a look at this PR to get a better understanding: https://github.com/apache/ozone/pull/6182
Currently the db_iter skips non-user keys (https://github.com/facebook/rocksdb/blob/003197f0050b8ef3d52d2c291401991a562c773c/db/db_iter.cc#L289), and the SST file reader is tightly coupled with it.
The SST file reader should be returning a table iterator that iterates the table, not a DBIter: https://github.com/facebook/rocksdb/blob/cb4f4381f6d3f00b81e693f602839536261ee5f6/table/sst_file_reader.cc#L90
For a block-based table, this would be a BlockBasedTableIterator: https://github.com/facebook/rocksdb/blob/cb4f4381f6d3f00b81e693f602839536261ee5f6/table/block_based/block_based_table_iterator.h#L24
This iterator iterates the whole table file, so tombstones are surfaced too.
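To make the distinction concrete, here is a toy model of the two views over the same table entries. It is not RocksDB code: Entry, RawScan, and UserScan are hypothetical names. The user-level view keeps only the newest version of each key and hides keys whose newest version is a tombstone, which is roughly the behaviour being attributed to a DB-level iterator in this thread, while the raw view returns every stored entry.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical entry model; illustration only.
enum class EntryKind { kPut, kDelete };

struct Entry {
  std::string user_key;
  uint64_t sequence;
  EntryKind kind;
  std::string value;
};

// Raw view: every entry exactly as stored, tombstones included.
std::vector<Entry> RawScan(const std::vector<Entry>& table) { return table; }

// User view: assumes `table` is sorted by user key ascending and, within a
// key, by sequence number descending (newest first). Only the newest version
// of each key is considered, and it is dropped if it is a tombstone.
std::vector<Entry> UserScan(const std::vector<Entry>& table) {
  std::vector<Entry> out;
  const std::string* last_key = nullptr;
  for (const Entry& e : table) {
    if (last_key != nullptr && *last_key == e.user_key) {
      continue;  // older version of a key that was already handled
    }
    last_key = &e.user_key;
    if (e.kind == EntryKind::kPut) out.push_back(e);
  }
  return out;
}
```

In this model a snapshot diff that needs deleted keys can only be computed from RawScan, since UserScan discards exactly the entries that record the deletions.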
https://github.com/facebook/rocksdb/blob/cb4f4381f6d3f00b81e693f602839536261ee5f6/table/sst_file_reader.cc#L94 makes it a db_iter
I see, so you need an iterator to programmatically iterate the raw SST file to get the tombstones, which is why you want to define a public iterator class. Would a separate tool like sst_dump work for your flow? This tool can be augmented to print out the type and sequence number, though you would need to parse its output.
We have a feature that allows users to collect table properties, and it sounds like it can work for your use case. You can use it to programmatically collect the tombstones from each SST file. You can define a TablePropertiesCollectorFactory: https://github.com/facebook/rocksdb/blob/cb4f4381f6d3f00b81e693f602839536261ee5f6/include/rocksdb/table_properties.h#L153
This factory is responsible for creating a TablePropertiesCollector: https://github.com/facebook/rocksdb/blob/cb4f4381f6d3f00b81e693f602839536261ee5f6/include/rocksdb/table_properties.h#L94
When each SST file is being created, the TablePropertiesCollector::AddUserKey interface will be invoked with the corresponding user key, entry type, sequence number, etc.: https://github.com/facebook/rocksdb/blob/cb4f4381f6d3f00b81e693f602839536261ee5f6/include/rocksdb/table_properties.h#L110-L112
You can define your own TablePropertiesCollector to collect these tombstone entries. RocksDB will call the TablePropertiesCollector::GetReadableProperties interface to get the user-defined table properties and persist them to the corresponding SST file. You may want to piggyback on this to save the whole info directly in the SST file, or you can save this info somewhere else and persist only an index for locating it in the corresponding SST file.
Later on, when this SST file needs to be processed, you can use the SstFileReader::GetTableProperties API to get this user-defined property and collect the tombstone info.
We actually wanted a JNI API that would let us iterate through the SST file and get the key, value, and sequence number for each record, including the tombstone entries. We are not particularly looking to get the table properties.
The function TablePropertiesCollector::AddUserKey will be invoked for each record of an SST file when it is created, with its key, value, and sequence number, and this includes tombstone entries. So you are effectively iterating through the whole SST file. The only issue is that, within this function, you do not know which SST file the record is for. If you just need "all the tombstone entries that are written to the sst files to figure out the keys that have changed over the course of multiple checkpoints", you can collect this info and persist it in a file, say "0011.tombstone_entries.txt", and persist a user-defined table property "tombstone_file_location:0011.tombstone_entries.txt" in the SST file. Later on, when you need to apply the checkpoint-related logic, you can get this user-defined table property and read the corresponding txt file for all the information you need. How you create, parse, and iterate this file would be completely outside of RocksDB, and it would not need any RocksDB interface change.
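The sidecar-file idea can be sketched as below. This is a standalone model under stated assumptions, not a real RocksDB collector: TombstoneRecorder and ReadTombstones are hypothetical names, AddUserKey here only mimics the shape of the callback described above, the one-line-per-entry "sequence, tab, key" format is invented for the example, and keys are assumed not to contain newlines. In a real integration, the recording logic would live in a TablePropertiesCollector subclass.

```cpp
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical recorder that mimics a per-SST-file collector; illustration only.
class TombstoneRecorder {
 public:
  // Called once per entry while the (imagined) SST file is being built.
  void AddUserKey(const std::string& key, bool is_tombstone, uint64_t seq) {
    if (is_tombstone) tombstones_.emplace_back(seq, key);
  }

  // Writes the collected tombstones to a sidecar file and returns the
  // user-defined property that would point at it from the SST file.
  std::map<std::string, std::string> Finish(const std::string& sidecar_path) const {
    std::ofstream out(sidecar_path, std::ios::trunc);
    for (const auto& [seq, key] : tombstones_) {
      out << seq << '\t' << key << '\n';  // assumed format: "<seq>\t<key>"
    }
    std::map<std::string, std::string> props;
    props["tombstone_file_location"] = sidecar_path;
    return props;
  }

 private:
  std::vector<std::pair<uint64_t, std::string>> tombstones_;
};

// Reads the sidecar file back into (sequence, key) pairs.
std::vector<std::pair<uint64_t, std::string>> ReadTombstones(
    const std::string& sidecar_path) {
  std::vector<std::pair<uint64_t, std::string>> result;
  std::ifstream in(sidecar_path);
  std::string line;
  while (std::getline(in, line)) {
    const std::size_t tab = line.find('\t');
    if (tab == std::string::npos) continue;  // skip malformed lines
    result.emplace_back(std::stoull(line.substr(0, tab)), line.substr(tab + 1));
  }
  return result;
}
```

Putting the sequence number first means the key may itself contain tab characters, since only the first tab on each line is treated as the separator.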
RocksDB has some existing implementations that can be leveraged to create such a raw table iterator. Some of the changes in this PR are not required; I have prototyped a simpler change based on these implementations that achieves a similar effect. It's at: https://github.com/facebook/rocksdb/compare/main...jowlyzhang:rocksdb:internal_key_interator
Let me know if this looks OK to you, and I can work on checking it in.
Yup, this change definitely serves my purpose.
Hello @swamirishi, I have put together a PR to support a raw table iterator in https://github.com/facebook/rocksdb/pull/12385
Please feel free to comment if any functionality is missing. This PR doesn't include the JNI wrapper yet. I will add it once we have settled that the functionality added there is sufficient.
Hello @swamirishi, since we have merged https://github.com/facebook/rocksdb/pull/12385, shall we close this PR?
Yeah this can be closed in favour of #12385