SST files written using SstFileWriter don't work with TtlDB.ingestExternalFiles
When `ingestExternalFiles` is called on a `TtlDB` with files written using `SstFileWriter`, the call returns without issue, but any subsequent queries to the DB fail with (often inscrutable) errors. This is because `TtlDB` requires a timestamp to be appended to each record, whereas SST files from `SstFileWriter` don't have this by default. When querying a `TtlDB`, the `get` call attempts to strip a timestamp from the end of the fetched value, and this causes two kinds of issues:
- Since the last 32 bits of the record don't actually carry a timestamp, interpreting them as one causes timestamp validation failures
- If timestamp validation passes by chance, downstream record deserialization breaks because the last 4 bytes of the record have been stripped away
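For reference, a minimal Java sketch of the failure mode (the paths, key/value contents, and TTL value are illustrative, not from the original report):

```java
import java.util.Arrays;
import org.rocksdb.*;

public class TtlIngestRepro {
  public static void main(String[] args) throws RocksDBException {
    RocksDB.loadLibrary();

    // Write a plain SST file: values carry no timestamp suffix.
    try (Options options = new Options().setCreateIfMissing(true);
         EnvOptions envOptions = new EnvOptions();
         SstFileWriter writer = new SstFileWriter(envOptions, options)) {
      writer.open("/tmp/plain.sst");
      writer.put("key1".getBytes(), "value1".getBytes());
      writer.finish();
    }

    // Ingest into a TtlDB: the ingestion itself succeeds, but reads
    // misbehave because get() strips what it assumes is a timestamp.
    try (Options options = new Options().setCreateIfMissing(true);
         TtlDB db = TtlDB.open(options, "/tmp/ttl-db", 86400, false);
         IngestExternalFileOptions ingestOptions =
             new IngestExternalFileOptions()) {
      db.ingestExternalFile(Arrays.asList("/tmp/plain.sst"), ingestOptions);
      byte[] v = db.get("key1".getBytes()); // fails validation or corrupts value
    }
  }
}
```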
I was wondering if it would be possible to:
- Add a mode to `SstFileWriter` that appends timestamps to records when writing
- Add a validation step to `ingestExternalFiles` for `TtlDB` that ensures that the SST files being ingested carry timestamps on each record. (This might be infeasible to check on a per-record basis at ingestion time, but maybe there's another way?)
I think it would be nice to validate the SSTs before ingesting them into a `TtlDB`. We can encode some information in the properties blocks of SST files, e.g. whether a timestamp is present, the size of the timestamp, etc. During ingestion, we have access to this information with no overhead because we already need to open the SSTs in the preparation phase. In addition to encoding timestamps in values, RocksDB can also encode timestamps in keys.
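Until something like that exists, a rough application-side approximation is possible, assuming the `TtlDB` suffix is a 4-byte little-endian Unix timestamp (an internal detail of `DBWithTTL`): scan each SST with `SstFileReader` before ingesting and verify every value ends with a plausible timestamp. This is only a heuristic, since a plain value can happen to end in 4 bytes that look like a valid time:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import org.rocksdb.*;

public class TtlSstCheck {
  // Heuristic pre-ingestion check: does every value in the SST end with
  // a plausible 4-byte little-endian Unix timestamp?
  static boolean looksTtlCompatible(String sstPath) throws RocksDBException {
    try (Options options = new Options();
         SstFileReader reader = new SstFileReader(options)) {
      reader.open(sstPath);
      try (ReadOptions readOptions = new ReadOptions();
           SstFileReaderIterator it = reader.newIterator(readOptions)) {
        long now = System.currentTimeMillis() / 1000L;
        for (it.seekToFirst(); it.isValid(); it.next()) {
          byte[] v = it.value();
          if (v.length < 4) {
            return false;
          }
          int ts = ByteBuffer.wrap(v, v.length - 4, 4)
              .order(ByteOrder.LITTLE_ENDIAN).getInt();
          // Reject timestamps before 2000-01-01 or in the future.
          if (ts < 946_684_800 || ts > now) {
            return false;
          }
        }
      }
    }
    return true;
  }
}
```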
> Add a mode to `SstFileWriter` that appends timestamps to records when writing
Yeah, I think this can be done.
I am going to self-assign this issue for easier tracking. Feel free to assign to yourself if you are interested in following up.
+1 on this. This was an unfortunate discovery on the project I am working on. We are building SST files from Redshift data for use in a data pipeline for event enrichment, and we wanted to use a `TtlDB`.
As a workaround we are considering just re-building the database periodically to achieve the same TTL.
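Another possible workaround, with the big caveat that it depends on an internal on-disk format that could change: append the 4-byte little-endian Unix-timestamp suffix that `DBWithTTL` uses yourself when building the SST files, so the ingested values already look like `TtlDB` records. A minimal sketch (`withTtlSuffix` is a hypothetical helper, not a RocksDB API):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class TtlSuffix {
  // Append the creation-time suffix that TtlDB expects at the end of
  // every value: a 4-byte little-endian Unix timestamp.
  static byte[] withTtlSuffix(byte[] value) {
    return ByteBuffer.allocate(value.length + 4)
        .order(ByteOrder.LITTLE_ENDIAN)
        .put(value)
        .putInt((int) (System.currentTimeMillis() / 1000L))
        .array();
  }
}
```

You would then write records as `writer.put(key, TtlSuffix.withTtlSuffix(rawValue))` when building the files.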
+1. I also found the same issue using the Java API. We would like to bulk-load data using SST files into a TTL-enabled database, but that fails.