
SST files written using SstFileWriter don't work with TtlDB.ingestExternalFile

Open akshaan opened this issue 4 years ago • 3 comments

When ingestExternalFile is called on a TtlDB with files written using SstFileWriter, the call returns without error, but any subsequent query to the DB fails with (often inscrutable) errors. This is because TtlDB requires a timestamp to be appended to each value, whereas SST files from SstFileWriter don't carry one by default. When querying a TtlDB, the get call strips a presumed timestamp from the end of the fetched value, which causes two kinds of issues:

  • Since the last 4 bytes (32 bits) of the value don't actually carry a timestamp, interpreting them as one causes timestamp validation failures
  • If the timestamp validation passes by chance, downstream record deserialization breaks because the last 4 bytes of the value have been stripped away
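
For illustration, a minimal Java repro of this behavior might look like the sketch below. The paths and the TTL value are placeholders, and the sketch assumes current RocksJava APIs:

```java
import java.util.Arrays;

import org.rocksdb.EnvOptions;
import org.rocksdb.IngestExternalFileOptions;
import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;
import org.rocksdb.SstFileWriter;
import org.rocksdb.TtlDB;

public class TtlIngestRepro {
  public static void main(String[] args) throws RocksDBException {
    RocksDB.loadLibrary();
    final String sstPath = "/tmp/plain.sst"; // placeholder path
    final String dbPath = "/tmp/ttl_db";     // placeholder path
    try (Options options = new Options().setCreateIfMissing(true)) {
      // Build an SST file the normal way: values carry no TTL suffix.
      try (EnvOptions envOptions = new EnvOptions();
           SstFileWriter writer = new SstFileWriter(envOptions, options)) {
        writer.open(sstPath);
        writer.put("key1".getBytes(), "value1".getBytes());
        writer.finish();
      }
      // Ingesting into a TtlDB (24h TTL here) reports success...
      try (TtlDB db = TtlDB.open(options, dbPath, 86400, false);
           IngestExternalFileOptions ingestOpts = new IngestExternalFileOptions()) {
        db.ingestExternalFile(Arrays.asList(sstPath), ingestOpts);
        // ...but the read path misbehaves: TtlDB strips the last 4 bytes of
        // the value as a "timestamp", corrupting the result or failing outright.
        byte[] value = db.get("key1".getBytes());
        System.out.println(value == null ? "null" : new String(value));
      }
    }
  }
}
```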

I was wondering if it would be possible to:

  • Add a mode to SstFileWriter that appends timestamps to records when writing (a manual workaround is sketched after this list)
  • Add a validation step to ingestExternalFile for TtlDB that ensures the SST files being ingested carry a timestamp on each record. (This might be infeasible to check per record at ingestion time, but maybe there's another way?)
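
Until such a mode exists, a caller could emulate it by appending the suffix manually before each put. A hedged sketch, assuming the suffix is the 4-byte little-endian Unix timestamp (in seconds) that DBWithTTL writes internally; this mirrors an internal, undocumented format that could change:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public final class TtlSuffix {
  /**
   * Append the 4-byte little-endian Unix timestamp (seconds) that DBWithTTL
   * normally adds on write. NOTE: this relies on an internal format, not a
   * stable public API.
   */
  public static byte[] withTtlSuffix(byte[] value, int unixSeconds) {
    return ByteBuffer.allocate(value.length + Integer.BYTES)
        .order(ByteOrder.LITTLE_ENDIAN)
        .put(value)
        .putInt(unixSeconds)
        .array();
  }
}
```

With a helper like this, rows would be written as writer.put(key, TtlSuffix.withTtlSuffix(value, now)) before ingestion.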

akshaan avatar Aug 21 '20 15:08 akshaan

I think it would be nice to validate the SSTs before ingesting them into a ttl db. We can encode some information in the properties block of SST files, e.g. whether a timestamp is present, the size of the timestamp, etc. During ingestion, we have access to this information at no extra cost because we already need to open the SSTs in the preparation phase. In addition to encoding timestamps in values, RocksDB can also encode timestamps in keys.
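
To make the idea concrete, the validation step could look roughly like the Java sketch below. The property name `has_ttl_timestamps` is invented for illustration; no such property exists in RocksDB today, and the writer side would need a custom TablePropertiesCollector (a C++-side facility) to emit it:

```java
import java.util.Map;

import org.rocksdb.Options;
import org.rocksdb.RocksDBException;
import org.rocksdb.SstFileReader;
import org.rocksdb.TableProperties;

public class TtlIngestValidation {
  // Hypothetical marker property; RocksDB defines no such property today.
  static final String HAS_TTL_TIMESTAMPS = "has_ttl_timestamps";

  /** Refuse files whose properties block does not advertise TTL suffixes. */
  static void validateForTtlIngest(String sstPath, Options options) throws RocksDBException {
    try (SstFileReader reader = new SstFileReader(options)) {
      reader.open(sstPath);
      TableProperties props = reader.getTableProperties();
      Map<String, String> userProps = props.getUserCollectedProperties();
      if (!"1".equals(userProps.get(HAS_TTL_TIMESTAMPS))) {
        throw new RocksDBException(
            sstPath + " does not carry TTL timestamps; refusing to ingest");
      }
    }
  }
}
```

Checking a per-file property this way keeps ingestion O(1) per file, at the cost of trusting the writer to set the marker honestly.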

Add a mode to SstFileWriter that appends timestamps to records when writing

Yeah, I think this can be done.

I am going to self-assign this issue for easier tracking. Feel free to assign to yourself if you are interested in following up.

riversand963 avatar Mar 26 '21 04:03 riversand963

+1 on this. This was an unfortunate discovery on the project I am working on. We are building SST files from Redshift data for use in a data pipeline for event enrichment, and we wanted to use a TtlDB.

As a workaround, we are considering simply rebuilding the database periodically to achieve the same TTL behavior.

dparrella avatar Jan 19 '22 15:01 dparrella

+1. I also hit the same issue using the Java API. We would like to bulk-load data using SST files into a TTL-enabled database, but that fails.

hbraux avatar Jul 29 '22 06:07 hbraux