[SPARK-48589][SQL][SS] Add option snapshotStartBatchId and snapshotPartitionId to state data source
What changes were proposed in this pull request?
This PR defines two new options, snapshotStartBatchId and snapshotPartitionId, for the existing state reader. Both of them should be provided at the same time.
- When there is no snapshot file at snapshotStartBatchId (note the off-by-one difference between version and batch ID), throw an exception.
- Otherwise, the reader should continue to rebuild the state by reading delta files only, and ignore all snapshot files afterwards.
- Note that if a batchId option is already specified, that batchId is the ending batch ID, so the reader should stop at that batchId.
- This feature supports state generated by the HDFS state store provider and by the RocksDB state store provider with changelog checkpointing enabled. It does not support RocksDB with changelog checkpointing disabled, which is the default for RocksDB.
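The rules in the list above can be illustrated with a toy model in plain Python. This is only a sketch, not the code in this PR: the function name `replay_state`, the dict-based "files", and the mapping version = batchId + 1 are assumptions made for illustration.

```python
def replay_state(snapshots, deltas, snapshot_start_batch_id, end_batch_id):
    """Toy model: rebuild state from one snapshot plus delta files only.

    snapshots: dict mapping version -> full key/value state at that version
    deltas:    dict mapping version -> {key: new_value}, where None means delete
    All names here are hypothetical; they do not mirror Spark internals.
    """
    if end_batch_id < snapshot_start_batch_id:
        raise ValueError("end batch must not precede the starting snapshot batch")

    # Assumption: the state version is one ahead of the batch ID (off-by-one).
    start_version = snapshot_start_batch_id + 1
    end_version = end_batch_id + 1

    # Rule 1: a missing snapshot at the requested start is an error.
    if start_version not in snapshots:
        raise FileNotFoundError(f"no snapshot file for version {start_version}")

    state = dict(snapshots[start_version])

    # Rules 2 and 3: apply deltas only, ignore later snapshots, stop at the end batch.
    for version in range(start_version + 1, end_version + 1):
        if version not in deltas:
            raise FileNotFoundError(f"no delta file for version {version}")
        for key, value in deltas[version].items():
            if value is None:
                state.pop(key, None)
            else:
                state[key] = value
    return state
```

In this model a later snapshot (for example a corrupted one) is never opened, because the loop only consults `deltas` once the starting snapshot has been loaded.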
Why are the changes needed?
Sometimes, when a snapshot is corrupted, users want to bypass it when reading a later state. This PR gives users the ability to specify the starting snapshot version and partition. This feature can be useful for debugging purposes.
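As a rough usage sketch, the new options could be passed to the state data source reader as shown below. The helper name `read_state_from_snapshot` and its parameters are hypothetical; only the option keys come from this PR, and `spark` is assumed to be an existing SparkSession.

```python
def read_state_from_snapshot(spark, checkpoint_path, snapshot_start_batch_id,
                             snapshot_partition_id, end_batch_id=None):
    """Read state starting from a chosen snapshot (hypothetical helper).

    `spark` is expected to be a SparkSession. Both snapshot options are
    always set together, since they must be provided at the same time.
    """
    reader = (
        spark.read.format("statestore")
        .option("snapshotStartBatchId", snapshot_start_batch_id)
        .option("snapshotPartitionId", snapshot_partition_id)
    )
    if end_batch_id is not None:
        # batchId is the pre-existing option; here it marks where the replay ends.
        reader = reader.option("batchId", end_batch_id)
    return reader.load(checkpoint_path)
```

For example, `read_state_from_snapshot(spark, "/path/to/checkpoint", 3, 0, end_batch_id=7)` would ask the reader to start from the snapshot for batch 3 in partition 0 and rebuild state up to batch 7; the path is a placeholder.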
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Created test cases covering edge cases for the input of the new options. Created a test for the new public function replayReadStateFromSnapshot. Created integration tests for the new options against four stateful operators: limit, aggregation, deduplication, and stream-stream join. Generating state within the tests is unstable, so I prepared golden files for the integration tests instead.
Was this patch authored or co-authored using generative AI tooling?
No.
Is it necessary to add an end-to-end test for the options? If so, I can create another PR. The way to construct it is probably to sleep long enough for the maintenance task to run. @anishshri-db @HeartSaVioR
@WweiL Tagging myself so it shows on my dashboard
The failure seems related to the error class changes?
To summarize the conversation above, I will refactor the code based on the guideline that the APIs added by this feature should not be used by streaming queries. Therefore, they should be isolated from the current code in terms of function names and logic.
Thanks for all the careful checks by @HeartSaVioR @anishshri-db @WweiL. This PR is ready to merge.
Thanks! Merging to master.