[SPARK-52171] [SS] StateDataSource join implementation for state v3
What changes were proposed in this pull request?
Add StateDataSource support for join state format v3, which uses virtual column families for the four join stores. This entails a few changes:
- Schema inference for joins needs to take in oldSchemaFilePaths for state format v3.
- sourceOptions need to be modified when the join store name is specified for state format v3, since the name is no longer the store name but the colFamily name. Subsequent metadata checks must also account for this.
- A new joinColFamilyOpt needs to be passed through to StateReaderInfo, StatePartitionReader, etc., so that the correct column family is read.
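The option rewrite described in the second and third bullets can be sketched roughly as follows. This is an illustrative Python model, not the Scala code in this patch: `SourceOptions`, `resolve_join_options`, and the `join_col_family` field are hypothetical names, and the four join store names and the `"default"` store name are assumptions about the conventions Spark uses.

```python
from dataclasses import dataclass, replace
from typing import Optional

# Assumed names of the four join stores (hypothetical for this sketch).
JOIN_STORE_NAMES = frozenset({
    "left-keyToNumValues",
    "left-keyWithIndexToValue",
    "right-keyToNumValues",
    "right-keyWithIndexToValue",
})
# Assumed name of the single physical store that holds the column families.
DEFAULT_STORE_NAME = "default"


@dataclass(frozen=True)
class SourceOptions:
    """Hypothetical stand-in for the reader's source options."""
    store_name: str = DEFAULT_STORE_NAME
    join_col_family: Optional[str] = None


def resolve_join_options(opts: SourceOptions, state_format_version: int) -> SourceOptions:
    """Reinterpret a join store name as a column family name for format v3."""
    if state_format_version == 3 and opts.store_name in JOIN_STORE_NAMES:
        # In v3 the requested name identifies a virtual column family,
        # so the physical store falls back to the default store.
        return replace(
            opts,
            store_name=DEFAULT_STORE_NAME,
            join_col_family=opts.store_name,
        )
    # Earlier formats keep one physical store per join store name.
    return opts
```

In this model, the resolved `join_col_family` is what would be threaded through to the reader so it opens the matching column family, and any metadata check that compares store names would use the rewritten `store_name`.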
Why are the changes needed?
To enable StateDataSource to read join state written with state format version 3.
Does this PR introduce any user-facing change?
Yes. Previously StateDataSource could not be used on checkpoints that use join state version 3, and now it can.
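As a rough illustration of the user-facing effect, a PySpark read of join state might look like the sketch below. The `statestore` format and the `path` and `joinSide` options come from the existing state data source; the helper name `read_join_state` and the checkpoint path are hypothetical, and the sketch assumes `joinSide` behaves the same way for version 3 checkpoints as it does for earlier versions.

```python
def read_join_state(spark, checkpoint_path, join_side="left"):
    """Load one side of a stream-stream join's state as a DataFrame.

    `spark` is an active SparkSession; `checkpoint_path` is the query's
    checkpoint location (a hypothetical value is used in the usage note).
    """
    return (
        spark.read.format("statestore")
        .option("path", checkpoint_path)
        .option("joinSide", join_side)
        .load()
    )
```

For example, `read_join_state(spark, "/tmp/checkpoints/my-join-query", "right").show()` would display the state rows for the right side of the join, assuming a checkpoint exists at that location.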
How was this patch tested?
Added new unit tests and re-enabled previously disabled unit tests.
Was this patch authored or co-authored using generative AI tooling?
No