[SPARK-51831][SQL] Column pruning with existsJoin for Datasource V2
Why are the changes needed?
Recently, I have been testing TPC-DS queries based on DataSource V2, and noticed that column pruning does not occur in scenarios involving EXISTS (SELECT * FROM ... WHERE ...). As a result, the scan ends up reading all columns instead of just the required ones. This issue is reproducible in queries like Q10, Q16, Q35, Q69, and Q94.
This PR introduces PostV2ScanRelationPushdown to address the column pruning issues that may arise after optimizer rules are applied.
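To make the expected pruning concrete, here is a toy Python model (not Spark code) of which columns a scan needs when its output only feeds an EXISTS check. The column layout mirrors the scan output in this description. The query shape (correlation on `id`, filter on `col1`) is an assumption inferred from the pruned scan, and `required_scan_columns` is a hypothetical helper.

```python
def required_scan_columns(table_columns, join_keys, filter_columns):
    """Columns a scan must read when its output only feeds an existence check.

    EXISTS returns no columns from the subquery, so SELECT * inside it does
    not make every column required: only the correlation keys and the
    columns used by filters matter.
    """
    needed = set(join_keys) | set(filter_columns)
    # Preserve the table's declared column order, as a read schema would.
    return [c for c in table_columns if c in needed]


# Assumed layout of t1, taken from the scan output in this description.
t1_columns = ["id"] + [f"col{i}" for i in range(1, 10)]

# Assumed query shape:
#   EXISTS (SELECT * FROM t1 WHERE t1.id = <outer>.id AND t1.col1 > 5)
pruned = required_scan_columns(t1_columns, join_keys=["id"], filter_columns=["col1"])
print(pruned)  # ['id', 'col1']
```

Under these assumptions the scan needs 2 of the 10 columns, which is the ReadSchema the fixed plan is expected to report.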
Below are the plan changes for the newly added test case. Before this PR:
BatchScan parquet file:/private/var/folders/bb/4fvsn8r949d3kghh68lx3sqr0000gp/T/spark-76b1f4fc-2e84-485c-aade-a62168987baf/t1[id#32L, col1#33L, col2#34L, col3#35L, col4#36L, col5#37L, col6#38L, col7#39L, col8#40L, col9#41L] ParquetScan DataFilters: [isnotnull(col1#33L), (col1#33L > 5)], Format: parquet, Location: InMemoryFileIndex(1 paths)[file:/private/var/folders/bb/4fvsn8r949d3kghh68lx3sqr0000gp/T/spark-76..., PartitionFilters: [], PushedAggregation: [], PushedFilters: [IsNotNull(col1), GreaterThan(col1,5)], PushedGroupBy: [], ReadSchema: struct<id:bigint,col1:bigint,col2:bigint,col3:bigint,col4:bigint,col5:bigint,col6:bigint,col7:big... RuntimeFilters: []
After this PR:
BatchScan parquet file:/private/var/folders/bb/4fvsn8r949d3kghh68lx3sqr0000gp/T/spark-cd4b50d9-1643-40e6-a8e1-1429d3213411/t1[id#133L, col1#134L] ParquetScan DataFilters: [isnotnull(col1#134L), (col1#134L > 5)], Format: parquet, Location: InMemoryFileIndex(1 paths)[file:/private/var/folders/bb/4fvsn8r949d3kghh68lx3sqr0000gp/T/spark-cd..., PartitionFilters: [], PushedAggregation: [], PushedFilters: [IsNotNull(col1), GreaterThan(col1,5)], PushedGroupBy: [], ReadSchema: struct<id:bigint,col1:bigint> RuntimeFilters: []
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Newly added UT.
Was this patch authored or co-authored using generative AI tooling?
No.
friendly ping @cloud-fan
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!
Removed Stale because it represented a genuine existing issue.
This approach is too hacky: it caches the scan builder with a tree node tag.
Do you know which rule creates more column pruning opportunities? We should probably delay the v2 column pruning, or run that rule earlier.
We should probably delay the v2 column pruning.
This approach sounds good to me; I will try it this way.
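The effect of delaying the v2 column pruning can be illustrated with a toy Python model (not Spark's optimizer). It assumes, without confirmation from this thread, that the extra pruning opportunity comes from the rule that rewrites EXISTS into a left semi join. All function names and plan labels here are hypothetical.

```python
ALL_COLUMNS = ["id"] + [f"col{i}" for i in range(1, 10)]


def columns_referenced(plan):
    """Columns the plan appears to need from t1 at this point in optimization."""
    if plan == "exists(select *)":
        # Before the rewrite, SELECT * makes every column look required.
        return list(ALL_COLUMNS)
    if plan == "left_semi_join":
        # After the rewrite, only the join key and the filter column remain.
        return ["id", "col1"]
    raise ValueError(f"unknown plan: {plan}")


def rewrite_exists(plan):
    """Stand-in for the rule that turns EXISTS into a semi join."""
    return "left_semi_join" if plan == "exists(select *)" else plan


def optimize(plan, prune_before_rewrite):
    """Return the read schema fixed by whichever step builds the scan."""
    if prune_before_rewrite:
        read_schema = columns_referenced(plan)  # scan is built too early
        rewrite_exists(plan)                    # rewrite can no longer shrink it
        return read_schema
    # Pruning is delayed until after the rewrite has run.
    return columns_referenced(rewrite_exists(plan))


early = optimize("exists(select *)", prune_before_rewrite=True)
late = optimize("exists(select *)", prune_before_rewrite=False)
print(len(early), late)  # 10 ['id', 'col1']
```

In this sketch the only difference between the two runs is the order of the steps, which is why moving the pruning later avoids the need to cache the scan builder.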
@cloud-fan CI is green now. Any concerns about the current approach?
thanks, merging to master!
Thanks for your review. @cloud-fan @LuciferYang
@cloud-fan Do you think we need to backport this PR to branch-4.1 and branch-3.5?