[SPARK-51831][SQL] Column pruning with existsJoin for Datasource V2

Open jackylee-ch opened this issue 7 months ago • 1 comments

Why are the changes needed?

Recently, I have been testing TPC-DS queries based on DataSource V2, and noticed that column pruning does not occur in scenarios involving EXISTS (SELECT * FROM ... WHERE ...). As a result, the scan ends up reading all columns instead of just the required ones. This issue is reproducible in queries like Q10, Q16, Q35, Q69, and Q94.
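To see why pruning is safe here, recall that an EXISTS subquery only tests whether a matching row exists, so the subquery side needs just the columns used in the correlation and filter predicates. The toy model below illustrates this in plain Python; it is not Spark's optimizer, and the tables, column values, and helper names (`left_semi_join`, `prune`) are hypothetical, loosely mirroring a query such as `SELECT * FROM t2 WHERE EXISTS (SELECT * FROM t1 WHERE t1.id = t2.id AND t1.col1 > 5)`.

```python
# Toy model (not Spark code): an EXISTS subquery behaves like a left-semi join,
# which only needs the join key and the filter columns from the subquery side.

def left_semi_join(left, right, key, predicate):
    """Keep rows of `left` that have at least one `right` row with the same key passing `predicate`."""
    matching_keys = {row[key] for row in right if predicate(row)}
    return [row for row in left if row[key] in matching_keys]

def prune(rows, columns):
    """Project each row down to the given columns, like a pruned scan would."""
    return [{c: row[c] for c in columns} for row in rows]

# Hypothetical data: t1 is the subquery side and carries an extra, unused column.
t2 = [{"id": 1}, {"id": 2}, {"id": 3}]
t1 = [
    {"id": 1, "col1": 10, "col2": "a"},
    {"id": 2, "col1": 3, "col2": "b"},
    {"id": 4, "col1": 9, "col2": "c"},
]

def col1_gt_5(row):
    return row["col1"] > 5

full_result = left_semi_join(t2, t1, "id", col1_gt_5)
pruned_result = left_semi_join(t2, prune(t1, ["id", "col1"]), "id", col1_gt_5)
```

Both calls return the same rows, which is why a scan that reads only `id` and `col1` is sufficient for this query shape.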

This PR introduces PostV2ScanRelationPushdown to address the column pruning issues that may arise after optimizer rules are applied.

Below are the plan changes for the newly added test case. Before this PR:

BatchScan parquet file:/private/var/folders/bb/4fvsn8r949d3kghh68lx3sqr0000gp/T/spark-76b1f4fc-2e84-485c-aade-a62168987baf/t1[id#32L, col1#33L, col2#34L, col3#35L, col4#36L, col5#37L, col6#38L, col7#39L, col8#40L, col9#41L] ParquetScan DataFilters: [isnotnull(col1#33L), (col1#33L > 5)], Format: parquet, Location: InMemoryFileIndex(1 paths)[file:/private/var/folders/bb/4fvsn8r949d3kghh68lx3sqr0000gp/T/spark-76..., PartitionFilters: [], PushedAggregation: [], PushedFilters: [IsNotNull(col1), GreaterThan(col1,5)], PushedGroupBy: [], ReadSchema: struct<id:bigint,col1:bigint,col2:bigint,col3:bigint,col4:bigint,col5:bigint,col6:bigint,col7:big... RuntimeFilters: []

After this PR:

BatchScan parquet file:/private/var/folders/bb/4fvsn8r949d3kghh68lx3sqr0000gp/T/spark-cd4b50d9-1643-40e6-a8e1-1429d3213411/t1[id#133L, col1#134L] ParquetScan DataFilters: [isnotnull(col1#134L), (col1#134L > 5)], Format: parquet, Location: InMemoryFileIndex(1 paths)[file:/private/var/folders/bb/4fvsn8r949d3kghh68lx3sqr0000gp/T/spark-cd..., PartitionFilters: [], PushedAggregation: [], PushedFilters: [IsNotNull(col1), GreaterThan(col1,5)], PushedGroupBy: [], ReadSchema: struct<id:bigint,col1:bigint> RuntimeFilters: []
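One way to compare the two plans above programmatically is to extract the output attribute list that follows the table path in each `BatchScan` line. The helper below is a minimal sketch using only the Python standard library; `scan_output_columns` is a hypothetical name, the sample strings are shortened stand-ins for the real plan lines, and it assumes the first bracketed list in the line is the scan output.

```python
import re
from typing import List

def scan_output_columns(plan_line: str) -> List[str]:
    """Return column names from the first [...] list in a BatchScan plan line.

    Assumption: the first bracketed list is the scan output, e.g. "[id#133L, col1#134L]".
    """
    match = re.search(r"\[([^\]]*)\]", plan_line)
    if match is None:
        return []
    attrs = [a.strip() for a in match.group(1).split(",") if a.strip()]
    # Drop the expression ID suffix such as "#133L".
    return [re.sub(r"#\d+L?$", "", a) for a in attrs]

# Shortened, illustrative plan lines (paths and column lists abbreviated).
before = "BatchScan parquet file:/tmp/t1[id#32L, col1#33L, col2#34L] ParquetScan DataFilters: [isnotnull(col1#33L)]"
after = "BatchScan parquet file:/tmp/t1[id#133L, col1#134L] ParquetScan DataFilters: [isnotnull(col1#134L)]"

before_cols = scan_output_columns(before)
after_cols = scan_output_columns(after)
```

A test could then assert that the columns read after the change are a strict subset of those read before it.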

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Newly added UT.

Was this patch authored or co-authored using generative AI tooling?

No.

jackylee-ch avatar May 29 '25 08:05 jackylee-ch

friendly ping @cloud-fan

LuciferYang avatar Jun 05 '25 04:06 LuciferYang

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

github-actions[bot] avatar Sep 19 '25 00:09 github-actions[bot]

Removed Stale because it represented a genuine existing issue.

LuciferYang avatar Sep 19 '25 05:09 LuciferYang

This approach is too hacky: it caches the scan builder with a tree node tag.

Do you know which rule creates more column pruning opportunities? We should probably delay the v2 column pruning, or run that rule earlier.

cloud-fan avatar Sep 22 '25 13:09 cloud-fan

We should probably delay the v2 column pruning.

This approach sounds good to me; I will try it this way.

jackylee-ch avatar Sep 23 '25 01:09 jackylee-ch

@cloud-fan CI is green now. Any concerns about the current approach?

jackylee-ch avatar Sep 25 '25 03:09 jackylee-ch

thanks, merging to master!

cloud-fan avatar Sep 25 '25 07:09 cloud-fan

Thanks for your review. @cloud-fan @LuciferYang

jackylee-ch avatar Sep 25 '25 08:09 jackylee-ch

@cloud-fan Do you think we need to backport this PR to branch-4.1 and branch-3.5?

jackylee-ch avatar Sep 25 '25 13:09 jackylee-ch