[spark] Support auto disable bucketed scan

Open ulysses-you opened this issue 1 year ago • 1 comments

Purpose

This PR adds a new rule, DisableUnnecessaryPaimonBucketedScan, that automatically disables bucketed scan when it is not actually effective, i.e., when no shuffle exchange has been removed. The change avoids a performance regression, since a bucketed scan may have lower parallelism than a normal scan.

For example: a table is bucketed by key x, but the user joins, groups by, or partitions by column y.
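
The trade-off in this example can be sketched with a small toy model. This is not the rule's actual implementation (the real rule inspects the Spark physical plan); the names `BucketedTable`, `should_use_bucketed_scan`, and `scan_parallelism` are hypothetical. It also assumes that a bucketed scan produces one task per bucket and a normal scan one task per split.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BucketedTable:
    """Hypothetical description of a bucketed table (illustration only)."""
    name: str
    bucket_key: str
    num_buckets: int


def should_use_bucketed_scan(table: BucketedTable, required_key: Optional[str]) -> bool:
    # Keep the bucketed scan only if a downstream operator (join, group-by,
    # partition-by) clusters on the bucket key, so a shuffle is really avoided.
    return required_key is not None and required_key == table.bucket_key


def scan_parallelism(table: BucketedTable, required_key: Optional[str], num_splits: int) -> int:
    # Assumption: bucketed scan -> one task per bucket; normal scan -> one task per split.
    if should_use_bucketed_scan(table, required_key):
        return table.num_buckets
    return num_splits


t = BucketedTable(name="t", bucket_key="x", num_buckets=4)
print(should_use_bucketed_scan(t, "x"))  # operator on the bucket key x
print(should_use_bucketed_scan(t, "y"))  # operator on another column y
print(scan_parallelism(t, "y", num_splits=64))
```

In this model, a query keyed on y falls back to the normal scan and keeps the split-level parallelism, because the bucketed scan would not have removed any shuffle.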

Note: this rule is inspired by Spark's DisableUnnecessaryBucketedScan, but it works for v2 scans.

Tests

Add test.

API and Format

Documentation

Aug 09 '24 08:08 ulysses-you

It seems the Spark test failed.

Aug 11 '24 11:08 JingsongLi

@JingsongLi thank you for the reminder, it took me a while to find the root cause...

Aug 12 '24 01:08 ulysses-you

@JingsongLi @YannByron do you have time to take a look? Thank you.

Aug 12 '24 10:08 ulysses-you

+1 Thanks @ulysses-you for the contribution. Merging...

Aug 16 '24 06:08 JingsongLi