spark
spark copied to clipboard
[SPARK-40501][SQL] Enhance 'SpecialLimits' to support project(..., limit(...))
What changes were proposed in this pull request?
The pr aim to enhance 'SpecialLimits' to support project(..., limit(...)), for Improve query performance
Why are the changes needed?
When I query a big table(size / per day: 10T, column size: 1219) with limit 1
A.Scenario 1(run sql in spark-sql) - The results will be fetched soon - The optimization of CollectLimitExec has taken effect
1.SQL: select * from xxx where ..._day = '20220919' limit 1
2.Spark UI:
B.Scenario 2(run sql in spark-shell) - It took a long time to fetch out(still running after 20 minutes...)
1.Code: spark.sql("select * from xxx where ..._day = '20220919' limit 1").show()
2.Spark UI:
C.Scenario 3(run sql in spark-shell) - The results will be fetched soon
1.Code: spark.sql("select * from xxx where ..._day = '20220919'").show(1)
2.Spark UI:
The diff between Scenario 2 and Scenario3 is focus on "Optimized Logical Plan"


After pr:
Scenario 2(run sql in spark-shell) - The results will be fetched soon - The optimization of CollectLimitExec has taken effect
1.Code: spark.sql("select * from xxx where ..._day = '20220919' limit 1").show()
2.Spark UI:
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Add new UT & Pass GA.
Wait, I'm checking the cause of UT failure
Can one of the admins verify this patch?
Wait, I'm checking the cause of UT failure Done
cc @wangyum @cloud-fan FYI
I'm wondering why PushProjectThroughLimit
does not optimize your query. It should push project through limit.
I'm wondering why
PushProjectThroughLimit
does not optimize your query. It should push project through limit.
Actually, it can complete the above optimization, and pass all GAs
Maybe worriedPushProjectThroughLimit
this would affect other rules,I guess.
Two choice(All of them can achieve the expected effect)
A.add a rule - PushProjectThroughLimit
in the optimizer
B.add new match case in SpecialLimits
PushProjectThroughLimit
is already in the optimizer, or did I miss something?
PushProjectThroughLimit
Hmm..., PushProjectThroughLimit
was added in my first version of pr.
Ah sorry I misread the code. Let's add this rule then. I think it's beneficial, as it kinds of "normalize" the order of project and limit operator, so that we can have more chances to merge adjacent projects or limits.
Ah sorry I misread the code. Let's add this rule then. I think it's beneficial, as it kinds of "normalize" the order of project and limit operator, so that we can have more chances to merge adjacent projects or limits.
Done. @cloud-fan
thanks, merging to master!
thanks, merging to master!
should we backport the changes to branch-3.3 ? @cloud-fan
This is a perf improvement, usually we don't backport.
This is a perf improvement, usually we don't backport.
Okay!