spark icon indicating copy to clipboard operation
spark copied to clipboard

[SPARK-40501][SQL] Enhance 'SpecialLimits' to support project(..., limit(...))

Open panbingkun opened this issue 2 years ago • 2 comments

What changes were proposed in this pull request?

The pr aim to enhance 'SpecialLimits' to support project(..., limit(...)), for Improve query performance

Why are the changes needed?

When I query a big table(size / per day: 10T, column size: 1219) with limit 1

A.Scenario 1(run sql in spark-sql) - The results will be fetched soon - The optimization of CollectLimitExec has taken effect

1.SQL: select * from xxx where ..._day = '20220919' limit 1 image 2.Spark UI: image

B.Scenario 2(run sql in spark-shell) - It took a long time to fetch out(still running after 20 minutes...)

1.Code: spark.sql("select * from xxx where ..._day = '20220919' limit 1").show() image 2.Spark UI: image

C.Scenario 3(run sql in spark-shell) - The results will be fetched soon

1.Code: spark.sql("select * from xxx where ..._day = '20220919'").show(1) image 2.Spark UI: image

The diff between Scenario 2 and Scenario3 is focus on "Optimized Logical Plan"

image image

After pr:

Scenario 2(run sql in spark-shell) - The results will be fetched soon - The optimization of CollectLimitExec has taken effect

1.Code: spark.sql("select * from xxx where ..._day = '20220919' limit 1").show() image 2.Spark UI: image

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Add new UT & Pass GA.

panbingkun avatar Sep 20 '22 06:09 panbingkun

Wait, I'm checking the cause of UT failure

panbingkun avatar Sep 21 '22 07:09 panbingkun

Can one of the admins verify this patch?

AmplabJenkins avatar Sep 21 '22 07:09 AmplabJenkins

Wait, I'm checking the cause of UT failure Done

panbingkun avatar Sep 22 '22 02:09 panbingkun

cc @wangyum @cloud-fan FYI

LuciferYang avatar Sep 22 '22 05:09 LuciferYang

I'm wondering why PushProjectThroughLimit does not optimize your query. It should push project through limit.

cloud-fan avatar Sep 22 '22 07:09 cloud-fan

I'm wondering why PushProjectThroughLimit does not optimize your query. It should push project through limit.

Actually, it can complete the above optimization, and pass all GAs Maybe worriedPushProjectThroughLimit this would affect other rules,I guess.

Two choice(All of them can achieve the expected effect) A.add a rule - PushProjectThroughLimit in the optimizer B.add new match case in SpecialLimits

panbingkun avatar Sep 22 '22 08:09 panbingkun

PushProjectThroughLimit is already in the optimizer, or did I miss something?

cloud-fan avatar Sep 22 '22 08:09 cloud-fan

PushProjectThroughLimit

Hmm..., PushProjectThroughLimit was added in my first version of pr.

panbingkun avatar Sep 22 '22 08:09 panbingkun

Ah sorry I misread the code. Let's add this rule then. I think it's beneficial, as it kinds of "normalize" the order of project and limit operator, so that we can have more chances to merge adjacent projects or limits.

cloud-fan avatar Sep 22 '22 13:09 cloud-fan

Ah sorry I misread the code. Let's add this rule then. I think it's beneficial, as it kinds of "normalize" the order of project and limit operator, so that we can have more chances to merge adjacent projects or limits.

Done. @cloud-fan

panbingkun avatar Sep 23 '22 04:09 panbingkun

thanks, merging to master!

cloud-fan avatar Sep 26 '22 07:09 cloud-fan

thanks, merging to master!

should we backport the changes to branch-3.3 ? @cloud-fan

panbingkun avatar Sep 26 '22 08:09 panbingkun

This is a perf improvement, usually we don't backport.

cloud-fan avatar Sep 26 '22 08:09 cloud-fan

This is a perf improvement, usually we don't backport.

Okay!

panbingkun avatar Sep 26 '22 08:09 panbingkun