incubator-gluten icon indicating copy to clipboard operation
incubator-gluten copied to clipboard

[VL] No extra SortExec needed before WindowExec if window type is `sort`

Open WangGuangxin opened this issue 1 year ago • 6 comments

Backend

VL (Velox)

Bug description

When spark.gluten.sql.columnar.backend.velox.window.type is set to sort, there should be no extra SortExec before WindowExec in plan.

Spark version

None

Spark configurations

No response

System information

No response

Relevant logs

No response

WangGuangxin avatar Jun 25 '24 00:06 WangGuangxin

Can you list such a query as example?

FelixYBW avatar Jun 25 '24 02:06 FelixYBW

In order to remove the duplicated sort operator, we implement the streaming window for spark in velox. Why need to set the sort based window in your use case? Can you share more about your use cases? Thanks.

JkSelf avatar Jun 25 '24 02:06 JkSelf

Can you list such a query as example?

Any window query with following conf set to sort --conf spark.gluten.sql.columnar.backend.velox.window.type=sort

for example (lineitem from tpch)

select row_number() over (partition by l_suppkey order by l_orderkey) from lineitem 

When we set spark.gluten.sql.columnar.backend.velox.window.type=sort, velox will use sort based window builder, so there is no need to sort outside WindowExec operator, right?

image

cc @JkSelf @FelixYBW

WangGuangxin avatar Jun 25 '24 02:06 WangGuangxin

In order to remove the duplicated sort operator, we implement the streaming window for spark in velox. Why need to set the sort based window in your use case? Can you share more about your use cases? Thanks.

@JkSelf We are just doing some investigating. Both SortExec and WindowExec need lots of C2R and R2C within the implemention, especially spill occured. It is hard to do something for Sort outside Window , but for Sort within Window ( aka sort based window), maybe we can eliminate some C2R and C2R.
Do you have any suggestion?

WangGuangxin avatar Jun 25 '24 03:06 WangGuangxin

@WangGuangxin Sort based window is designed for presto backend. And the streaming based window is for spark to remove the sort. Can you use streaming window in your use case? And we also implement the Rowbased streaming window for row_number() function to avoid window oom in gluten. If you choose streaming window, the row streaming window will be applied for row_number() function. Can you have a try? Thanks.

JkSelf avatar Jun 25 '24 03:06 JkSelf

In order to remove the duplicated sort operator, we implement the streaming window for spark in velox. Why need to set the sort based window in your use case? Can you share more about your use cases? Thanks.

@JkSelf We are just doing some investigating. Both SortExec and WindowExec need lots of C2R and R2C within the implemention, especially spill occured. It is hard to do something for Sort outside Window , but for Sort within Window ( aka sort based window), maybe we can eliminate some C2R and C2R. Do you have any suggestion?

Why do you need C2R, R2C?

FelixYBW avatar Jun 25 '24 04:06 FelixYBW