datafusion-comet icon indicating copy to clipboard operation
datafusion-comet copied to clipboard

Support RangePartitioning with native shuffle

Open andygrove opened this issue 1 year ago • 7 comments

What is the problem the feature request solves?

We do not currently support RangePartitioning with native shuffle.

Adding this support would allow us to use native shuffle for more queries, including queries where there is a global sort.

Describe the potential solution

No response

Additional context

No response

andygrove avatar May 22 '24 15:05 andygrove

Since the early development of Comet shuffle, this is one item on to do list. After we finish columnar shuffle which has higher coverage and support main partitioning types, we turn our focus to other more urgent items. Because we want to increase query coverage as much as possible. Currently we can run TPC-H queries mostly native, seems we can go back to look at this as it can give us more performance gain on shuffle (because native shuffle is faster than columnar shuffle).

viirya avatar May 22 '24 15:05 viirya

I will work on this after finishing HashJoin build right support.

viirya avatar May 22 '24 15:05 viirya

This would help with TPC-H performance, so I plan on taking a look at this issue soon

andygrove avatar Oct 03 '24 16:10 andygrove

Hi @andygrove May I ask why we decide not support RangePartitioning ? and will it be supported in the near future ? Thanks

jinwenjie123 avatar Mar 28 '25 20:03 jinwenjie123

Hi @andygrove May I ask why we decide not support RangePartitioning ? and will it be supported in the near future ? Thanks

We didn't decide not to support it.

It just hasn't been implemented yet because it is more complex than other partitioning schemes. It would be great if someone could submit a PR to add support for this. It could really help with benchmark results and performance overall.

andygrove avatar Mar 29 '25 01:03 andygrove

I discussed this feature with @mbutrovich recently and he may have additional thoughts on this topic.

andygrove avatar Mar 29 '25 01:03 andygrove

The implementation issue or difference for RangePartitioning other than other partitioning like HashPartitioning, is that it involves some sampling operations that perform with RDD in Spark. That's said Comet lib at each executor cannot determine the sampling results separately like we did for general query execution.

viirya avatar Mar 29 '25 01:03 viirya