Support RangePartitioning with native shuffle
What is the problem the feature request solves?
We do not currently support RangePartitioning with native shuffle.
Adding this support would allow us to use native shuffle for more queries, including queries where there is a global sort.
Describe the potential solution
No response
Additional context
No response
Since the early development of Comet shuffle, this is one item on to do list. After we finish columnar shuffle which has higher coverage and support main partitioning types, we turn our focus to other more urgent items. Because we want to increase query coverage as much as possible. Currently we can run TPC-H queries mostly native, seems we can go back to look at this as it can give us more performance gain on shuffle (because native shuffle is faster than columnar shuffle).
I will work on this after finishing HashJoin build right support.
This would help with TPC-H performance, so I plan on taking a look at this issue soon
Hi @andygrove May I ask why we decide not support RangePartitioning ? and will it be supported in the near future ? Thanks
Hi @andygrove May I ask why we decide not support RangePartitioning ? and will it be supported in the near future ? Thanks
We didn't decide not to support it.
It just hasn't been implemented yet because it is more complex than other partitioning schemes. It would be great if someone could submit a PR to add support for this. It could really help with benchmark results and performance overall.
I discussed this feature with @mbutrovich recently and he may have additional thoughts on this topic.
The implementation issue or difference for RangePartitioning other than other partitioning like HashPartitioning, is that it involves some sampling operations that perform with RDD in Spark. That's said Comet lib at each executor cannot determine the sampling results separately like we did for general query execution.