datafusion-ballista

Optimize shuffle before coalesce

Open • andygrove opened this issue 3 years ago • 3 comments

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

This looks inefficient. We are writing lots of shuffle files, reading them, and coalescing them into a single partition. Can we do the coalesce step before the shuffle write in this case?

[Image: opt-coalesce]

Describe the solution you'd like

Optimize

Describe alternatives you've considered

None

Additional context

None

andygrove, Oct 10 '22 14:10

Could you please share the SQL to reproduce the issue?

mingmwang, Oct 12 '22 14:10

This is from benchmark q2, but I now think that I may be mistaken about this being an issue. The final step needs to coalesce for a sort and we want the parallelism in the previous stage.

andygrove, Oct 13 '22 02:10

OK, if it is not a bug, I think you can close the issue.

mingmwang, Nov 15 '22 16:11