Optimize shuffle before coalesce
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
This looks inefficient. We are writing lots of shuffle files, reading them, and coalescing them into a single partition. Can we do the coalesce step before the shuffle write in this case?

Describe the solution you'd like
Optimize

Describe alternatives you've considered
None

Additional context
None
Could you please share the SQL to reproduce the issue?
This is from benchmark q2, but I now think that I may be mistaken about this being an issue. The final step needs to coalesce for a sort and we want the parallelism in the previous stage.
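To illustrate the reasoning, here is a minimal sketch of the two-stage shape in plain Python. It is not the engine's actual code: `sort_partition` and `parallel_then_coalesce` are hypothetical names, and the thread pool only stands in for the executors that would run the earlier stage. The point is that coalescing first would leave the earlier stage with a single partition and nothing to run in parallel.

```python
from concurrent.futures import ThreadPoolExecutor
import heapq


def sort_partition(rows):
    """Per-partition work in the earlier stage (hypothetical stand-in: a local sort)."""
    return sorted(rows)


def parallel_then_coalesce(partitions, max_workers=4):
    # Earlier stage: each partition is independent, so the work is spread across workers.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        sorted_parts = list(pool.map(sort_partition, partitions))
    # Final stage: coalesce to a single partition, since a global sort
    # needs every row in one place. Merging the pre-sorted partitions does that.
    return list(heapq.merge(*sorted_parts))


if __name__ == "__main__":
    partitions = [[5, 1, 9], [4, 8], [7, 2, 6, 3]]
    print(parallel_then_coalesce(partitions))
```

If the coalesce ran before the first stage, `partitions` would contain a single list and the pool would have only one task, which is the loss of parallelism described above.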
OK, if it is not a bug, I think you can close the issue.