Andy Grove
> See [apache/spark@81639090622](https://github.com/apache/spark/commit/81639090622) for changes that were needed to the CPU BroadcastHashJoinExec that are probably relevant to the changes likely needed for the GPU version. This commit updated the `outputPartitioning`...
> Are we waiting to address any feedback on this PR? I think I addressed all of the feedback from @martin-g.
> Thanks @andygrove and @martin-g for the review. I feel this PR is good as it has the consistent issue before the PR and after PR it is...
> I'm a bit worried about this approach because we are implementing greedy mode inside `CometTaskMemoryManager`, which is known to starve consumers frequently. I prefer using fair spill pool for...
Closing in favor of https://github.com/apache/datafusion-comet/pull/1021
This is resolved for v1 data sources but not for v2.
~I have been learning more about Spark shuffle and now understand why this issue does not make sense.~ edit: I thought I understood this, but now I am not so...
Useful reference info: https://medium.com/@philipp.brunenberg/understanding-apache-spark-shuffle-85644d90c8c6
It seems that `ShuffleWriterExec` is invoked by `ShuffleMapTask`, which handles reading the input RDD data, so we cannot easily override this mechanism.
Related upstream changes in arrow-rs: https://github.com/apache/arrow-rs/pull/6419