datafusion-comet Reported OOM with high-cardinality distinct aggregates

Describe the bug

We have a user report that they are unable to get Comet to run certain aggregate queries that work fine in Spark.

This issue tracks the effort to create a repro case so that we can understand the root cause.

Steps to reproduce

No response

Expected behavior

No response

Additional context

No response

Jul 17 '24 15:07 andygrove

The OOM is happening in native code in the Comet shuffle write processor.

Jul 18 '24 19:07 andygrove

Hmm, is there any stack trace or other hint?

Jul 18 '24 19:07 viirya

Btw, the shuffle write processor pulls batches from the current stage of execution, so the OOM doesn't have to be in shuffle code (during shuffling). For example, the write processor may be pulling batches from the aggregation, and the OOM happens during aggregation.

Jul 18 '24 19:07 viirya
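
To illustrate the point about pull-based execution, here is a minimal sketch in plain Python. It is not Comet or DataFusion code: `scan`, `distinct_aggregate`, `shuffle_write`, `run`, and the `max_groups` budget are all hypothetical names made up for this example. It only shows that when a writer pulls batches from an upstream aggregate, a memory failure inside the aggregate surfaces in the writer's loop.

```python
from typing import Iterator, List

Batch = List[int]


def scan(num_batches: int, batch_size: int) -> Iterator[Batch]:
    """Hypothetical source operator: yields batches of distinct integers."""
    for b in range(num_batches):
        start = b * batch_size
        yield list(range(start, start + batch_size))


def distinct_aggregate(batches: Iterator[Batch], max_groups: int) -> Iterator[Batch]:
    """Hypothetical distinct aggregate with an assumed state budget.

    State grows with the number of distinct values, so high cardinality
    is what exhausts the budget.
    """
    seen = set()
    for batch in batches:
        seen.update(batch)
        if len(seen) > max_groups:
            raise MemoryError(
                "aggregate state exceeded budget: %d groups" % len(seen)
            )
    yield sorted(seen)


def shuffle_write(batches: Iterator[Batch]) -> int:
    """Hypothetical writer: pulling a batch here drives the upstream operators."""
    rows = 0
    for batch in batches:  # an upstream failure is raised on this line
        rows += len(batch)
    return rows


def run(max_groups: int) -> str:
    plan = distinct_aggregate(scan(4, 100), max_groups)
    try:
        return "wrote %d rows" % shuffle_write(plan)
    except MemoryError as e:
        # The traceback passes through shuffle_write even though the
        # writer itself allocated almost nothing.
        return "failed in writer loop: %s" % e
```

Calling `run(150)` fails while the writer is iterating, although the allocation that broke the budget happened in the aggregate. A stack trace from the real failure would be needed to tell the two cases apart.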

They also said they turned shuffle off and got an OOM in Java code. I don't think we have enough info.

Jul 18 '24 19:07 viirya

Closing this issue since it is vague and cannot be reproduced. Native shuffle was re-implemented to be more memory efficient since this issue was filed.

Jun 16 '25 19:06 andygrove