
TPC-DS q67 causes OOM after repeated runs

Open · andygrove opened this issue 7 months ago · 1 comment

Describe the bug

I am running with the following config:

  • SF 1000 (1TB) dataset
  • 32 executors
  • Each executor has 16 cores and 32 GB memory + 32 GB off-heap memory
  • Data is partitioned by date, so the query uses dynamic partition pruning (DPP), which causes Comet to fall back to Spark early, right after the scans
  • Comet does run a columnar shuffle, but no native shuffle
  • Running in k8s
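
For anyone trying to reproduce this, the setup above could be approximated with settings along these lines. This is only a sketch: the issue does not list the exact configuration, so every key and value below is an assumption reconstructed from the bullet points. The Comet-specific keys are the ones I recall from the Comet documentation and should be checked against the Comet version in use. `EXECUTOR_CONF` and `to_submit_flags` are hypothetical names used for illustration.

```python
# Hypothetical reconstruction of the setup described in the issue.
# The exact settings were not posted, so all values here are assumptions.
EXECUTOR_CONF = {
    # 32 executors, each with 16 cores and 32 GB of on-heap memory
    "spark.executor.instances": "32",
    "spark.executor.cores": "16",
    "spark.executor.memory": "32g",
    # plus 32 GB of off-heap memory per executor
    "spark.memory.offHeap.enabled": "true",
    "spark.memory.offHeap.size": "32g",
    # Comet plugin and shuffle settings (assumed; verify against your Comet version)
    "spark.plugins": "org.apache.spark.CometPlugin",
    "spark.shuffle.manager": (
        "org.apache.spark.sql.comet.execution.shuffle.CometShuffleManager"
    ),
    "spark.comet.exec.shuffle.enabled": "true",
}


def to_submit_flags(conf):
    """Render a config dict as a flat list of spark-submit --conf arguments."""
    flags = []
    for key, value in sorted(conf.items()):
        flags += ["--conf", f"{key}={value}"]
    return flags


if __name__ == "__main__":
    print(" ".join(to_submit_flags(EXECUTOR_CONF)))
```

The rendered flags can be pasted into a `spark-submit` command; the Kubernetes master URL and container image settings are omitted because the issue gives no details about them.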

Executor memory grows over time, and executors start to get killed due to OOM after running the query ~37 times. This seems to indicate some kind of memory leak.

Steps to reproduce

No response

Expected behavior

No response

Additional context

No response

andygrove · May 05 '25 22:05

If I disable Comet shuffle, then the problem seems to go away, so it does look like an issue specific to Comet columnar shuffle.
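
A minimal sketch of that workaround, assuming the toggle in question is `spark.comet.exec.shuffle.enabled` (the comment does not name the setting, so this is a guess based on the Comet documentation). `without_comet_shuffle` is a hypothetical helper that takes any Spark config dict and returns a copy with Comet shuffle turned off.

```python
def without_comet_shuffle(conf):
    """Return a copy of a Spark config dict with Comet shuffle disabled.

    Assumes the relevant switch is spark.comet.exec.shuffle.enabled;
    the issue comment does not say which setting was changed.
    """
    updated = dict(conf)  # leave the caller's dict untouched
    updated["spark.comet.exec.shuffle.enabled"] = "false"
    return updated


if __name__ == "__main__":
    base = {
        "spark.plugins": "org.apache.spark.CometPlugin",
        "spark.comet.exec.shuffle.enabled": "true",
    }
    for key, value in sorted(without_comet_shuffle(base).items()):
        print(f"--conf {key}={value}")
```

Running the same query loop with and without this override is one way to confirm whether the memory growth tracks the Comet columnar shuffle path.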

andygrove · May 06 '25 01:05
