datafusion-comet
TPC-DS q67 causes OOM after repeated runs
Describe the bug
I am running with the following config:
- SF 1000 (1TB) dataset
- 32 executors
- Each executor has 16 cores and 32 GB memory + 32 GB off-heap memory
- Data is partitioned by date, so the query uses dynamic partition pruning (DPP), which causes Comet to fall back to Spark early, right after the scans
- Comet still runs columnar shuffle, but not native shuffle
- Running in k8s
Executor memory grows over time, and executors start getting killed due to OOM after the query has run ~37 times. This suggests some kind of memory leak.
Steps to reproduce
No response
Expected behavior
No response
Additional context
No response
If I disable Comet shuffle, then the problem seems to go away, so it does look like an issue specific to Comet columnar shuffle.
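For anyone trying to compare the two cases, here is a rough sketch of the executor settings described above with the Comet shuffle toggle exposed as a parameter. This is not the actual submit command used for this report (none was provided). The helper names `executor_conf` and `to_submit_args` are hypothetical, and `spark.comet.exec.shuffle.enabled` is assumed to be the relevant switch; the exact config key may differ between Comet versions, so check the configuration guide for the version in use.

```python
from typing import Dict, List


def executor_conf(comet_shuffle_enabled: bool = True) -> Dict[str, str]:
    """Hypothetical helper: settings matching the setup described in this issue."""
    return {
        # Standard Spark settings: 32 executors, 16 cores each,
        # 32 GB heap plus 32 GB off-heap per executor.
        "spark.executor.instances": "32",
        "spark.executor.cores": "16",
        "spark.executor.memory": "32g",
        "spark.memory.offHeap.enabled": "true",
        "spark.memory.offHeap.size": "32g",
        # Assumption: this key toggles Comet shuffle in the version under test.
        "spark.comet.exec.shuffle.enabled": str(comet_shuffle_enabled).lower(),
    }


def to_submit_args(conf: Dict[str, str]) -> List[str]:
    """Render a config dict as a flat list of spark-submit --conf arguments."""
    args: List[str] = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


if __name__ == "__main__":
    # Print the arguments for the run with Comet shuffle disabled.
    print(" ".join(to_submit_args(executor_conf(comet_shuffle_enabled=False))))
```

Running the same query loop once with `comet_shuffle_enabled=True` and once with `False`, while watching executor memory, should show whether the growth follows the shuffle setting.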