ExecutionMemoryPool errors releasing more memory than allocated
### Describe the bug
As seen in https://github.com/apache/datafusion-comet/issues/2452, Comet sometimes tries to release memory that is not allocated.
```
25/09/24 21:24:57 WARN ExecutionMemoryPool: Internal error: release called on 917504 bytes but task only has 0 bytes of memory from the off-heap execution pool
25/09/24 21:24:57 WARN ExecutionMemoryPool: Internal error: release called on 839664 bytes but task only has 0 bytes of memory from the off-heap execution pool
25/09/24 21:24:57 WARN ExecutionMemoryPool: Internal error: release called on 917504 bytes but task only has 0 bytes of memory from the off-heap execution pool
```
### Steps to reproduce
No response
### Expected behavior
No response
### Additional context
No response
Here is some more info. I am running with this config:
```
--conf spark.executor.cores=8 \
--conf spark.executor.memory=16G \
--conf spark.memory.offHeap.enabled=true \
--conf spark.memory.offHeap.size=24g \
```
The memory issues start to happen after the following code fails to allocate the requested memory:
```java
// Called by Comet native through JNI.
// Returns the actual amount of memory (in bytes) granted.
public long acquireMemory(long size) {
    long acquired = internal.acquireExecutionMemory(size, nativeMemoryConsumer);
    used.addAndGet(acquired);
    if (acquired < size) {
        // If the memory manager is not able to acquire the requested size, log memory usage
        internal.showMemoryUsage();
    }
    return acquired;
}
```
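The acquire path above records the granted amount in `used`. A release path needs matching bookkeeping; one defensive option is to clamp each release to the tracked balance so a double release cannot drive the accounting (or the Spark pool) negative. A minimal sketch with hypothetical names, not the actual Comet implementation:

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: per-task accounting that clamps releases to the
// tracked balance, so releasing more than was acquired is a no-op for
// the excess rather than pushing the pool negative.
public class ClampedMemoryAccounting {
    private final AtomicLong used = new AtomicLong(0);

    // Record an acquisition of `acquired` bytes (as actually granted by the pool).
    public void recordAcquire(long acquired) {
        used.addAndGet(acquired);
    }

    // Release up to `size` bytes, clamped to what this task actually holds.
    // Returns the number of bytes that should be handed back to the pool.
    public long release(long size) {
        while (true) {
            long current = used.get();
            long toRelease = Math.min(size, current); // never release more than held
            if (used.compareAndSet(current, current - toRelease)) {
                return toRelease;
            }
        }
    }

    public long getUsed() {
        return used.get();
    }
}
```

Clamping alone would silence the `ExecutionMemoryPool` warnings but would also mask the underlying double release, so it is better paired with a warning or assertion when `toRelease < size`.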
The call to `showMemoryUsage` logs messages such as:
```
INFO TaskMemoryManager: Memory used in task 21129
INFO TaskMemoryManager: 3221225472 bytes of memory were used by task 21129 but are not associated with specific consumers
INFO TaskMemoryManager: 18329799984 bytes of memory are used for execution and 1009734 bytes of memory are used for storage
```
This indicates that one task is using 3 GiB (3221225472 bytes), and a total of roughly 18 GB is allocated for execution in the off-heap pool.
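That 3221225472-byte figure is exactly the per-task cap: Spark's `ExecutionMemoryPool` grants each of N concurrently active tasks between 1/(2N) and 1/N of the pool, and with `spark.memory.offHeap.size=24g` and 8 task slots, 24 GiB / 8 = 3 GiB. A quick sketch of that arithmetic (hypothetical helper, not Spark code):

```java
public class PerTaskBounds {
    // Spark's ExecutionMemoryPool grants each of N active tasks between
    // pool/(2N) and pool/N bytes of execution memory.
    public static long minShare(long poolBytes, int activeTasks) {
        return poolBytes / (2L * activeTasks);
    }

    public static long maxShare(long poolBytes, int activeTasks) {
        return poolBytes / activeTasks;
    }
}
```

So the failing task had already hit its fair-share ceiling, which is why `acquireExecutionMemory` could only grant a partial amount.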
This is followed by:
```
WARN ExecutionMemoryPool: Internal error: release called on 821408 bytes but task only has 0 bytes of memory from the off-heap execution pool
ERROR Executor: Exception in task 113.0 in stage 199.3 (TID 21150)
comet-benchmark-derived-from-tpch-08872799821543de-exec-7 spark-kubernetes-executor org.apache.comet.CometNativeException: Additional allocation failed with top memory consumers (across reservations) as:
comet-benchmark-derived-from-tpch-08872799821543de-exec-7 spark-kubernetes-executor ExternalSorter[138]#692(can spill: true) consumed 2.9 GB, peak 3.1 GB,
comet-benchmark-derived-from-tpch-08872799821543de-exec-7 spark-kubernetes-executor ExternalSorterMerge[138]#693(can spill: false) consumed 70.2 MB, peak 70.2 MB,
comet-benchmark-derived-from-tpch-08872799821543de-exec-7 spark-kubernetes-executor SMJStream[138]#694(can spill: false) consumed 0.0 B, peak 0.0 B.
comet-benchmark-derived-from-tpch-08872799821543de-exec-7 spark-kubernetes-executor Error: Failed to acquire 409640 bytes, only got 83912. Reserved: 3222910152
```
More debug logging now shows more clearly what is happening:
```
25/09/26 15:50:30 INFO CometTaskMemoryManager: Task 1561 requested 734496 bytes
25/09/26 15:50:30 INFO core/src/execution/memory_pools/unified_pool.rs: Task 1561 shrink() release_to_spark(734496)
25/09/26 15:50:30 INFO CometTaskMemoryManager: Task 1561 released 734496 bytes
25/09/26 15:50:30 WARN ExecutionMemoryPool: Internal error: release called on 734496 bytes but task only has 0 bytes of memory from the off-heap execution pool
25/09/26 15:50:30 INFO CometTaskMemoryManager: Task 1561 released 734496 bytes
25/09/26 15:50:30 WARN ExecutionMemoryPool: Internal error: release called on 734496 bytes but task only has 0 bytes of memory from the off-heap execution pool
```
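The log sequence above shows the same 734496-byte reservation being released twice for task 1561. In miniature, the resulting accounting failure looks like this (hypothetical sketch mirroring the behavior of Spark's `ExecutionMemoryPool.releaseMemory`, which warns and clamps when a task releases more than it holds; this is not the actual Spark or Comet code):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical per-task pool accounting: the second release of the same
// reservation finds the task balance already at zero and triggers the warning.
public class TaskPool {
    private final Map<Long, Long> memoryForTask = new HashMap<>();

    public void acquire(long taskId, long bytes) {
        memoryForTask.merge(taskId, bytes, Long::sum);
    }

    // Returns "ok" on a balanced release, or a warning string (clamping the
    // balance to zero) when the task releases more than it holds.
    public String release(long taskId, long bytes) {
        long held = memoryForTask.getOrDefault(taskId, 0L);
        if (bytes > held) {
            memoryForTask.put(taskId, 0L);
            return "Internal error: release called on " + bytes
                + " bytes but task only has " + held + " bytes";
        }
        memoryForTask.put(taskId, held - bytes);
        return "ok";
    }
}
```

This suggests the fix is on the Comet side: ensure each native reservation is handed back to Spark exactly once, rather than suppressing the pool's warning.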
Isn't CometTaskMemoryManager calling the memory pool to do the release? Is this simply a matter of updating the memory accounting first and then attempting the actual release (which checks whether there is enough memory to release)?
Either way, tremendous bit of debugging!
DataFusion operators are attempting to release more memory than they acquired. We could simply ignore attempts to drive the pool negative, but I'd rather track down the root cause, since this looks like a bug.