[VL] Failing allocation in DynamicOffHeapSizingMemoryTarget
Backend
VL (Velox)
Bug description
I am trying to set spark.gluten.memory.dynamic.offHeap.sizing.enabled=true, but an OOM exception occurs.
Spark configuration:
spark.executor.memory=4g
spark.executor.memoryOverhead=1g
spark.gluten.memory.dynamic.offHeap.sizing.enabled=true
spark.memory.offHeap.enabled=true
and the web UI shows:
spark.gluten.memory.conservative.task.offHeap.size.in.bytes=597059174
spark.gluten.memory.offHeap.size.in.bytes=2388236697
spark.gluten.memory.task.offHeap.size.in.bytes=597059174
spark.memory.offHeap.size=2388236697
And I got the OOM exception:
Reason: Operator::getOutput failed for [operator: ValueStream, plan node ID: 0]: Error during calling Java code from native code: org.apache.gluten.memory.memtarget.ThrowOnOomMemoryTarget$OutOfMemoryException: Not enough spark off-heap execution memory. Acquired: 8.0 MiB, granted: 0.0 B. Try tweaking config option spark.memory.offHeap.size to get larger space to run this application (if spark.gluten.memory.dynamic.offHeap.sizing.enabled is not enabled).
Current config settings:
spark.gluten.memory.offHeap.size.in.bytes=3.4 GiB
spark.gluten.memory.task.offHeap.size.in.bytes=876.6 MiB
spark.gluten.memory.conservative.task.offHeap.size.in.bytes=876.6 MiB
spark.memory.offHeap.enabled=true
spark.gluten.memory.dynamic.offHeap.sizing.enabled=true
Memory consumer stats:
Task.52: Current used bytes: 104.0 MiB, peak bytes: N/A
\- Gluten.Tree.0: Current used bytes: 104.0 MiB, peak bytes: 112.0 MiB
\- root.0: Current used bytes: 104.0 MiB, peak bytes: 112.0 MiB
+- CelebornShuffleWriter.0: Current used bytes: 48.0 MiB, peak bytes: 48.0 MiB
| \- single: Current used bytes: 48.0 MiB, peak bytes: 48.0 MiB
| +- gluten::MemoryAllocator: Current used bytes: 28.8 MiB, peak bytes: 29.0 MiB
| \- root: Current used bytes: 4.2 MiB, peak bytes: 15.0 MiB
| \- default_leaf: Current used bytes: 4.2 MiB, peak bytes: 14.1 MiB
It may be caused by:
24/10/18 17:16:00 WARN org.apache.gluten.memory.memtarget.DynamicOffHeapSizingMemoryTarget: "Failing allocation as unified memory is OOM. Used Off-heap: 406847480, Used On-Heap: 2021017784, Free On-heap: 1796847432, Total On-heap: 3817865216, Max On-heap: 2388236697, Allocation: 8388608."
24/10/18 17:16:00 INFO org.apache.spark.memory.TaskMemoryManager: "Memory used in task 11"
24/10/18 17:16:00 INFO org.apache.spark.memory.TaskMemoryManager: "Acquired by org.apache.gluten.memory.memtarget.spark.TreeMemoryConsumer@182a8cbe: 104.0 MiB"
24/10/18 17:16:00 INFO org.apache.spark.memory.TaskMemoryManager: "0 bytes of memory were used by task 11 but are not associated with specific consumers"
24/10/18 17:16:00 INFO org.apache.spark.memory.TaskMemoryManager: "406847480 bytes of memory are used for execution and 1129714 bytes of memory are used for storage"
As we can see above, the used off-heap memory is only 406847480 bytes (388 MiB), while my off-heap configuration is 2.2 GiB.
Why is the OOM exception thrown?
Spark version
Spark-3.2.x
Spark configurations
No response
System information
No response
Relevant logs
No response
https://github.com/apache/incubator-gluten/blob/main/gluten-core/src/main/java/org/apache/gluten/memory/memtarget/DynamicOffHeapSizingMemoryTarget.java#L51
I have one question here:
if (size + usedOffHeapBytesNow + usedOnHeapBytes > MAX_MEMORY_IN_BYTES) {
...
}
Why do we need to include usedOnHeapBytes in the comparison against MAX_MEMORY_IN_BYTES (which is offHeapMemorySize)?
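For context, here is a minimal sketch of that check (paraphrased, not the exact source; only MAX_MEMORY_IN_BYTES is a real name from the linked file), plugged with the byte values from the WARN log in the bug description:

```java
// Simplified sketch of the allocation check in
// DynamicOffHeapSizingMemoryTarget (paraphrased, not the actual source).
public class UnifiedMemoryCheck {
    // After PR #5439, MAX_MEMORY_IN_BYTES is initialized from
    // GlutenConfig.getConf().offHeapMemorySize().
    static final long MAX_MEMORY_IN_BYTES = 2388236697L; // value from the web UI above

    static boolean wouldFail(long size, long usedOffHeapBytesNow, long usedOnHeapBytes) {
        // The allocation is rejected when off-heap usage plus on-heap usage
        // plus the new request exceeds the single off-heap-sized cap.
        return size + usedOffHeapBytesNow + usedOnHeapBytes > MAX_MEMORY_IN_BYTES;
    }

    public static void main(String[] args) {
        long size = 8388608L;          // 8 MiB allocation from the error message
        long usedOffHeap = 406847480L; // ~388 MiB, from the WARN log
        long usedOnHeap = 2021017784L; // ~1927 MiB, from the WARN log
        System.out.println(wouldFail(size, usedOffHeap, usedOnHeap)); // prints "true"
    }
}
```

So even though off-heap usage is far below the cap, the large on-heap usage pushes the sum over MAX_MEMORY_IN_BYTES and the 8 MiB request is rejected.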
PTAL @supermem613 @zhztheplayer
When spark.gluten.memory.dynamic.offHeap.sizing.enabled=true, it will not consider the configured off-heap size: https://github.com/apache/incubator-gluten/blob/main/gluten-core/src/main/scala/org/apache/gluten/GlutenPlugin.scala#L167
Perhaps a tiny typo introduced a bug while PR #5439 was being iterated on? cc @supermem613 Would you like to help confirm? Thanks.
A commit in PR #5439 that looks related: https://github.com/apache/incubator-gluten/pull/5439/commits/8c7cfa59bf9f8c015e16c225d3d14c7801a977dd
@zhztheplayer what typo do you mean?
Found this line
https://github.com/apache/incubator-gluten/commit/8c7cfa59bf9f8c015e16c225d3d14c7801a977dd#diff-b6234f870afb82ba142a4f4e3e358ddb30dc4d5f00b3b9f5b4e9afddc9b4a761R31
GlutenConfig.getConf().onHeapMemorySize() was changed to GlutenConfig.getConf().offHeapMemorySize(), which doesn't look intentional to me, but I am not sure.
Why are you using the option spark.gluten.memory.dynamic.offHeap.sizing.enabled? @wenwj0
After reading more code I guess this was intentional... The code intends to set an off-heap size and then use that size to cover off-heap + on-heap. The PR initially used on-heap at the time I was reviewing it.
@wenwj0
As we can see above, the used off-heap memory is only 406847480(388MiB), while my off-heap configuration is 2.2GiB.
I believe your case is because on-heap occupies the remaining 2.2 GiB - 388 MiB, which could be considered by-design of the feature. But it would be great if you can share more background information about your use case. From my impression only a few users are using the option.
Why are you using the option spark.gluten.memory.dynamic.offHeap.sizing.enabled? @wenwj0
In our scenario, we have various configurations for different executor memory. I try to use this property because it seems that I can use the existing configurations without the need to set extra off-heap memory.
I believe your case is because on-heap occupies the remaining 2.2 GiB - 388 MiB, which could be considered by-design of the feature. But it would be great if you can share more background information about your use case. From my impression only a few users are using the option.
I agree with you. MAX_MEMORY_IN_BYTES refers to off-heap memory, hence (size + usedOffHeapBytesNow + usedOnHeapBytes > MAX_MEMORY_IN_BYTES) becomes 8 MiB + 388 MiB + 1927 MiB = 2323 MiB > MAX_MEMORY_IN_BYTES (≈2.2 GiB).
If MAX_MEMORY_IN_BYTES were the on-heap memory, or on-heap + off-heap memory, maybe it would work. @zhztheplayer
on-heap + off-heap memory
It sounds OK to me to change the current approach to on-heap + off-heap, which may also simplify the relevant configuration-setting code.
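A hypothetical sketch of the check under that proposal (illustrative only, not an actual patch; the method and parameter names are made up), again using the numbers from this issue:

```java
// Hypothetical variant: cap unified usage at on-heap max + off-heap max,
// rather than at the off-heap size alone. Names are illustrative.
public class UnifiedCapSketch {
    static boolean wouldFail(long size, long usedOffHeap, long usedOnHeap,
                             long onHeapMax, long offHeapMax) {
        long unifiedCap = onHeapMax + offHeapMax; // combined budget
        return size + usedOffHeap + usedOnHeap > unifiedCap;
    }

    public static void main(String[] args) {
        // With this issue's numbers (the WARN log reports Max On-heap equal
        // to the off-heap size, 2388236697 bytes), the same 8 MiB request
        // would fit under a combined cap.
        System.out.println(wouldFail(8388608L, 406847480L, 2021017784L,
                                     2388236697L, 2388236697L)); // prints "false"
    }
}
```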
cc @supermem613 @zhli1142015
Sorry, I was offline for the last 10 days. @zhztheplayer is correct: this feature's goal is that one can configure the executor memory (on-heap + off-heap) and allow usage to fluctuate between the two as needed (e.g. in the case of fallback).