[VL] Failing allocation in DynamicOffHeapSizingMemoryTarget

Open wenwj0 opened this issue 1 year ago • 10 comments

Backend

VL (Velox)

Bug description

I am trying to set spark.gluten.memory.dynamic.offHeap.sizing.enabled=true, but an OOM exception occurs. Spark configuration:

spark.executor.memory=4g;
spark.executor.memoryOverhead=1G;
spark.gluten.memory.dynamic.offHeap.sizing.enabled=true
spark.memory.offHeap.enabled=true

and web ui shows:

spark.gluten.memory.conservative.task.offHeap.size.in.bytes=597059174
spark.gluten.memory.offHeap.size.in.bytes=2388236697
spark.gluten.memory.task.offHeap.size.in.bytes=597059174
spark.memory.offHeap.size=2388236697

And I got the OOM exception:

Reason: Operator::getOutput failed for [operator: ValueStream, plan node ID: 0]: Error during calling Java code from native code: org.apache.gluten.memory.memtarget.ThrowOnOomMemoryTarget$OutOfMemoryException: Not enough spark off-heap execution memory. Acquired: 8.0 MiB, granted: 0.0 B. Try tweaking config option spark.memory.offHeap.size to get larger space to run this application (if spark.gluten.memory.dynamic.offHeap.sizing.enabled is not enabled). 
Current config settings: 
    spark.gluten.memory.offHeap.size.in.bytes=3.4 GiB
    spark.gluten.memory.task.offHeap.size.in.bytes=876.6 MiB
    spark.gluten.memory.conservative.task.offHeap.size.in.bytes=876.6 MiB
    spark.memory.offHeap.enabled=true
    spark.gluten.memory.dynamic.offHeap.sizing.enabled=true
Memory consumer stats: 
    Task.52:                                             Current used bytes: 104.0 MiB, peak bytes:        N/A
    \- Gluten.Tree.0:                                    Current used bytes: 104.0 MiB, peak bytes:  112.0 MiB
       \- root.0:                                        Current used bytes: 104.0 MiB, peak bytes:  112.0 MiB
          +- CelebornShuffleWriter.0:                    Current used bytes:  48.0 MiB, peak bytes:   48.0 MiB
          |  \- single:                                  Current used bytes:  48.0 MiB, peak bytes:   48.0 MiB
          |     +- gluten::MemoryAllocator:              Current used bytes:  28.8 MiB, peak bytes:   29.0 MiB
          |     \- root:                                 Current used bytes:   4.2 MiB, peak bytes:   15.0 MiB
          |        \- default_leaf:                      Current used bytes:   4.2 MiB, peak bytes:   14.1 MiB

It may be caused by:

24/10/18 17:16:00 WARN org.apache.gluten.memory.memtarget.DynamicOffHeapSizingMemoryTarget: "Failing allocation as unified memory is OOM. Used Off-heap: 406847480, Used On-Heap: 2021017784, Free On-heap: 1796847432, Total On-heap: 3817865216, Max On-heap: 2388236697, Allocation: 8388608."
24/10/18 17:16:00 INFO org.apache.spark.memory.TaskMemoryManager: "Memory used in task 11"
24/10/18 17:16:00 INFO org.apache.spark.memory.TaskMemoryManager: "Acquired by org.apache.gluten.memory.memtarget.spark.TreeMemoryConsumer@182a8cbe: 104.0 MiB"
24/10/18 17:16:00 INFO org.apache.spark.memory.TaskMemoryManager: "0 bytes of memory were used by task 11 but are not associated with specific consumers"
24/10/18 17:16:00 INFO org.apache.spark.memory.TaskMemoryManager: "406847480 bytes of memory are used for execution and 1129714 bytes of memory are used for storage"

As we can see above, the used off-heap memory is only 406847480 bytes (388 MiB), while my off-heap configuration is 2.2 GiB.
Why is an OOM exception thrown?

Spark version

Spark-3.2.x

Spark configurations

No response

System information

No response

Relevant logs

No response

wenwj0 avatar Oct 18 '24 12:10 wenwj0

https://github.com/apache/incubator-gluten/blob/main/gluten-core/src/main/java/org/apache/gluten/memory/memtarget/DynamicOffHeapSizingMemoryTarget.java#L51

I have one question here:

if (size + usedOffHeapBytesNow + usedOnHeapBytes > MAX_MEMORY_IN_BYTES) {
    ...
}

Why do we need to include usedOnHeapBytes when comparing against MAX_MEMORY_IN_BYTES (offHeapMemorySize)?
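
For reference, the comparison quoted above can be reproduced in isolation with the numbers from the WARN log line in the issue description. This is only a minimal sketch: the class and method names (`UnifiedBudgetCheck`, `exceedsBudget`) are hypothetical, and only the inequality itself comes from the linked source.

```java
// Hypothetical stand-alone reproduction of the comparison quoted above.
// Only the inequality is taken from DynamicOffHeapSizingMemoryTarget; the
// class, method, and parameter names are made up for illustration.
final class UnifiedBudgetCheck {

    // Returns true when the requested allocation would push the combined
    // on-heap + off-heap usage above the budget.
    static boolean exceedsBudget(long size, long usedOffHeapBytes,
                                 long usedOnHeapBytes, long maxMemoryInBytes) {
        return size + usedOffHeapBytes + usedOnHeapBytes > maxMemoryInBytes;
    }

    public static void main(String[] args) {
        // Values copied from the WARN log line in the issue description.
        long allocation = 8_388_608L;      // "Allocation"
        long usedOffHeap = 406_847_480L;   // "Used Off-heap"
        long usedOnHeap = 2_021_017_784L;  // "Used On-Heap"
        long maxMemory = 2_388_236_697L;   // spark.gluten.memory.offHeap.size.in.bytes

        long demand = allocation + usedOffHeap + usedOnHeap;
        System.out.println("demand = " + demand + " bytes, budget = " + maxMemory + " bytes");
        System.out.println("rejected = "
                + exceedsBudget(allocation, usedOffHeap, usedOnHeap, maxMemory));
    }
}
```

With these inputs the demand is 2436253872 bytes against a budget of 2388236697 bytes, so the 8 MiB request is rejected even though off-heap usage alone is far below the budget.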

wenwj0 avatar Oct 18 '24 12:10 wenwj0

PTAL @supermem613 @zhztheplayer

wenwj0 avatar Oct 18 '24 12:10 wenwj0

When spark.gluten.memory.dynamic.offHeap.sizing.enabled=true, it will not consider the configured off-heap size: https://github.com/apache/incubator-gluten/blob/main/gluten-core/src/main/scala/org/apache/gluten/GlutenPlugin.scala#L167

acvictor avatar Oct 21 '24 05:10 acvictor

> When spark.gluten.memory.dynamic.offHeap.sizing.enabled=true, it will not consider the configured off-heap size: https://github.com/apache/incubator-gluten/blob/main/gluten-core/src/main/scala/org/apache/gluten/GlutenPlugin.scala#L167

Perhaps a small typo introduced a bug while PR #5439 was being iterated on? cc @supermem613 Would you like to help confirm? Thanks.

A commit in PR #5439 that looks related: https://github.com/apache/incubator-gluten/pull/5439/commits/8c7cfa59bf9f8c015e16c225d3d14c7801a977dd

zhztheplayer avatar Oct 21 '24 05:10 zhztheplayer

> > When spark.gluten.memory.dynamic.offHeap.sizing.enabled=true, it will not consider the configured off-heap size: https://github.com/apache/incubator-gluten/blob/main/gluten-core/src/main/scala/org/apache/gluten/GlutenPlugin.scala#L167

> Perhaps a small typo introduced a bug while PR #5439 was being iterated on? cc @supermem613 Would you like to help confirm? Thanks.

> A commit in PR #5439 that looks related: 8c7cfa5

@zhztheplayer what typo do you mean?

acvictor avatar Oct 21 '24 06:10 acvictor

> > > When spark.gluten.memory.dynamic.offHeap.sizing.enabled=true, it will not consider the configured off-heap size: https://github.com/apache/incubator-gluten/blob/main/gluten-core/src/main/scala/org/apache/gluten/GlutenPlugin.scala#L167

> > Perhaps a small typo introduced a bug while PR #5439 was being iterated on? cc @supermem613 Would you like to help confirm? Thanks. A commit in PR #5439 that looks related: 8c7cfa5

> @zhztheplayer what typo do you mean?

I found this line:

https://github.com/apache/incubator-gluten/commit/8c7cfa59bf9f8c015e16c225d3d14c7801a977dd#diff-b6234f870afb82ba142a4f4e3e358ddb30dc4d5f00b3b9f5b4e9afddc9b4a761R31

GlutenConfig.getConf().onHeapMemorySize() was changed to GlutenConfig.getConf().offHeapMemorySize(), which doesn't look intentional to me, but I am not sure.

zhztheplayer avatar Oct 21 '24 06:10 zhztheplayer

Why are you using the option spark.gluten.memory.dynamic.offHeap.sizing.enabled? @wenwj0

zhztheplayer avatar Oct 21 '24 06:10 zhztheplayer

> GlutenConfig.getConf().onHeapMemorySize() was changed to GlutenConfig.getConf().offHeapMemorySize(), which doesn't look intentional to me, but I am not sure.

After reading more code, I guess this was intentional... The code tends to set an off-heap size and then use that size to cover off-heap + on-heap. The PR initially used on-heap at the time I was reviewing it.

@wenwj0

> As we can see above, the used off-heap memory is only 406847480 bytes (388 MiB), while my off-heap configuration is 2.2 GiB.

I believe your case arises because on-heap usage occupies the remaining 2.2 GiB - 388 MiB, which could be considered by design for this feature. But it would be great if you could share more background information about your use case. From my impression, only a few users are using this option.

zhztheplayer avatar Oct 21 '24 06:10 zhztheplayer

> Why are you using the option spark.gluten.memory.dynamic.offHeap.sizing.enabled? @wenwj0

In our scenario, we have various configurations for different executor memory sizes. I tried this property because it seemed to let me reuse the existing configurations without setting extra off-heap memory.

> I believe your case arises because on-heap usage occupies the remaining 2.2 GiB - 388 MiB, which could be considered by design for this feature. But it would be great if you could share more background information about your use case. From my impression, only a few users are using this option.

I agree with you. MAX_MEMORY_IN_BYTES refers to off-heap memory, hence (size + usedOffHeapBytesNow + usedOnHeapBytes > MAX_MEMORY_IN_BYTES) => 8 MiB + 388 MiB + 1927 MiB = 2323 MiB > MAX_MEMORY_IN_BYTES (2388236697 bytes, about 2.2 GiB).

If MAX_MEMORY_IN_BYTES were the on-heap memory or the on-heap + off-heap memory, maybe it would work. @zhztheplayer
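
To make the two interpretations concrete, here is a small sketch that checks the same demand against the budget as currently defined and against an on-heap + off-heap budget. The combined figure is an assumption built from spark.executor.memory=4g plus the derived off-heap size; how Gluten would actually derive such a budget is not specified in this thread, and the class and method names are hypothetical.

```java
// Hypothetical comparison of two ways to define the budget. The names are
// illustrative only and do not exist in Gluten.
final class BudgetComparison {

    // True when the total demand fits within the given budget.
    static boolean fits(long demandBytes, long budgetBytes) {
        return demandBytes <= budgetBytes;
    }

    public static void main(String[] args) {
        // Demand at the time of the failure, from the WARN log line:
        // allocation + used off-heap + used on-heap.
        long demand = 8_388_608L + 406_847_480L + 2_021_017_784L;

        // Budget as currently used: the derived off-heap size only.
        long offHeapOnlyBudget = 2_388_236_697L;

        // ASSUMPTION: a combined budget of the executor heap (4g) plus the
        // derived off-heap size. The real derivation may differ.
        long combinedBudget = 4L * 1024 * 1024 * 1024 + offHeapOnlyBudget;

        System.out.println("fits off-heap-only budget: " + fits(demand, offHeapOnlyBudget));
        System.out.println("fits combined budget:      " + fits(demand, combinedBudget));
    }
}
```

Under these assumptions the request fails against the off-heap-only budget but fits once the on-heap share is counted in the budget as well.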

wenwj0 avatar Oct 21 '24 09:10 wenwj0

> onheap+offheap memory

It sounds OK to me to change the current approach to on-heap + off-heap, which may also simplify the relevant configuration-setting code.

cc @supermem613 @zhli1142015

zhztheplayer avatar Oct 22 '24 05:10 zhztheplayer

Sorry, I was offline for the last 10 days. @zhztheplayer is correct: this feature's goal is to let one configure the executor memory (on-heap + off-heap) and allow the usage to fluctuate between the two as needed (e.g. in the case of fallback).

supermem613 avatar Oct 28 '24 13:10 supermem613