[VL] system hang during spill of hashagg

Open FelixYBW opened this issue 8 months ago • 0 comments

Backend

VL (Velox)

Bug description

Error message:

W20240621 15:14:01.929342 114227 Operator.cpp:641] Can't reclaim from memory pool op.5.0.0.Aggregation which is under non-reclaimable section, memory usage: 231.99MB, reservation: 232.00MB
W20240621 15:14:01.930936 101314 Operator.cpp:641] Can't reclaim from memory pool op.5.0.0.Aggregation which is under non-reclaimable section, memory usage: 128.00MB, reservation: 128.00MB
W20240621 15:14:01.931005 101314 Operator.cpp:641] Can't reclaim from memory pool op.5.0.0.Aggregation which is under non-reclaimable section, memory usage: 128.00MB, reservation: 128.00MB
W20240621 15:14:01.934880 114227 HashAggregation.cpp:408] Can't reclaim from aggregation operator which has spilled and is under output processing, pool op.5.0.0.Aggregation, memory usage: 236.76MB, reservation: 240.00MB
24/06/21 15:14:01 ERROR [Executor task launch worker for task 2259.0 in stage 2.0 (TID 14859)] nmm.ManagedReservationListener: Error reserving memory from target
java.lang.NullPointerException
	at java.util.Objects.requireNonNull(Objects.java:203)
	at java.util.Optional.<init>(Optional.java:96)
	at java.util.Optional.of(Optional.java:108)
	at org.apache.gluten.memory.nmm.NativeMemoryManagers$1.spill(NativeMemoryManagers.java:79)
	at org.apache.gluten.memory.memtarget.Spillers$WithMinSpillSize.spill(Spillers.java:57)
	at org.apache.gluten.memory.memtarget.TreeMemoryTargets.spillTree(TreeMemoryTargets.java:90)
	at org.apache.gluten.memory.memtarget.TreeMemoryTargets.spillTree(TreeMemoryTargets.java:61)
	at org.apache.gluten.memory.memtarget.TreeMemoryTargets.spillTree(TreeMemoryTargets.java:80)
	at org.apache.gluten.memory.memtarget.TreeMemoryTargets.spillTree(TreeMemoryTargets.java:61)
	at org.apache.gluten.memory.memtarget.TreeMemoryTargets.spillTree(TreeMemoryTargets.java:80)
	at org.apache.gluten.memory.memtarget.TreeMemoryTargets.spillTree(TreeMemoryTargets.java:61)
	at org.apache.gluten.memory.memtarget.spark.TreeMemoryConsumer.spill(TreeMemoryConsumer.java:120)
	at org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:213)
	at org.apache.spark.memory.MemoryConsumer.acquireMemory(MemoryConsumer.java:136)
	at org.apache.gluten.memory.memtarget.spark.TreeMemoryConsumer.borrow(TreeMemoryConsumer.java:70)
	at org.apache.gluten.memory.memtarget.TreeMemoryTargets$Node.borrow0(TreeMemoryTargets.java:137)
	at org.apache.gluten.memory.memtarget.TreeMemoryTargets$Node.borrow(TreeMemoryTargets.java:129)
	at org.apache.gluten.memory.memtarget.TreeMemoryTargets$Node.borrow0(TreeMemoryTargets.java:137)
	at org.apache.gluten.memory.memtarget.TreeMemoryTargets$Node.borrow(TreeMemoryTargets.java:129)
	at org.apache.gluten.memory.memtarget.OverAcquire.borrow(OverAcquire.java:56)
	at org.apache.gluten.memory.memtarget.ThrowOnOomMemoryTarget.borrow(ThrowOnOomMemoryTarget.java:35)
	at org.apache.gluten.memory.nmm.ManagedReservationListener.reserve(ManagedReservationListener.java:43)
	at org.apache.gluten.memory.nmm.NativeMemoryManager.create(Native Method)
	at org.apache.gluten.memory.nmm.NativeMemoryManager.create(NativeMemoryManager.java:49)
	at org.apache.gluten.memory.nmm.NativeMemoryManagers.createNativeMemoryManager(NativeMemoryManagers.java:155)
	at org.apache.gluten.memory.nmm.NativeMemoryManagers.create(NativeMemoryManagers.java:56)
	at org.apache.spark.shuffle.ColumnarShuffleWriter.internalWrite(ColumnarShuffleWriter.scala:159)
	at org.apache.spark.shuffle.ColumnarShuffleWriter.write(ColumnarShuffleWriter.scala:242)
	at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
	at org.apache.spark.scheduler.Task.run(Task.scala:131)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1471)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)
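
For readers following the trace: the `NullPointerException` is thrown by `Optional.of` receiving a null value inside a spill callback (`NativeMemoryManagers$1.spill`), and that callback is reached while `NativeMemoryManager.create` is still on the stack. The sketch below only illustrates this failure pattern and one null-tolerant alternative. It is not the Gluten source: the class `SpillerLookupSketch`, the field `pendingManager`, and both methods are hypothetical names, and the assumption that the callback reads a reference that has not been set yet is a guess based on the trace alone.

```java
import java.util.Optional;

public class SpillerLookupSketch {
    // Hypothetical stand-in for the reference the spill callback reads.
    // Assumption: it is still null while the manager is being created.
    static Object pendingManager = null;

    // Same pattern as the top frames of the trace: Optional.of rejects null.
    static Optional<Object> strictLookup() {
        return Optional.of(pendingManager); // throws NullPointerException when null
    }

    // Null-tolerant variant: an empty Optional means "nothing to spill yet".
    static long tolerantSpill(long requestedBytes) {
        return Optional.ofNullable(pendingManager)
                .map(m -> requestedBytes) // placeholder: a real manager would report bytes freed
                .orElse(0L);
    }

    public static void main(String[] args) {
        try {
            strictLookup();
        } catch (NullPointerException e) {
            System.out.println("strict lookup threw NullPointerException");
        }
        System.out.println("tolerant spill freed " + tolerantSpill(1024) + " bytes");
    }
}
```

Whether returning 0 bytes from the callback is the right behavior here depends on how the memory targets treat a spiller that frees nothing. The sketch shows only that `Optional.ofNullable` avoids the exception.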

Spark version

None

Spark configurations

No response

System information

No response

Relevant logs

No response

FelixYBW · Jun 21 '24 18:06