[VL] One task writing too many Hive partitions causes OOM
Backend
VL (Velox)
Bug description
It seems that a single task writing too many Hive partitions causes a "Not enough spark off-heap execution memory" error.
Spark version
None
Spark configurations
No response
System information
No response
Relevant logs
24/10/23 10:05:16 ERROR Utils: Aborting task
org.apache.gluten.exception.GlutenException: org.apache.gluten.exception.GlutenException: Exception: VeloxRuntimeError
Error Source: RUNTIME
Error Code: INVALID_STATE
Reason: Operator::addInput failed for [operator: TableWrite, plan node ID: 2]: Error during calling Java code from native code: org.apache.gluten.memory.memtarget.ThrowOnOomMemoryTarget$OutOfMemoryException: Not enough spark off-heap execution memory. Acquired: 8.0 MiB, granted: 7.0 MiB. Try tweaking config option spark.memory.offHeap.size to get larger space to run this application (if spark.gluten.memory.dynamic.offHeap.sizing.enabled is not enabled).
Current config settings:
spark.gluten.memory.offHeap.size.in.bytes=2.0 GiB
spark.gluten.memory.task.offHeap.size.in.bytes=2.0 GiB
spark.gluten.memory.conservative.task.offHeap.size.in.bytes=1024.0 MiB
spark.memory.offHeap.enabled=true
spark.gluten.memory.dynamic.offHeap.sizing.enabled=false
Memory consumer stats:
Task.6165: Current used bytes: 2041.0 MiB, peak bytes: N/A
\- Gluten.Tree.7: Current used bytes: 2041.0 MiB, peak bytes: 2.0 GiB
\- root.7: Current used bytes: 2041.0 MiB, peak bytes: 2.0 GiB
+- WholeStageIterator.7: Current used bytes: 2016.0 MiB, peak bytes: 2023.0 MiB
| \- single: Current used bytes: 2016.0 MiB, peak bytes: 2016.0 MiB
| +- root: Current used bytes: 1867.6 MiB, peak bytes: 2016.0 MiB
| | +- task.Gluten_Stage_135_TID_6165_VTID_7: Current used bytes: 1867.6 MiB, peak bytes: 2016.0 MiB
| | | +- node.2: Current used bytes: 1867.6 MiB, peak bytes: 2016.0 MiB
| | | | +- op.2.0.0.TableWrite.test-hive: Current used bytes: 1862.5 MiB, peak bytes: 2010.0 MiB
| | | | | +- op.2.0.0.TableWrite.test-hive.part[34]: Current used bytes: 27.1 MiB, peak bytes: 28.0 MiB
| | | | | | +- writer_node_16495590765074372487: Current used bytes: 27.1 MiB, peak bytes: 28.0 MiB
| | | | | | | \- .general: Current used bytes: 27.1 MiB, peak bytes: 27.2 MiB
| | | | | | \- op.2.0.0.TableWrite.test-hive.part[34].sink: Current used bytes: 0.0 B, peak bytes: 0.0 B
| | | | | +- op.2.0.0.TableWrite.test-hive.part[55]: Current used bytes: 24.2 MiB, peak bytes: 28.0 MiB
| | | | | | +- writer_node_3034760001384861262: Current used bytes: 24.2 MiB, peak bytes: 28.0 MiB
| | | | | | | \- .general: Current used bytes: 24.2 MiB, peak bytes: 24.3 MiB
| | | | | | \- op.2.0.0.TableWrite.test-hive.part[55].sink: Current used bytes: 0.0 B, peak bytes: 0.0 B
| | | | | +- op.2.0.0.TableWrite.test-hive.part[126]: Current used bytes: 22.9 MiB, peak bytes: 24.0 MiB
| | | | | | +- writer_node_10989164195300702660: Current used bytes: 22.9 MiB, peak bytes: 24.0 MiB
| | | | | | | \- .general: Current used bytes: 22.9 MiB, peak bytes: 22.9 MiB
| | | | | | \- op.2.0.0.TableWrite.test-hive.part[126].sink: Current used bytes: 0.0 B, peak bytes: 0.0 B
| | | | | +- op.2.0.0.TableWrite.test-hive.part[190]: Current used bytes: 22.9 MiB, peak bytes: 24.0 MiB
| | | | | | +- writer_node_2462400773501287974: Current used bytes: 22.9 MiB, peak bytes: 24.0 MiB
| | | | | | | \- .general: Current used bytes: 22.9 MiB, peak bytes: 22.9 MiB
| | | | | | \- op.2.0.0.TableWrite.test-hive.part[190].sink: Current used bytes: 0.0 B, peak bytes: 0.0 B
| | | | | +- op.2.0.0.TableWrite.test-hive.part[15]: Current used bytes: 22.6 MiB, peak bytes: 24.0 MiB
| | | | | | +- writer_node_10962424750642795316: Current used bytes: 22.6 MiB, peak bytes: 24.0 MiB
| | | | | | | \- .general: Current used bytes: 22.6 MiB, peak bytes: 22.7 MiB
| | | | | | \- op.2.0.0.TableWrite.test-hive.part[15].sink: Current used bytes: 0.0 B, peak bytes: 0.0 B
| | | | | +- op.2.0.0.TableWrite.test-hive.part[110]: Current used bytes: 22.5 MiB, peak bytes: 24.0 MiB
| | | | | | +- writer_node_17871452156508439970: Current used bytes: 22.5 MiB, peak bytes: 24.0 MiB
| | | | | | | \- .general: Current used bytes: 22.5 MiB, peak bytes: 22.5 MiB
| | | | | | \- op.2.0.0.TableWrite.test-hive.part[110].sink: Current used bytes: 0.0 B, peak bytes: 0.0 B
| | | | | +- op.2.0.0.TableWrite.test-hive.part[63]: Current used bytes: 21.9 MiB, peak bytes: 24.0 MiB
| | | | | | +- writer_node_11605731064327424053: Current used bytes: 21.9 MiB, peak bytes: 24.0 MiB
| | | | | | | \- .general: Current used bytes: 21.9 MiB, peak bytes: 22.0 MiB
| | | | | | \- op.2.0.0.TableWrite.test-hive.part[63].sink: Current used bytes: 0.0 B, peak bytes: 0.0 B
| | | | | +- op.2.0.0.TableWrite.test-hive.part[174]: Current used bytes: 20.2 MiB, peak bytes: 24.0 MiB
| | | | | | +- writer_node_1465030665100865752: Current used bytes: 20.2 MiB, peak bytes: 24.0 MiB
| | | | | | | \- .general: Current used bytes: 20.2 MiB, peak bytes: 20.2 MiB
| | | | | | \- op.2.0.0.TableWrite.test-hive.part[174].sink: Current used bytes: 0.0 B, peak bytes: 0.0 B
| | | | | +- op.2.0.0.TableWrite.test-hive.part[172]: Current used bytes: 18.9 MiB, peak bytes: 20.0 MiB
| | | | | | +- writer_node_7515311972563278385: Current used bytes: 18.9 MiB, peak bytes: 20.0 MiB
| | | | | | | \- .general: Current used bytes: 18.9 MiB, peak bytes: 18.9 MiB
| | | | | | \- op.2.0.0.TableWrite.test-hive.part[172].sink: Current used bytes: 0.0 B, peak bytes: 0.0 B
| | | | | +- op.2.0.0.TableWrite.test-hive.part[113]: Current used bytes: 18.5 MiB, peak bytes: 20.0 MiB
| | | | | | +- writer_node_13120283611592924997: Current used bytes: 18.5 MiB, peak bytes: 20.0 MiB
| | | | | | | \- .general: Current used bytes: 18.5 MiB, peak bytes: 18.6 MiB
| | | | | | \- op.2.0.0.TableWrite.test-hive.part[113].sink: Current used bytes: 0.0 B, peak bytes: 0.0 B
| | | | | +- op.2.0.0.TableWrite.test-hive.part[42]: Current used bytes: 18.5 MiB, peak bytes: 20.0 MiB
| | | | | | +- writer_node_9372597259039499552: Current used bytes: 18.5 MiB, peak bytes: 20.0 MiB
| | | | | | | \- .general: Current used bytes: 18.5 MiB, peak bytes: 18.5 MiB
| | | | | | \- op.2.0.0.TableWrite.test-hive.part[42].sink: Current used bytes: 0.0 B, peak bytes: 0.0 B
| | | | | +- op.2.0.0.TableWrite.test-hive.part[131]: Current used bytes: 18.2 MiB, peak bytes: 20.0 MiB
| | | | | | +- writer_node_4488967983352450288: Current used bytes: 18.2 MiB, peak bytes: 20.0 MiB
| | | | | | | \- .general: Current used bytes: 18.2 MiB, peak bytes: 18.2 MiB
| | | | | | \- op.2.0.0.TableWrite.test-hive.part[131].sink: Current used bytes: 0.0 B, peak bytes: 0.0 B
| | | | | +- op.2.0.0.TableWrite.test-hive.part[162]: Current used bytes: 15.4 MiB, peak bytes: 16.0 MiB
| | | | | | +- writer_node_17438843217648573119: Current used bytes: 15.4 MiB, peak bytes: 16.0 MiB
| | | | | | | \- .general: Current used bytes: 15.4 MiB, peak bytes: 15.5 MiB
| | | | | | \- op.2.0.0.TableWrite.test-hive.part[162].sink: Current used bytes: 0.0 B, peak bytes: 0.0 B
| | | | | +- op.2.0.0.TableWrite.test-hive.part[65]: Current used bytes: 15.1 MiB, peak bytes: 16.0 MiB
| | | | | | +- writer_node_15452059679315520385: Current used bytes: 15.1 MiB, peak bytes: 16.0 MiB
| | | | | | | \- .general: Current used bytes: 15.1 MiB, peak bytes: 15.1 MiB
| | | | | | \- op.2.0.0.TableWrite.test-hive.part[65].sink: Current used bytes: 0.0 B, peak bytes: 0.0 B
| | | | | +- op.2.0.0.TableWrite.test-hive.part[210]: Current used bytes: 15.0 MiB, peak bytes: 16.0 MiB
| | | | | | +- writer_node_9266752813434529133: Current used bytes: 15.0 MiB, peak bytes: 16.0 MiB
| | | | | | | \- .general: Current used bytes: 15.0 MiB, peak bytes: 15.1 MiB
| | | | | | \- op.2.0.0.TableWrite.test-hive.part[210].sink: Current used bytes: 0.0 B, peak bytes: 0.0 B
| | | | | +- op.2.0.0.TableWrite.test-hive.part[67]: Current used bytes: 14.4 MiB, peak bytes: 15.0 MiB
| | | | | | +- writer_node_5157661972043366947: Current used bytes: 14.4 MiB, peak bytes: 15.0 MiB
| | | | | | | \- .general: Current used bytes: 14.4 MiB, peak bytes: 14.4 MiB
| | | | | | \- op.2.0.0.TableWrite.test-hive.part[67].sink: Current used bytes: 0.0 B, peak bytes: 0.0 B
| | | | | +- op.2.0.0.TableWrite.test-hive.part[72]: Current used bytes: 14.0 MiB, peak bytes: 15.0 MiB
| | | | | | +- writer_node_14514910955711142343: Current used bytes: 14.0 MiB, peak bytes: 15.0 MiB
| | | | | | | \- .general: Current used bytes: 14.0 MiB, peak bytes: 14.0 MiB
| | | | | | \- op.2.0.0.TableWrite.test-hive.part[72].sink: Current used bytes: 0.0 B, peak bytes: 0.0 B
| | | | | +- op.2.0.0.TableWrite.test-hive.part[195]: Current used bytes: 13.4 MiB, peak bytes: 14.0 MiB
| | | | | | +- writer_node_1485441638280894625: Current used bytes: 13.4 MiB, peak bytes: 14.0 MiB
| | | | | | | \- .general: Current used bytes: 13.4 MiB, peak bytes: 13.5 MiB
| | | | | | \- op.2.0.0.TableWrite.test-hive.part[195].sink: Current used bytes: 0.0 B, peak bytes: 0.0 B
| | | | | +- op.2.0.0.TableWrite.test-hive.part[123]: Current used bytes: 13.4 MiB, peak bytes: 14.0 MiB
| | | | | | +- writer_node_6634720992138079112: Current used bytes: 13.4 MiB, peak bytes: 14.0 MiB
| | | | | | | \- .general: Current used bytes: 13.4 MiB, peak bytes: 13.4 MiB
| | | | | | \- op.2.0.0.TableWrite.test-hive.part[123].sink: Current used bytes: 0.0 B, peak bytes: 0.0 B
| | | | | +- op.2.0.0.TableWrite.test-hive.part[185]: Current used bytes: 13.3 MiB, peak bytes: 14.0 MiB
| | | | | | +- writer_node_13535610051813534755: Current used bytes: 13.3 MiB, peak bytes: 14.0 MiB
| | | | | | | \- .general: Current used bytes: 13.3 MiB, peak bytes: 13.4 MiB
| | | | | | \- op.2.0.0.TableWrite.test-hive.part[185].sink: Current used bytes: 0.0 B, peak bytes: 0.0 B
| | | | | +- op.2.0.0.TableWrite.test-hive.part[200]: Current used bytes: 13.3 MiB, peak bytes: 14.0 MiB
| | | | | | +- writer_node_3875021975498280443: Current used bytes: 13.3 MiB, peak bytes: 14.0 MiB
| | | | | | | \- .general: Current used bytes: 13.3 MiB, peak bytes: 13.3 MiB
| | | | | | \- op.2.0.0.TableWrite.test-hive.part[200].sink: Current used bytes: 0.0 B, peak bytes: 0.0 B
| | | | | +- op.2.0.0.TableWrite.test-hive.part[59]: Current used bytes: 12.9 MiB, peak bytes: 13.0 MiB
| | | | | | +- writer_node_16272544847912354319: Current used bytes: 12.9 MiB, peak bytes: 13.0 MiB
| | | | | | | \- .general: Current used bytes: 12.9 MiB, peak bytes: 13.0 MiB
| | | | | | \- op.2.0.0.TableWrite.test-hive.part[59].sink: Current used bytes: 0.0 B, peak bytes: 0.0 B
| | | | | +- op.2.0.0.TableWrite.test-hive.part[166]: Current used bytes: 12.6 MiB, peak bytes: 13.0 MiB
| | | | | | +- writer_node_15485796053569832754: Current used bytes: 12.6 MiB, peak bytes: 13.0 MiB
| | | | | | | \- .general: Current used bytes: 12.6 MiB, peak bytes: 12.7 MiB
| | | | | | \- op.2.0.0.TableWrite.test-hive.part[166].sink: Current used bytes: 0.0 B, peak bytes: 0.0 B
| | | | | +- op.2.0.0.TableWrite.test-hive.part[157]: Current used bytes: 12.6 MiB, peak bytes: 13.0 MiB
| | | | | | +- writer_node_8295349848501177671: Current used bytes: 12.6 MiB, peak bytes: 13.0 MiB
| | | | | | | \- .general: Current used bytes: 12.6 MiB, peak bytes: 12.7 MiB
| | | | | | \- op.2.0.0.TableWrite.test-hive.part[157].sink: Current used bytes: 0.0 B, peak bytes: 0.0 B
| | | | | +- op.2.0.0.TableWrite.test-hive.part[64]: Current used bytes: 12.5 MiB, peak bytes: 13.0 MiB
| | | | | | +- writer_node_3964347345554411107: Current used bytes: 12.5 MiB, peak bytes: 13.0 MiB
| | | | | | | \- .general: Current used bytes: 12.5 MiB, peak bytes: 12.6 MiB
| | | | | | \- op.2.0.0.TableWrite.test-hive.part[64].sink: Current used bytes: 0.0 B, peak bytes: 0.0 B
| | | | | +- op.2.0.0.TableWrite.test-hive.part[88]: Current used bytes: 12.5 MiB, peak bytes: 13.0 MiB
| | | | | | +- writer_node_17501627176929250602: Current used bytes: 12.5 MiB, peak bytes: 13.0 MiB
| | | | | | | \- .general: Current used bytes: 12.5 MiB, peak bytes: 12.5 MiB
| | | | | | \- op.2.0.0.TableWrite.test-hive.part[88].sink: Current used bytes: 0.0 B, peak bytes: 0.0 B
| | | | | +- op.2.0.0.TableWrite.test-hive.part[165]: Current used bytes: 12.5 MiB, peak bytes: 13.0 MiB
| | | | | | +- writer_node_5103532593787709678: Current used bytes: 12.5 MiB, peak bytes: 13.0 MiB
| | | | | | | \- .general: Current used bytes: 12.5 MiB, peak bytes: 12.5 MiB
| | | | | | \- op.2.0.0.TableWrite.test-hive.part[165].sink: Current used bytes: 0.0 B, peak bytes: 0.0 B
| | | | | +- op.2.0.0.TableWrite.test-hive.part[111]: Current used bytes: 12.5 MiB, peak bytes: 13.0 MiB
| | | | | | +- writer_node_913615336661173562: Current used bytes: 12.5 MiB, peak bytes: 13.0 MiB
| | | | | | | \- .general: Current used bytes: 12.5 MiB, peak bytes: 12.5 MiB
| | | | | | \- op.2.0.0.TableWrite.test-hive.part[111].sink: Current used bytes: 0.0 B, peak bytes: 0.0 B
| | | | | +- op.2.0.0.TableWrite.test-hive.part[116]: Current used bytes: 12.5 MiB, peak bytes: 13.0 MiB
| | | | | | +- writer_node_16007420479186856992: Current used bytes: 12.5 MiB, peak bytes: 13.0 MiB
| | | | | | | \- .general: Current used bytes: 12.5 MiB, peak bytes: 12.5 MiB
| | | | | | \- op.2.0.0.TableWrite.test-hive.part[116].sink: Current used bytes: 0.0 B, peak bytes: 0.0 B
| | | | | +- op.2.0.0.TableWrite.test-hive.part[209]: Current used bytes: 12.5 MiB, peak bytes: 13.0 MiB
| | | | | | +- writer_node_13664721293374663391: Current used bytes: 12.5 MiB, peak bytes: 13.0 MiB
| | | | | | | \- .general: Current used bytes: 12.5 MiB, peak bytes: 12.5 MiB
| | | | | | \- op.2.0.0.TableWrite.test-hive.part[209].sink: Current used bytes: 0.0 B, peak bytes: 0.0 B
| | | | | +- op.2.0.0.TableWrite.test-hive.part[178]: Current used bytes: 12.5 MiB, peak bytes: 13.0 MiB
| | | | | | +- writer_node_7213323637378777817: Current used bytes: 12.5 MiB, peak bytes: 13.0 MiB
| | | | | | | \- .general: Current used bytes: 12.5 MiB, peak bytes: 12.5 MiB
| | | | | | \- op.2.0.0.TableWrite.test-hive.part[178].sink: Current used bytes: 0.0 B, peak bytes: 0.0 B
| | | | | +- op.2.0.0.TableWrite.test-hive.part[121]: Current used bytes: 12.5 MiB, peak bytes: 13.0 MiB
| | | | | | +- writer_node_5750783937089881792: Current used bytes: 12.5 MiB, peak bytes: 13.0 MiB
| | | | | | | \- .general: Current used bytes: 12.5 MiB, peak bytes: 12.5 MiB
| | | | | | \- op.2.0.0.TableWrite.test-hive.part[121].sink: Current used bytes: 0.0 B, peak bytes: 0.0 B
| | | | | +- op.2.0.0.TableWrite.test-hive.part[182]: Current used bytes: 12.5 MiB, peak bytes: 13.0 MiB
| | | | | | +- writer_node_12913595327245071541: Current used bytes: 12.5 MiB, peak bytes: 13.0 MiB
| | | | | | | \- .general: Current used bytes: 12.5 MiB, peak bytes: 12.5 MiB
| | | | | | \- op.2.0.0.TableWrite.test-hive.part[182].sink: Current used bytes: 0.0 B, peak bytes: 0.0 B
| | | | | +- op.2.0.0.TableWrite.test-hive.part[120]: Current used bytes: 12.5 MiB, peak bytes: 13.0 MiB
| | | | | | +- writer_node_8901286779273775421: Current used bytes: 12.5 MiB, peak bytes: 13.0 MiB
| | | | | | | \- .general: Current used bytes: 12.5 MiB, peak bytes: 12.5 MiB
| | | | | | \- op.2.0.0.TableWrite.test-hive.part[120].sink: Current used bytes: 0.0 B, peak bytes: 0.0 B
| | | | | +- op.2.0.0.TableWrite.test-hive.part[155]: Current used bytes: 12.5 MiB, peak bytes: 13.0 MiB
| | | | | | +- writer_node_9259036891116703451: Current used bytes: 12.5 MiB, peak bytes: 13.0 MiB
| | | | | | | \- .general: Current used bytes: 12.5 MiB, peak bytes: 12.5 MiB
| | | | | | \- op.2.0.0.TableWrite.test-hive.part[155].sink: Current used bytes: 0.0 B, peak bytes: 0.0 B
......
| | | | \- op.2.0.0.TableWrite: Current used bytes: 5.1 MiB, peak bytes: 5.1 MiB
| | | +- node.0: Current used bytes: 0.0 B, peak bytes: 0.0 B
| | | | \- op.0.0.0.ValueStream: Current used bytes: 0.0 B, peak bytes: 0.0 B
| | | \- node.1: Current used bytes: 0.0 B, peak bytes: 0.0 B
| | | \- op.1.0.0.FilterProject: Current used bytes: 0.0 B, peak bytes: 0.0 B
| | \- default_leaf: Current used bytes: 0.0 B, peak bytes: 0.0 B
| \- gluten::MemoryAllocator: Current used bytes: 0.0 B, peak bytes: 0.0 B
+- ShuffleReader.7: Current used bytes: 17.0 MiB, peak bytes: 24.0 MiB
| \- single: Current used bytes: 17.0 MiB, peak bytes: 24.0 MiB
| +- gluten::MemoryAllocator: Current used bytes: 10.0 MiB, peak bytes: 10.4 MiB
| \- root: Current used bytes: 384.0 KiB, peak bytes: 1024.0 KiB
| \- default_leaf: Current used bytes: 384.0 KiB, peak bytes: 384.0 KiB
+- ArrowContextInstance.14: Current used bytes: 8.0 MiB, peak bytes: 8.0 MiB
+- IndicatorVectorBase#init.7: Current used bytes: 0.0 B, peak bytes: 8.0 MiB
| \- single: Current used bytes: 0.0 B, peak bytes: 8.0 MiB
| +- root: Current used bytes: 0.0 B, peak bytes: 0.0 B
| | \- default_leaf: Current used bytes: 0.0 B, peak bytes: 0.0 B
| \- gluten::MemoryAllocator: Current used bytes: 0.0 B, peak bytes: 0.0 B
+- OverAcquire.DummyTarget.37: Current used bytes: 0.0 B, peak bytes: 2.4 MiB
+- OverAcquire.DummyTarget.36: Current used bytes: 0.0 B, peak bytes: 7.2 MiB
+- OverAcquire.DummyTarget.35: Current used bytes: 0.0 B, peak bytes: 460.8 MiB
\- ArrowContextInstance.15: Current used bytes: 0.0 B, peak bytes: 0.0 B
at org.apache.gluten.memory.memtarget.ThrowOnOomMemoryTarget.borrow(ThrowOnOomMemoryTarget.java:105)
at org.apache.gluten.memory.listener.ManagedReservationListener.reserve(ManagedReservationListener.java:43)
at org.apache.gluten.vectorized.ColumnarBatchOutIterator.nativeHasNext(Native Method)
at org.apache.gluten.vectorized.ColumnarBatchOutIterator.hasNextInternal(ColumnarBatchOutIterator.java:61)
at org.apache.gluten.vectorized.GeneralOutIterator.hasNext(GeneralOutIterator.java:37)
at scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:45)
at org.apache.gluten.utils.iterator.IteratorsV1$InvocationFlowProtection.hasNext(IteratorsV1.scala:159)
at org.apache.gluten.utils.iterator.IteratorsV1$IteratorCompleter.hasNext(IteratorsV1.scala:71)
at org.apache.gluten.utils.iterator.IteratorsV1$PayloadCloser.hasNext(IteratorsV1.scala:37)
at org.apache.gluten.utils.iterator.IteratorsV1$LifeTimeAccumulator.hasNext(IteratorsV1.scala:100)
at org.apache.spark.sql.execution.VeloxColumnarWriteFilesRDD.$anonfun$compute$2(VeloxColumnarWriteFilesExec.scala:208)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1428)
at org.apache.spark.sql.execution.VeloxColumnarWriteFilesRDD.compute(VeloxColumnarWriteFilesExec.scala:203)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:328)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:93)
at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
at org.apache.spark.scheduler.Task.run(Task.scala:141)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:620)
at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:94)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:623)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Retriable: False
Function: runInternal
File: /data/workspace/gluten-deploy-dist/ep/build-velox/build/velox_ep/velox/exec/Driver.cpp
Line: 677
Stack trace:
# 0 _ZN8facebook5velox7process10StackTraceC1Ei
# 1 _ZN8facebook5velox14VeloxExceptionC1EPKcmS3_St17basic_string_viewIcSt11char_traitsIcEES7_S7_S7_bNS1_4TypeES7_
# 2 _ZN8facebook5velox6detail14veloxCheckFailINS0_17VeloxRuntimeErrorERKSsEEvRKNS1_18VeloxCheckFailArgsET0_
# 3 _ZN8facebook5velox4exec6Driver11runInternalERSt10shared_ptrIS2_ERS3_INS1_13BlockingStateEERS3_INS0_9RowVectorEE.cold
# 4 _ZN8facebook5velox4exec6Driver4nextERSt10shared_ptrINS1_13BlockingStateEE
# 5 _ZN8facebook5velox4exec4Task4nextEPN5folly10SemiFutureINS3_4UnitEEE
# 6 _ZN6gluten24WholeStageResultIterator4nextEv
# 7 Java_org_apache_gluten_vectorized_ColumnarBatchOutIterator_nativeHasNext
# 8 0x00007f9fcb812ce8
at org.apache.gluten.vectorized.GeneralOutIterator.hasNext(GeneralOutIterator.java:39)
at scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:45)
at org.apache.gluten.utils.iterator.IteratorsV1$InvocationFlowProtection.hasNext(IteratorsV1.scala:159)
at org.apache.gluten.utils.iterator.IteratorsV1$IteratorCompleter.hasNext(IteratorsV1.scala:71)
at org.apache.gluten.utils.iterator.IteratorsV1$PayloadCloser.hasNext(IteratorsV1.scala:37)
at org.apache.gluten.utils.iterator.IteratorsV1$LifeTimeAccumulator.hasNext(IteratorsV1.scala:100)
at org.apache.spark.sql.execution.VeloxColumnarWriteFilesRDD.$anonfun$compute$2(VeloxColumnarWriteFilesExec.scala:208)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1428)
at org.apache.spark.sql.execution.VeloxColumnarWriteFilesRDD.compute(VeloxColumnarWriteFilesExec.scala:203)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:328)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:93)
at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
at org.apache.spark.scheduler.Task.run(Task.scala:141)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:620)
at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:94)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:623)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
So each write node takes more than 10 MiB of memory, which causes the OOM. Should we really keep one write node open for each partition? @JkSelf
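For reference, a minimal workaround sketch (not the Gluten-level fix being asked about): clustering rows by the Hive partition column before the write keeps each task from holding many partition writers open at once. The DataFrame name df, the column name dt, and the output path are placeholders for illustration.

// Workaround sketch (placeholder names): repartition by the Hive partition
// column so each task only touches a few partition values, then write.
import org.apache.spark.sql.functions.col

df.repartition(col("dt"))        // cluster rows with the same partition value into the same task
  .write
  .mode("overwrite")
  .partitionBy("dt")             // Hive-style partition column (placeholder)
  .parquet("/path/to/output")    // placeholder output path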
The query uses the Velox writer to write Parquet data and we are hitting this error.
Spark version: 3.3
Executor configs: 1 executor per node; memoryOverhead: 1024; cores: 16; memory: 13312; offHeap: 80896
This is what the plan looks like for the stage. Many scan and filter/project operators fall back to Spark because from_json is not supported yet. There are 23000 tasks spawned for this stage. It is a non-partitioned write. Any suggestions on how the configurations can be tweaked to make this job work?
Executor logs:
I20241216 17:07:15.881199 292073 WholeStageResultIterator.cc:234] Spill[root/root]: successfully reclaimed total 0B with shrunken 0B and spilled 0B.
I20241216 17:07:15.888181 292073 WholeStageResultIterator.cc:230] Spill[root/root]: trying to request spill for 8.00MB.
I20241216 17:07:15.888319 292073 WholeStageResultIterator.cc:234] Spill[root/root]: successfully reclaimed total 0B with shrunken 0B and spilled 0B.
I20241216 17:07:15.888373 292073 WholeStageResultIterator.cc:230] Spill[root/root]: trying to request spill for 8.00MB.
I20241216 17:07:15.888398 292073 WholeStageResultIterator.cc:234] Spill[root/root]: successfully reclaimed total 0B with shrunken 0B and spilled 0B.
I20241216 17:07:15.888561 292073 WholeStageResultIterator.cc:230] Spill[root/root]: trying to request spill for 509.40MB.
I20241216 17:07:15.888585 292073 WholeStageResultIterator.cc:234] Spill[root/root]: successfully reclaimed total 0B with shrunken 0B and spilled 0B.
I20241216 17:07:15.888605 292073 WholeStageResultIterator.cc:230] Spill[root/root]: trying to request spill for 509.40MB.
I20241216 17:07:15.888621 292073 WholeStageResultIterator.cc:234] Spill[root/root]: successfully reclaimed total 0B with shrunken 0B and spilled 0B.
24/12/16 17:07:15 ERROR ManagedReservationListener: Error reserving memory from target
org.apache.gluten.memory.memtarget.ThrowOnOomMemoryTarget$OutOfMemoryException: Not enough spark off-heap execution memory. Acquired: 8.0 MiB, granted: 2.0 MiB. Try tweaking config option spark.memory.offHeap.size to get larger space to run this application (if spark.gluten.memory.dynamic.offHeap.sizing.enabled is not enabled).
Current config settings:
spark.gluten.memory.offHeap.size.in.bytes=75.0 GiB
spark.gluten.memory.task.offHeap.size.in.bytes=4.7 GiB
spark.gluten.memory.conservative.task.offHeap.size.in.bytes=2.3 GiB
spark.memory.offHeap.enabled=true
spark.gluten.memory.dynamic.offHeap.sizing.enabled=false
Memory consumer stats:
Task.28642: Current used bytes: 4.7 GiB, peak bytes: N/A
\- Gluten.Tree.1251: Current used bytes: 4.7 GiB, peak bytes: 4.7 GiB
\- root.1251: Current used bytes: 4.7 GiB, peak bytes: 4.7 GiB
+- ArrowContextInstance.272: Current used bytes: 2.9 GiB, peak bytes: 4.3 GiB
+- RowToColumnar.272: Current used bytes: 1696.0 MiB, peak bytes: 1698.0 MiB
| \- single: Current used bytes: 1696.0 MiB, peak bytes: 1696.0 MiB
| +- root: Current used bytes: 1695.8 MiB, peak bytes: 1696.0 MiB
| | \- default_leaf: Current used bytes: 1695.8 MiB, peak bytes: 1695.8 MiB
| \- gluten::MemoryAllocator: Current used bytes: 0.0 B, peak bytes: 0.0 B
+- VeloxWriter.272: Current used bytes: 112.0 MiB, peak bytes: 184.0 MiB
| \- single: Current used bytes: 112.0 MiB, peak bytes: 184.0 MiB
| +- root: Current used bytes: 111.5 MiB, peak bytes: 184.0 MiB
| | +- datasource.272: Current used bytes: 111.5 MiB, peak bytes: 184.0 MiB
| | | \- .general: Current used bytes: 111.5 MiB, peak bytes: 176.7 MiB
| | \- default_leaf: Current used bytes: 0.0 B, peak bytes: 0.0 B
| \- gluten::MemoryAllocator: Current used bytes: 0.0 B, peak bytes: 0.0 B
+- ColumnarToRow.393: Current used bytes: 64.0 MiB, peak bytes: 64.0 MiB
| \- single: Current used bytes: 64.0 MiB, peak bytes: 64.0 MiB
| +- root: Current used bytes: 64.0 MiB, peak bytes: 64.0 MiB
| | \- default_leaf: Current used bytes: 64.0 MiB, peak bytes: 64.0 MiB
| \- gluten::MemoryAllocator: Current used bytes: 0.0 B, peak bytes: 0.0 B
+- NativePlanEvaluator-1332.0: Current used bytes: 4.0 MiB, peak bytes: 16.0 MiB
| \- single: Current used bytes: 4.0 MiB, peak bytes: 16.0 MiB
| +- root: Current used bytes: 900.9 KiB, peak bytes: 15.0 MiB
| | +- task.Gluten_Stage_26_TID_28642_VTID_1332: Current used bytes: 899.4 KiB, peak bytes: 14.0 MiB
| | | +- node.1: Current used bytes: 482.5 KiB, peak bytes: 2.0 MiB
| | | | \- op.1.0.0.FilterProject: Current used bytes: 482.5 KiB, peak bytes: 1443.3 KiB
| | | +- node.3: Current used bytes: 294.4 KiB, peak bytes: 11.0 MiB
| | | | \- op.3.0.0.FilterProject: Current used bytes: 294.4 KiB, peak bytes: 10.9 MiB
| | | +- node.2: Current used bytes: 122.5 KiB, peak bytes: 1024.0 KiB
| | | | \- op.2.0.0.FilterProject: Current used bytes: 122.5 KiB, peak bytes: 380.0 KiB
| | | \- node.0: Current used bytes: 0.0 B, peak bytes: 0.0 B
| | | \- op.0.0.0.ValueStream: Current used bytes: 0.0 B, peak bytes: 0.0 B
| | \- default_leaf: Current used bytes: 1536.0 B, peak bytes: 1664.0 B
| \- gluten::MemoryAllocator: Current used bytes: 0.0 B, peak bytes: 0.0 B
+- NativePlanEvaluator-1331.0: Current used bytes: 2.0 MiB, peak bytes: 8.0 MiB
| \- single: Current used bytes: 2.0 MiB, peak bytes: 8.0 MiB
| +- root: Current used bytes: 120.0 KiB, peak bytes: 2.0 MiB
| | +- task.Gluten_Stage_26_TID_28642_VTID_1331: Current used bytes: 120.0 KiB, peak bytes: 2.0 MiB
| | | +- node.1: Current used bytes: 96.0 KiB, peak bytes: 1024.0 KiB
| | | | \- op.1.0.0.Unnest: Current used bytes: 96.0 KiB, peak bytes: 96.0 KiB
| | | +- node.2: Current used bytes: 24.0 KiB, peak bytes: 1024.0 KiB
| | | | \- op.2.0.0.FilterProject: Current used bytes: 24.0 KiB, peak bytes: 24.0 KiB
| | | \- node.0: Current used bytes: 0.0 B, peak bytes: 0.0 B
| | | \- op.0.0.0.ValueStream: Current used bytes: 0.0 B, peak bytes: 0.0 B
| | \- default_leaf: Current used bytes: 0.0 B, peak bytes: 0.0 B
| \- gluten::MemoryAllocator: Current used bytes: 0.0 B, peak bytes: 0.0 B
+- IteratorMetrics.1155.OverAcquire.0: Current used bytes: 0.0 B, peak bytes: 0.0 B
+- VeloxWriter.272.OverAcquire.0: Current used bytes: 0.0 B, peak bytes: 55.2 MiB
+- RowToColumnar.272.OverAcquire.0: Current used bytes: 0.0 B, peak bytes: 391.2 MiB
+- NativePlanEvaluator-1331.0.OverAcquire.0: Current used bytes: 0.0 B, peak bytes: 2.4 MiB
+- IndicatorVectorBase#init.1251.OverAcquire.0: Current used bytes: 0.0 B, peak bytes: 0.0 B
+- ColumnarToRow.393.OverAcquire.0: Current used bytes: 0.0 B, peak bytes: 19.2 MiB
+- IteratorMetrics.1155: Current used bytes: 0.0 B, peak bytes: 0.0 B
| \- single: Current used bytes: 0.0 B, peak bytes: 0.0 B
| +- root: Current used bytes: 0.0 B, peak bytes: 0.0 B
| | \- default_leaf: Current used bytes: 0.0 B, peak bytes: 0.0 B
| \- gluten::MemoryAllocator: Current used bytes: 0.0 B, peak bytes: 0.0 B
+- NativePlanEvaluator-1332.0.OverAcquire.0: Current used bytes: 0.0 B, peak bytes: 4.8 MiB
\- IndicatorVectorBase#init.1251: Current used bytes: 0.0 B, peak bytes: 0.0 B
\- single: Current used bytes: 0.0 B, peak bytes: 0.0 B
+- root: Current used bytes: 0.0 B, peak bytes: 0.0 B
| \- default_leaf: Current used bytes: 0.0 B, peak bytes: 0.0 B
\- gluten::MemoryAllocator: Current used bytes: 0.0 B, peak bytes: 0.0 B
at org.apache.gluten.memory.memtarget.ThrowOnOomMemoryTarget.borrow(ThrowOnOomMemoryTarget.java:105)
at org.apache.gluten.memory.listener.ManagedReservationListener.reserve(ManagedReservationListener.java:49)
at org.apache.gluten.vectorized.NativeRowToColumnarJniWrapper.nativeConvertRowToColumnar(Native Method)
Task logs:
org.apache.spark.SparkException: Task failed while writing rows.
at org.apache.spark.sql.errors.QueryExecutionErrors$.taskFailedWhileWritingRowsError(QueryExecutionErrors.scala:654)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:448)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$22(FileFormatWriter.scala:346)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:136)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1505)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: org.apache.gluten.exception.GlutenException: org.apache.gluten.exception.GlutenException: Exception: VeloxRuntimeError
Error Source: RUNTIME
Error Code: INVALID_STATE
Reason: Operator::getOutput failed for [operator: ValueStream, plan node ID: 0]: Error during calling Java code from native code: org.apache.gluten.memory.memtarget.ThrowOnOomMemoryTarget$OutOfMemoryException: Not enough spark off-heap execution memory. Acquired: 3.9 GiB, granted: 2.7 GiB. Try tweaking config option spark.memory.offHeap.size to get larger space to run this application (if spark.gluten.memory.dynamic.offHeap.sizing.enabled is not enabled).
Current config settings:
spark.gluten.memory.offHeap.size.in.bytes=79.0 GiB
spark.gluten.memory.task.offHeap.size.in.bytes=4.9 GiB
spark.gluten.memory.conservative.task.offHeap.size.in.bytes=2.5 GiB
spark.memory.offHeap.enabled=true
spark.gluten.memory.dynamic.offHeap.sizing.enabled=false
Memory consumer stats:
Task.28426: Current used bytes: 2.3 GiB, peak bytes: N/A
\- Gluten.Tree.1304: Current used bytes: 2.3 GiB, peak bytes: 4.9 GiB
\- root.1304: Current used bytes: 2.3 GiB, peak bytes: 4.9 GiB
+- ArrowContextInstance.276: Current used bytes: 2000.0 MiB, peak bytes: 4.6 GiB
+- RowToColumnar.276: Current used bytes: 152.0 MiB, peak bytes: 1952.0 MiB
| \- single: Current used bytes: 152.0 MiB, peak bytes: 1952.0 MiB
| +- root: Current used bytes: 151.6 MiB, peak bytes: 1952.0 MiB
| | \- default_leaf: Current used bytes: 151.6 MiB, peak bytes: 1950.3 MiB
| \- gluten::MemoryAllocator: Current used bytes: 0.0 B, peak bytes: 0.0 B
+- VeloxWriter.276: Current used bytes: 88.0 MiB, peak bytes: 176.0 MiB
| \- single: Current used bytes: 88.0 MiB, peak bytes: 176.0 MiB
| +- root: Current used bytes: 82.2 MiB, peak bytes: 176.0 MiB
| | +- datasource.276: Current used bytes: 82.2 MiB, peak bytes: 176.0 MiB
| | | \- .general: Current used bytes: 82.2 MiB, peak bytes: 175.8 MiB
| | \- default_leaf: Current used bytes: 0.0 B, peak bytes: 0.0 B
| \- gluten::MemoryAllocator: Current used bytes: 0.0 B, peak bytes: 0.0 B
+- ColumnarToRow.413: Current used bytes: 64.0 MiB, peak bytes: 64.0 MiB
| \- single: Current used bytes: 64.0 MiB, peak bytes: 64.0 MiB
| +- root: Current used bytes: 64.0 MiB, peak bytes: 64.0 MiB
| | \- default_leaf: Current used bytes: 64.0 MiB, peak bytes: 64.0 MiB
| \- gluten::MemoryAllocator: Current used bytes: 0.0 B, peak bytes: 0.0 B
+- NativePlanEvaluator-1381.0: Current used bytes: 4.0 MiB, peak bytes: 16.0 MiB
| \- single: Current used bytes: 4.0 MiB, peak bytes: 16.0 MiB
| +- root: Current used bytes: 722.1 KiB, peak bytes: 16.0 MiB
| | +- task.Gluten_Stage_26_TID_28426_VTID_1381: Current used bytes: 720.6 KiB, peak bytes: 15.0 MiB
| | | +- node.1: Current used bytes: 369.0 KiB, peak bytes: 2.0 MiB
| | | | \- op.1.0.0.FilterProject: Current used bytes: 369.0 KiB, peak bytes: 1523.0 KiB
| | | +- node.3: Current used bytes: 229.1 KiB, peak bytes: 12.0 MiB
| | | | \- op.3.0.0.FilterProject: Current used bytes: 229.1 KiB, peak bytes: 11.2 MiB
| | | +- node.2: Current used bytes: 122.5 KiB, peak bytes: 1024.0 KiB
| | | | \- op.2.0.0.FilterProject: Current used bytes: 122.5 KiB, peak bytes: 380.0 KiB
| | | \- node.0: Current used bytes: 0.0 B, peak bytes: 0.0 B
| | | \- op.0.0.0.ValueStream: Current used bytes: 0.0 B, peak bytes: 0.0 B
| | \- default_leaf: Current used bytes: 1536.0 B, peak bytes: 1664.0 B
| \- gluten::MemoryAllocator: Current used bytes: 0.0 B, peak bytes: 0.0 B
+- NativePlanEvaluator-1380.0: Current used bytes: 2.0 MiB, peak bytes: 8.0 MiB
| \- single: Current used bytes: 2.0 MiB, peak bytes: 8.0 MiB
| +- root: Current used bytes: 120.0 KiB, peak bytes: 2.0 MiB
| | +- task.Gluten_Stage_26_TID_28426_VTID_1380: Current used bytes: 120.0 KiB, peak bytes: 2.0 MiB
| | | +- node.1: Current used bytes: 96.0 KiB, peak bytes: 1024.0 KiB
| | | | \- op.1.0.0.Unnest: Current used bytes: 96.0 KiB, peak bytes: 96.0 KiB
| | | +- node.2: Current used bytes: 24.0 KiB, peak bytes: 1024.0 KiB
| | | | \- op.2.0.0.FilterProject: Current used bytes: 24.0 KiB, peak bytes: 24.0 KiB
| | | \- node.0: Current used bytes: 0.0 B, peak bytes: 0.0 B
| | | \- op.0.0.0.ValueStream: Current used bytes: 0.0 B, peak bytes: 0.0 B
| | \- default_leaf: Current used bytes: 0.0 B, peak bytes: 0.0 B
| \- gluten::MemoryAllocator: Current used bytes: 0.0 B, peak bytes: 0.0 B
+- IndicatorVectorBase#init.1304: Current used bytes: 0.0 B, peak bytes: 0.0 B
| \- single: Current used bytes: 0.0 B, peak bytes: 0.0 B
| +- root: Current used bytes: 0.0 B, peak bytes: 0.0 B
| | \- default_leaf: Current used bytes: 0.0 B, peak bytes: 0.0 B
| \- gluten::MemoryAllocator: Current used bytes: 0.0 B, peak bytes: 0.0 B
+- VeloxWriter.276.OverAcquire.0: Current used bytes: 0.0 B, peak bytes: 52.8 MiB
+- NativePlanEvaluator-1381.0.OverAcquire.0: Current used bytes: 0.0 B, peak bytes: 4.8 MiB
+- IteratorMetrics.1204.OverAcquire.0: Current used bytes: 0.0 B, peak bytes: 0.0 B
+- ColumnarToRow.413.OverAcquire.0: Current used bytes: 0.0 B, peak bytes: 19.2 MiB
+- RowToColumnar.276.OverAcquire.0: Current used bytes: 0.0 B, peak bytes: 585.6 MiB
+- IteratorMetrics.1204: Current used bytes: 0.0 B, peak bytes: 0.0 B
| \- single: Current used bytes: 0.0 B, peak bytes: 0.0 B
| +- root: Current used bytes: 0.0 B, peak bytes: 0.0 B
| | \- default_leaf: Current used bytes: 0.0 B, peak bytes: 0.0 B
| \- gluten::MemoryAllocator: Current used bytes: 0.0 B, peak bytes: 0.0 B
+- IndicatorVectorBase#init.1304.OverAcquire.0: Current used bytes: 0.0 B, peak bytes: 0.0 B
\- NativePlanEvaluator-1380.0.OverAcquire.0: Current used bytes: 0.0 B, peak bytes: 2.4 MiB
at org.apache.gluten.memory.memtarget.ThrowOnOomMemoryTarget.borrow(ThrowOnOomMemoryTarget.java:105)
at org.apache.gluten.memory.arrow.alloc.ManagedAllocationListener.onPreAllocation(ManagedAllocationListener.java:61)
at org.apache.gluten.shaded.org.apache.arrow.memory.BaseAllocator.buffer(BaseAllocator.java:300)
at org.apache.gluten.shaded.org.apache.arrow.memory.RootAllocator.buffer(RootAllocator.java:29)
at org.apache.gluten.shaded.org.apache.arrow.memory.BaseAllocator.buffer(BaseAllocator.java:280)
at org.apache.gluten.shaded.org.apache.arrow.memory.RootAllocator.buffer(RootAllocator.java:29)
at org.apache.gluten.execution.RowToVeloxColumnarExec$$anon$1.next(RowToVeloxColumnarExec.scala:200)
at org.apache.gluten.execution.RowToVeloxColumnarExec$$anon$1.next(RowToVeloxColumnarExec.scala:138)
at org.apache.gluten.iterator.IteratorsV1$InvocationFlowProtection.next(IteratorsV1.scala:178)
at org.apache.gluten.iterator.IteratorsV1$IteratorCompleter.next(IteratorsV1.scala:79)
at org.apache.gluten.iterator.IteratorsV1$PayloadCloser.next(IteratorsV1.scala:41)
at scala.collection.convert.Wrappers$IteratorWrapper.next(Wrappers.scala:33)
at org.apache.gluten.vectorized.ColumnarBatchInIterator.next(ColumnarBatchInIterator.java:39)
at org.apache.gluten.vectorized.ColumnarBatchOutIterator.nativeHasNext(Native Method)
at org.apache.gluten.vectorized.ColumnarBatchOutIterator.hasNext0(ColumnarBatchOutIterator.java:57)
at org.apache.gluten.iterator.ClosableIterator.hasNext(ClosableIterator.java:39)
at scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:45)
at org.apache.gluten.iterator.IteratorsV1$InvocationFlowProtection.hasNext(IteratorsV1.scala:159)
at org.apache.gluten.iterator.IteratorsV1$IteratorCompleter.hasNext(IteratorsV1.scala:71)
at org.apache.gluten.iterator.IteratorsV1$PayloadCloser.hasNext(IteratorsV1.scala:37)
at org.apache.gluten.iterator.IteratorsV1$LifeTimeAccumulator.hasNext(IteratorsV1.scala:100)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at org.apache.spark.sql.execution.datasources.FileFormatDataWriter.writeWithIterator(FileFormatDataWriter.scala:95)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$executeTask$1(FileFormatWriter.scala:429)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1539)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:438)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$22(FileFormatWriter.scala:346)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:136)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1505)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
Retriable: False
Function: operator()
File: /home/abc/incubator-gluten/ep/build-velox/build/velox_ep/velox/exec/Driver.cpp
Line: 601
Stack trace:
0 _ZN8facebook5velox7process10StackTraceC1Ei
1 _ZN8facebook5velox14VeloxExceptionC1EPKcmS3_St17basic_string_viewIcSt11char_traitsIcEES7_S7_S7_bNS1_4TypeES7_
2 _ZN8facebook5velox6detail14veloxCheckFailINS0_17VeloxRuntimeErrorERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEEEvRKNS1_18VeloxCheckFailArgsET0_
3 _ZZN8facebook5velox4exec6Driver11runInternalERSt10shared_ptrIS2_ERS3_INS1_13BlockingStateEERS3_INS0_9RowVectorEEENKUlvE3_clEv.cold
4 _ZN8facebook5velox4exec6Driver11runInternalERSt10shared_ptrIS2_ERS3_INS1_13BlockingStateEERS3_INS0_9RowVectorEE
5 _ZN8facebook5velox4exec6Driver4nextEPN5folly10SemiFutureINS3_4UnitEEE
6 _ZN8facebook5velox4exec4Task4nextEPN5folly10SemiFutureINS3_4UnitEEE
7 _ZN6gluten24WholeStageResultIterator4nextEv
8 Java_org_apache_gluten_vectorized_ColumnarBatchOutIterator_nativeHasNext
9 0x00007f36695f9a7b
at org.apache.gluten.iterator.ClosableIterator.hasNext(ClosableIterator.java:41)
at scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:45)
at org.apache.gluten.iterator.IteratorsV1$InvocationFlowProtection.hasNext(IteratorsV1.scala:159)
at org.apache.gluten.iterator.IteratorsV1$IteratorCompleter.hasNext(IteratorsV1.scala:71)
at org.apache.gluten.iterator.IteratorsV1$PayloadCloser.hasNext(IteratorsV1.scala:37)
at org.apache.gluten.iterator.IteratorsV1$LifeTimeAccumulator.hasNext(IteratorsV1.scala:100)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at org.apache.spark.sql.execution.datasources.FileFormatDataWriter.writeWithIterator(FileFormatDataWriter.scala:95)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$executeTask$1(FileFormatWriter.scala:429)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1539)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:438)
... 9 more
Caused by: org.apache.gluten.exception.GlutenException: Exception: VeloxRuntimeError
Error Source: RUNTIME
Error Code: INVALID_STATE
Reason: Operator::getOutput failed for [operator: ValueStream, plan node ID: 0]: Error during calling Java code from native code: org.apache.gluten.memory.memtarget.ThrowOnOomMemoryTarget$OutOfMemoryException: Not enough spark off-heap execution memory. Acquired: 3.9 GiB, granted: 2.7 GiB. Try tweaking config option spark.memory.offHeap.size to get larger space to run this application (if spark.gluten.memory.dynamic.offHeap.sizing.enabled is not enabled).
Current config settings:
spark.gluten.memory.offHeap.size.in.bytes=79.0 GiB
spark.gluten.memory.task.offHeap.size.in.bytes=4.9 GiB
spark.gluten.memory.conservative.task.offHeap.size.in.bytes=2.5 GiB
spark.memory.offHeap.enabled=true
spark.gluten.memory.dynamic.offHeap.sizing.enabled=false
Memory consumer stats:
Task.28426: Current used bytes: 2.3 GiB, peak bytes: N/A
\- Gluten.Tree.1304: Current used bytes: 2.3 GiB, peak bytes: 4.9 GiB
\- root.1304: Current used bytes: 2.3 GiB, peak bytes: 4.9 GiB
+- ArrowContextInstance.276: Current used bytes: 2000.0 MiB, peak bytes: 4.6 GiB
+- RowToColumnar.276: Current used bytes: 152.0 MiB, peak bytes: 1952.0 MiB
| \- single: Current used bytes: 152.0 MiB, peak bytes: 1952.0 MiB
| +- root: Current used bytes: 151.6 MiB, peak bytes: 1952.0 MiB
| | \- default_leaf: Current used bytes: 151.6 MiB, peak bytes: 1950.3 MiB
| \- gluten::MemoryAllocator: Current used bytes: 0.0 B, peak bytes: 0.0 B
+- VeloxWriter.276: Current used bytes: 88.0 MiB, peak bytes: 176.0 MiB
| \- single: Current used bytes: 88.0 MiB, peak bytes: 176.0 MiB
| +- root: Current used bytes: 82.2 MiB, peak bytes: 176.0 MiB
| | +- datasource.276: Current used bytes: 82.2 MiB, peak bytes: 176.0 MiB
| | | \- .general: Current used bytes: 82.2 MiB, peak bytes: 175.8 MiB
| | \- default_leaf: Current used bytes: 0.0 B, peak bytes: 0.0 B
| \- gluten::MemoryAllocator: Current used bytes: 0.0 B, peak bytes: 0.0 B
+- ColumnarToRow.413: Current used bytes: 64.0 MiB, peak bytes: 64.0 MiB
| \- single: Current used bytes: 64.0 MiB, peak bytes: 64.0 MiB
| +- root: Current used bytes: 64.0 MiB, peak bytes: 64.0 MiB
| | \- default_leaf: Current used bytes: 64.0 MiB, peak bytes: 64.0 MiB
| \- gluten::MemoryAllocator: Current used bytes: 0.0 B, peak bytes: 0.0 B
+- NativePlanEvaluator-1381.0: Current used bytes: 4.0 MiB, peak bytes: 16.0 MiB
| \- single: Current used bytes: 4.0 MiB, peak bytes: 16.0 MiB
| +- root: Current used bytes: 722.1 KiB, peak bytes: 16.0 MiB
| | +- task.Gluten_Stage_26_TID_28426_VTID_1381: Current used bytes: 720.6 KiB, peak bytes: 15.0 MiB
| | | +- node.1: Current used bytes: 369.0 KiB, peak bytes: 2.0 MiB
| | | | \- op.1.0.0.FilterProject: Current used bytes: 369.0 KiB, peak bytes: 1523.0 KiB
| | | +- node.3: Current used bytes: 229.1 KiB, peak bytes: 12.0 MiB
| | | | \- op.3.0.0.FilterProject: Current used bytes: 229.1 KiB, peak bytes: 11.2 MiB
| | | +- node.2: Current used bytes: 122.5 KiB, peak bytes: 1024.0 KiB
| | | | \- op.2.0.0.FilterProject: Current used bytes: 122.5 KiB, peak bytes: 380.0 KiB
| | | \- node.0: Current used bytes: 0.0 B, peak bytes: 0.0 B
| | | \- op.0.0.0.ValueStream: Current used bytes: 0.0 B, peak bytes: 0.0 B
| | \- default_leaf: Current used bytes: 1536.0 B, peak bytes: 1664.0 B
| \- gluten::MemoryAllocator: Current used bytes: 0.0 B, peak bytes: 0.0 B
+- NativePlanEvaluator-1380.0: Current used bytes: 2.0 MiB, peak bytes: 8.0 MiB
| \- single: Current used bytes: 2.0 MiB, peak bytes: 8.0 MiB
| +- root: Current used bytes: 120.0 KiB, peak bytes: 2.0 MiB
| | +- task.Gluten_Stage_26_TID_28426_VTID_1380: Current used bytes: 120.0 KiB, peak bytes: 2.0 MiB
| | | +- node.1: Current used bytes: 96.0 KiB, peak bytes: 1024.0 KiB
| | | | \- op.1.0.0.Unnest: Current used bytes: 96.0 KiB, peak bytes: 96.0 KiB
| | | +- node.2: Current used bytes: 24.0 KiB, peak bytes: 1024.0 KiB
| | | | \- op.2.0.0.FilterProject: Current used bytes: 24.0 KiB, peak bytes: 24.0 KiB
| | | \- node.0: Current used bytes: 0.0 B, peak bytes: 0.0 B
| | | \- op.0.0.0.ValueStream: Current used bytes: 0.0 B, peak bytes: 0.0 B
| | \- default_leaf: Current used bytes: 0.0 B, peak bytes: 0.0 B
| \- gluten::MemoryAllocator: Current used bytes: 0.0 B, peak bytes: 0.0 B
+- IndicatorVectorBase#init.1304: Current used bytes: 0.0 B, peak bytes: 0.0 B
| \- single: Current used bytes: 0.0 B, peak bytes: 0.0 B
| +- root: Current used bytes: 0.0 B, peak bytes: 0.0 B
| | \- default_leaf: Current used bytes: 0.0 B, peak bytes: 0.0 B
| \- gluten::MemoryAllocator: Current used bytes: 0.0 B, peak bytes: 0.0 B
+- VeloxWriter.276.OverAcquire.0: Current used bytes: 0.0 B, peak bytes: 52.8 MiB
+- NativePlanEvaluator-1381.0.OverAcquire.0: Current used bytes: 0.0 B, peak bytes: 4.8 MiB
+- IteratorMetrics.1204.OverAcquire.0: Current used bytes: 0.0 B, peak bytes: 0.0 B
+- ColumnarToRow.413.OverAcquire.0: Current used bytes: 0.0 B, peak bytes: 19.2 MiB
+- RowToColumnar.276.OverAcquire.0: Current used bytes: 0.0 B, peak bytes: 585.6 MiB
+- IteratorMetrics.1204: Current used bytes: 0.0 B, peak bytes: 0.0 B
| \- single: Current used bytes: 0.0 B, peak bytes: 0.0 B
| +- root: Current used bytes: 0.0 B, peak bytes: 0.0 B
| | \- default_leaf: Current used bytes: 0.0 B, peak bytes: 0.0 B
| \- gluten::MemoryAllocator: Current used bytes: 0.0 B, peak bytes: 0.0 B
+- IndicatorVectorBase#init.1304.OverAcquire.0: Current used bytes: 0.0 B, peak bytes: 0.0 B
\- NativePlanEvaluator-1380.0.OverAcquire.0: Current used bytes: 0.0 B, peak bytes: 2.4 MiB
at org.apache.gluten.memory.memtarget.ThrowOnOomMemoryTarget.borrow(ThrowOnOomMemoryTarget.java:105)
at org.apache.gluten.memory.arrow.alloc.ManagedAllocationListener.onPreAllocation(ManagedAllocationListener.java:61)
at org.apache.gluten.shaded.org.apache.arrow.memory.BaseAllocator.buffer(BaseAllocator.java:300)
at org.apache.gluten.shaded.org.apache.arrow.memory.RootAllocator.buffer(RootAllocator.java:29)
at org.apache.gluten.shaded.org.apache.arrow.memory.BaseAllocator.buffer(BaseAllocator.java:280)
at org.apache.gluten.shaded.org.apache.arrow.memory.RootAllocator.buffer(RootAllocator.java:29)
at org.apache.gluten.execution.RowToVeloxColumnarExec$$anon$1.next(RowToVeloxColumnarExec.scala:200)
at org.apache.gluten.execution.RowToVeloxColumnarExec$$anon$1.next(RowToVeloxColumnarExec.scala:138)
at org.apache.gluten.iterator.IteratorsV1$InvocationFlowProtection.next(IteratorsV1.scala:178)
at org.apache.gluten.iterator.IteratorsV1$IteratorCompleter.next(IteratorsV1.scala:79)
at org.apache.gluten.iterator.IteratorsV1$PayloadCloser.next(IteratorsV1.scala:41)
at scala.collection.convert.Wrappers$IteratorWrapper.next(Wrappers.scala:33)
at org.apache.gluten.vectorized.ColumnarBatchInIterator.next(ColumnarBatchInIterator.java:39)
at org.apache.gluten.vectorized.ColumnarBatchOutIterator.nativeHasNext(Native Method)
at org.apache.gluten.vectorized.ColumnarBatchOutIterator.hasNext0(ColumnarBatchOutIterator.java:57)
at org.apache.gluten.iterator.ClosableIterator.hasNext(ClosableIterator.java:39)
at scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:45)
at org.apache.gluten.iterator.IteratorsV1$InvocationFlowProtection.hasNext(IteratorsV1.scala:159)
at org.apache.gluten.iterator.IteratorsV1$IteratorCompleter.hasNext(IteratorsV1.scala:71)
at org.apache.gluten.iterator.IteratorsV1$PayloadCloser.hasNext(IteratorsV1.scala:37)
at org.apache.gluten.iterator.IteratorsV1$LifeTimeAccumulator.hasNext(IteratorsV1.scala:100)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at org.apache.spark.sql.execution.datasources.FileFormatDataWriter.writeWithIterator(FileFormatDataWriter.scala:95)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$executeTask$1(FileFormatWriter.scala:429)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1539)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:438)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$22(FileFormatWriter.scala:346)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:136)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1505)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
Retriable: False
Function: operator()
File: /home/abc/incubator-gluten/ep/build-velox/build/velox_ep/velox/exec/Driver.cpp
Line: 601
Stack trace:
0 _ZN8facebook5velox7process10StackTraceC1Ei
1 _ZN8facebook5velox14VeloxExceptionC1EPKcmS3_St17basic_string_viewIcSt11char_traitsIcEES7_S7_S7_bNS1_4TypeES7_
2 _ZN8facebook5velox6detail14veloxCheckFailINS0_17VeloxRuntimeErrorERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEEEvRKNS1_18VeloxCheckFailArgsET0_
3 _ZZN8facebook5velox4exec6Driver11runInternalERSt10shared_ptrIS2_ERS3_INS1_13BlockingStateEERS3_INS0_9RowVectorEEENKUlvE3_clEv.cold
4 _ZN8facebook5velox4exec6Driver11runInternalERSt10shared_ptrIS2_ERS3_INS1_13BlockingStateEERS3_INS0_9RowVectorEE
5 _ZN8facebook5velox4exec6Driver4nextEPN5folly10SemiFutureINS3_4UnitEEE
6 _ZN8facebook5velox4exec4Task4nextEPN5folly10SemiFutureINS3_4UnitEEE
7 _ZN6gluten24WholeStageResultIterator4nextEv
8 Java_org_apache_gluten_vectorized_ColumnarBatchOutIterator_nativeHasNext
9 0x00007f36695f9a7b
at org.apache.gluten.vectorized.ColumnarBatchOutIterator.nativeHasNext(Native Method)
at org.apache.gluten.vectorized.ColumnarBatchOutIterator.hasNext0(ColumnarBatchOutIterator.java:57)
at org.apache.gluten.iterator.ClosableIterator.hasNext(ClosableIterator.java:39)
... 19 more
It's not the same issue. Here your memory is allocated by ArrowContextInstance and R2C. What's your reducer#?
Sorry, I didn't understand your question. Are you asking for the number of reducers? This query had a single stage with 23663 tasks, where each task does a union of data from 7 different branches and writes the results to a storage location.
Your error message shows the memory is occupied by ArrowContextInstance, which is used by shuffle and by the Velox-to-Arrow converter in the Parquet writer. If it's in shuffle, you may try sort-based shuffle. If it's in the Parquet write, the Arrow batch size may be too large, either because there are too many rows in the batch or because the data size of each row is too large. You can check the batch size in the UI.
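For illustration, a hedged sketch of the kind of tuning described above. The spark.memory.offHeap.* keys appear in the error message itself; the Gluten batch-size key is an assumption and should be verified against the configuration docs for your Gluten version.

// Tuning sketch. spark.memory.offHeap.* are standard Spark keys; the Gluten
// key below is an assumption -- check the Gluten docs for the exact name.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("velox-parquet-write-tuning")
  .config("spark.memory.offHeap.enabled", "true")
  .config("spark.memory.offHeap.size", "80g")               // grow the off-heap execution pool
  .config("spark.gluten.sql.columnar.maxBatchSize", "2048") // assumed key: smaller columnar batches
  .getOrCreate()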
@wForget Is the issue still there on your side? It looks like it is not fixed.
Yes, this issue still exists, but we can avoid it with the Kyuubi Spark SQL extension (the InsertRebalanceBeforeWrite and FinalStageConfigIsolation optimizers).
InsertRebalanceBeforeWrite optimizes the plan like this:
QueryPlans
|
RebalanceByColumn(part columns)
|
WriteFileExec
Then we disable coalescePartitions for the final write stage:
set spark.sql.finalStage.adaptive.coalescePartitions.enabled=false;
After that, different Hive partitions are distributed to different tasks, and we can avoid the OOM caused by one Velox task writing too many Hive partitions.
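For comparison, a rough sketch of achieving a similar effect manually, without the Kyuubi extension. The table and column names are placeholders, and spark.sql.adaptive.coalescePartitions.enabled is the plain Spark counterpart of the Kyuubi final-stage key shown above; DISTRIBUTE BY stands in for the RebalanceByColumn step.

// Manual equivalent (placeholder table/column names): distribute rows by the
// Hive partition column and keep AQE from coalescing the write stage.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

// Plain Spark counterpart of the Kyuubi final-stage setting above.
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "false")

// DISTRIBUTE BY the partition column so each task writes only a few Hive partitions.
spark.sql(
  """
    |INSERT OVERWRITE TABLE target_db.target_table PARTITION (dt)
    |SELECT * FROM source_db.source_table DISTRIBUTE BY dt
    |""".stripMargin)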