velox
velox copied to clipboard
Error "iter != spillPartitionSet_.end()" from probe side spilling
Bug description
The following error is reported using Velox's commit https://github.com/facebookincubator/velox/commit/70c9f6c7126725247bfb8193067b18e11897cb97
2024-04-01T07:03:21.9003801Z Executing SQL query from resource path /tpcds-queries/q67.sql...
2024-04-01T07:03:24.3875546Z W20240401 07:03:24.386922 8768 HashBuild.cpp:1030] Can't reclaim from hash build operator, state_[WAIT_FOR_PROBE], nonReclaimableSection_[0], spiller_[cleared] op.10.3.0.HashBuild, usage: 6.81MB
2024-04-01T07:03:24.6263103Z W20240401 07:03:24.625738 8768 HashProbe.cpp:1570] Can't reclaim from hash probe operator, state_[RUNNING], nonReclaimableSection_[0], op.10.0.0.HashProbe, usage: 240.75KB, node pool usage: 20.00MB
2024-04-01T07:03:27.1143938Z W20240401 07:03:27.113842 8768 HashBuild.cpp:1030] Can't reclaim from hash build operator, state_[WAIT_FOR_PROBE], nonReclaimableSection_[0], spiller_[cleared] op.10.3.0.HashBuild, usage: 3.00MB
2024-04-01T07:03:27.2642109Z W20240401 07:03:27.263653 8768 HashProbe.cpp:1570] Can't reclaim from hash probe operator, state_[FINISH], nonReclaimableSection_[0], op.6.0.0.HashProbe, usage: 944.38KB, node pool usage: 21.00MB
2024-04-01T07:03:31.0730661Z W20240401 07:03:31.072510 8773 HashBuild.cpp:1030] Can't reclaim from hash build operator, state_[WAIT_FOR_PROBE], nonReclaimableSection_[0], spiller_[cleared] op.10.3.0.HashBuild, usage: 6.81MB
2024-04-01T07:03:31.4003666Z W20240401 07:03:31.399794 8768 HashProbe.cpp:1570] Can't reclaim from hash probe operator, state_[RUNNING], nonReclaimableSection_[0], op.10.0.0.HashProbe, usage: 240.88KB, node pool usage: 20.00MB
2024-04-01T07:03:33.3298753Z W20240401 07:03:33.329300 8771 HashBuild.cpp:1030] Can't reclaim from hash build operator, state_[WAIT_FOR_PROBE], nonReclaimableSection_[0], spiller_[cleared] op.10.3.0.HashBuild, usage: 6.81MB
2024-04-01T07:03:34.8949359Z W20240401 07:03:34.894352 8768 HashProbe.cpp:1570] Can't reclaim from hash probe operator, state_[FINISH], nonReclaimableSection_[0], op.6.0.0.HashProbe, usage: 992.75KB, node pool usage: 21.00MB
2024-04-01T07:03:36.0343120Z W20240401 07:03:36.033732 8768 HashProbe.cpp:1570] Can't reclaim from hash probe operator, state_[FINISH], nonReclaimableSection_[0], op.6.0.0.HashProbe, usage: 992.75KB, node pool usage: 21.00MB
2024-04-01T07:03:36.4535174Z W20240401 07:03:36.452941 8764 HashBuild.cpp:1030] Can't reclaim from hash build operator, state_[WAIT_FOR_PROBE], nonReclaimableSection_[0], spiller_[cleared] op.10.3.0.HashBuild, usage: 6.81MB
2024-04-01T07:03:38.1457062Z W20240401 07:03:38.145025 8770 HashBuild.cpp:1030] Can't reclaim from hash build operator, state_[WAIT_FOR_PROBE], nonReclaimableSection_[0], spiller_[cleared] op.10.3.0.HashBuild, usage: 6.81MB
2024-04-01T07:03:38.6754924Z W20240401 07:03:38.674748 8766 HashProbe.cpp:1570] Can't reclaim from hash probe operator, state_[RUNNING], nonReclaimableSection_[0], op.10.0.0.HashProbe, usage: 241.00KB, node pool usage: 20.00MB
2024-04-01T07:03:42.1649884Z W20240401 07:03:42.164289 8766 HashProbe.cpp:1570] Can't reclaim from hash probe operator, state_[FINISH], nonReclaimableSection_[0], op.6.0.0.HashProbe, usage: 944.50KB, node pool usage: 21.00MB
2024-04-01T07:03:42.7976060Z W20240401 07:03:42.796917 8621 HashBuild.cpp:1030] Can't reclaim from hash build operator, state_[WAIT_FOR_PROBE], nonReclaimableSection_[0], spiller_[cleared] op.10.3.0.HashBuild, usage: 6.81MB
2024-04-01T07:03:43.5208389Z W20240401 07:03:43.520148 8766 HashProbe.cpp:1570] Can't reclaim from hash probe operator, state_[FINISH], nonReclaimableSection_[0], op.6.0.0.HashProbe, usage: 944.50KB, node pool usage: 21.00MB
2024-04-01T07:03:43.8806747Z W20240401 07:03:43.879967 8766 HashBuild.cpp:1030] Can't reclaim from hash build operator, state_[WAIT_FOR_PROBE], nonReclaimableSection_[0], spiller_[cleared] op.10.3.0.HashBuild, usage: 476.00KB
2024-04-01T07:03:45.0089599Z W20240401 07:03:45.008244 8766 HashBuild.cpp:1030] Can't reclaim from hash build operator, state_[WAIT_FOR_PROBE], nonReclaimableSection_[0], spiller_[cleared] op.10.3.0.HashBuild, usage: 3.00MB
2024-04-01T07:03:45.2284630Z W20240401 07:03:45.227731 8766 HashProbe.cpp:1570] Can't reclaim from hash probe operator, state_[FINISH], nonReclaimableSection_[0], op.6.0.0.HashProbe, usage: 944.50KB, node pool usage: 21.00MB
2024-04-01T07:03:45.3797499Z E20240401 07:03:45.378913 8766 Exceptions.h:69] Line: /__w/incubator-gluten/incubator-gluten/ep/build-velox/build/velox_ep/velox/exec/HashProbe.cpp:277, Function:maybeSetupSpillInputReader, Expression: iter != spillPartitionSet_.end() , Source: RUNTIME, ErrorCode: INVALID_STATE
2024-04-01T07:03:45.4701984Z 24/04/01 07:03:45 ERROR TaskResources: Task 48 failed by error:
2024-04-01T07:03:45.4703353Z org.apache.gluten.exception.GlutenException: java.lang.RuntimeException: Exception: VeloxRuntimeError
2024-04-01T07:03:45.4704939Z Error Source: RUNTIME
2024-04-01T07:03:45.4705549Z Error Code: INVALID_STATE
2024-04-01T07:03:45.4706104Z Retriable: False
2024-04-01T07:03:45.4706630Z Expression: iter != spillPartitionSet_.end()
2024-04-01T07:03:45.4707290Z Function: maybeSetupSpillInputReader
2024-04-01T07:03:45.4708655Z File: /__w/incubator-gluten/incubator-gluten/ep/build-velox/build/velox_ep/velox/exec/HashProbe.cpp
2024-04-01T07:03:45.4709643Z Line: 277
2024-04-01T07:03:45.4709980Z Stack trace:
2024-04-01T07:03:45.4710459Z # 0 _ZN8facebook5velox7process10StackTraceC1Ei
2024-04-01T07:03:45.4711627Z # 1 _ZN8facebook5velox14VeloxExceptionC1EPKcmS3_St17basic_string_viewIcSt11char_traitsIcEES7_S7_S7_bNS1_4TypeES7_
2024-04-01T07:03:45.4713439Z # 2 _ZN8facebook5velox6detail14veloxCheckFailINS0_17VeloxRuntimeErrorENS1_22CompileTimeEmptyStringEEEvRKNS1_18VeloxCheckFailArgsET0_
2024-04-01T07:03:45.4715426Z # 3 _ZN8facebook5velox4exec9HashProbe26maybeSetupSpillInputReaderERKSt8optionalINS1_16SpillPartitionIdEE
2024-04-01T07:03:45.4716807Z # 4 _ZN8facebook5velox4exec9HashProbe21asyncWaitForHashTableEv
2024-04-01T07:03:45.4717764Z # 5 _ZN8facebook5velox4exec9HashProbe17getOutputInternalEb
2024-04-01T07:03:45.4718595Z # 6 _ZN8facebook5velox4exec9HashProbe9getOutputEv
2024-04-01T07:03:45.4719774Z # 7 _ZN8facebook5velox4exec6Driver11runInternalERSt10shared_ptrIS2_ERS3_INS1_13BlockingStateEERS3_INS0_9RowVectorEE
2024-04-01T07:03:45.4721212Z # 8 _ZN8facebook5velox4exec6Driver4nextERSt10shared_ptrINS1_13BlockingStateEE
2024-04-01T07:03:45.4722319Z # 9 _ZN8facebook5velox4exec4Task4nextEPN5folly10SemiFutureINS3_4UnitEEE
2024-04-01T07:03:45.4723273Z # 10 _ZN6gluten24WholeStageResultIterator4nextEv
Perhaps it's related to probe-side spilling support? Since the backtrace includes the call to function maybeSetupSpillInputReader
which is added by the probe-side spilling PR.
@xiaoxmeng If you can get some information from the backtrace. Thanks.
System information
Velox System Info v0.0.2 Commit: eb2077d36b8dcf8380ef1e740cf6f51edb167744 CMake Version: 3.16.3 System: Linux-5.4.0-156-generic Arch: x86_64 C++ Compiler: /usr/bin/c++ C++ Compiler Version: 9.4.0 C Compiler: /usr/bin/cc C Compiler Version: 9.4.0 CMake Prefix Path: /usr/local;/usr;/;/usr;/usr/local;/usr/X11R6;/usr/pkg;/opt
Full log at https://github.com/apache/incubator-gluten/actions/runs/8504258236/job/23292003386?pr=5224 (Gluten)
Log with https://github.com/facebookincubator/velox/pull/9333
https://gist.github.com/zhztheplayer/580190d9078e895676b4c8454b7eadf5
@xiaoxmeng We started observing this error in meta internal queries as well (e.g. 20240401_015934_01820_gbew7)
Will be fixed by https://github.com/facebookincubator/velox/pull/9334
Verified by @zhztheplayer