velox icon indicating copy to clipboard operation
velox copied to clipboard

Error "iter != spillPartitionSet_.end()" from probe side spilling

Open zhztheplayer opened this issue 10 months ago • 3 comments

Bug description

The following error is reported using Velox's commit https://github.com/facebookincubator/velox/commit/70c9f6c7126725247bfb8193067b18e11897cb97

2024-04-01T07:03:21.9003801Z Executing SQL query from resource path /tpcds-queries/q67.sql...
2024-04-01T07:03:24.3875546Z W20240401 07:03:24.386922  8768 HashBuild.cpp:1030] Can't reclaim from hash build operator, state_[WAIT_FOR_PROBE], nonReclaimableSection_[0], spiller_[cleared] op.10.3.0.HashBuild, usage: 6.81MB
2024-04-01T07:03:24.6263103Z W20240401 07:03:24.625738  8768 HashProbe.cpp:1570] Can't reclaim from hash probe operator, state_[RUNNING], nonReclaimableSection_[0], op.10.0.0.HashProbe, usage: 240.75KB, node pool usage: 20.00MB
2024-04-01T07:03:27.1143938Z W20240401 07:03:27.113842  8768 HashBuild.cpp:1030] Can't reclaim from hash build operator, state_[WAIT_FOR_PROBE], nonReclaimableSection_[0], spiller_[cleared] op.10.3.0.HashBuild, usage: 3.00MB
2024-04-01T07:03:27.2642109Z W20240401 07:03:27.263653  8768 HashProbe.cpp:1570] Can't reclaim from hash probe operator, state_[FINISH], nonReclaimableSection_[0], op.6.0.0.HashProbe, usage: 944.38KB, node pool usage: 21.00MB
2024-04-01T07:03:31.0730661Z W20240401 07:03:31.072510  8773 HashBuild.cpp:1030] Can't reclaim from hash build operator, state_[WAIT_FOR_PROBE], nonReclaimableSection_[0], spiller_[cleared] op.10.3.0.HashBuild, usage: 6.81MB
2024-04-01T07:03:31.4003666Z W20240401 07:03:31.399794  8768 HashProbe.cpp:1570] Can't reclaim from hash probe operator, state_[RUNNING], nonReclaimableSection_[0], op.10.0.0.HashProbe, usage: 240.88KB, node pool usage: 20.00MB
2024-04-01T07:03:33.3298753Z W20240401 07:03:33.329300  8771 HashBuild.cpp:1030] Can't reclaim from hash build operator, state_[WAIT_FOR_PROBE], nonReclaimableSection_[0], spiller_[cleared] op.10.3.0.HashBuild, usage: 6.81MB
2024-04-01T07:03:34.8949359Z W20240401 07:03:34.894352  8768 HashProbe.cpp:1570] Can't reclaim from hash probe operator, state_[FINISH], nonReclaimableSection_[0], op.6.0.0.HashProbe, usage: 992.75KB, node pool usage: 21.00MB
2024-04-01T07:03:36.0343120Z W20240401 07:03:36.033732  8768 HashProbe.cpp:1570] Can't reclaim from hash probe operator, state_[FINISH], nonReclaimableSection_[0], op.6.0.0.HashProbe, usage: 992.75KB, node pool usage: 21.00MB
2024-04-01T07:03:36.4535174Z W20240401 07:03:36.452941  8764 HashBuild.cpp:1030] Can't reclaim from hash build operator, state_[WAIT_FOR_PROBE], nonReclaimableSection_[0], spiller_[cleared] op.10.3.0.HashBuild, usage: 6.81MB
2024-04-01T07:03:38.1457062Z W20240401 07:03:38.145025  8770 HashBuild.cpp:1030] Can't reclaim from hash build operator, state_[WAIT_FOR_PROBE], nonReclaimableSection_[0], spiller_[cleared] op.10.3.0.HashBuild, usage: 6.81MB
2024-04-01T07:03:38.6754924Z W20240401 07:03:38.674748  8766 HashProbe.cpp:1570] Can't reclaim from hash probe operator, state_[RUNNING], nonReclaimableSection_[0], op.10.0.0.HashProbe, usage: 241.00KB, node pool usage: 20.00MB
2024-04-01T07:03:42.1649884Z W20240401 07:03:42.164289  8766 HashProbe.cpp:1570] Can't reclaim from hash probe operator, state_[FINISH], nonReclaimableSection_[0], op.6.0.0.HashProbe, usage: 944.50KB, node pool usage: 21.00MB
2024-04-01T07:03:42.7976060Z W20240401 07:03:42.796917  8621 HashBuild.cpp:1030] Can't reclaim from hash build operator, state_[WAIT_FOR_PROBE], nonReclaimableSection_[0], spiller_[cleared] op.10.3.0.HashBuild, usage: 6.81MB
2024-04-01T07:03:43.5208389Z W20240401 07:03:43.520148  8766 HashProbe.cpp:1570] Can't reclaim from hash probe operator, state_[FINISH], nonReclaimableSection_[0], op.6.0.0.HashProbe, usage: 944.50KB, node pool usage: 21.00MB
2024-04-01T07:03:43.8806747Z W20240401 07:03:43.879967  8766 HashBuild.cpp:1030] Can't reclaim from hash build operator, state_[WAIT_FOR_PROBE], nonReclaimableSection_[0], spiller_[cleared] op.10.3.0.HashBuild, usage: 476.00KB
2024-04-01T07:03:45.0089599Z W20240401 07:03:45.008244  8766 HashBuild.cpp:1030] Can't reclaim from hash build operator, state_[WAIT_FOR_PROBE], nonReclaimableSection_[0], spiller_[cleared] op.10.3.0.HashBuild, usage: 3.00MB
2024-04-01T07:03:45.2284630Z W20240401 07:03:45.227731  8766 HashProbe.cpp:1570] Can't reclaim from hash probe operator, state_[FINISH], nonReclaimableSection_[0], op.6.0.0.HashProbe, usage: 944.50KB, node pool usage: 21.00MB
2024-04-01T07:03:45.3797499Z E20240401 07:03:45.378913  8766 Exceptions.h:69] Line: /__w/incubator-gluten/incubator-gluten/ep/build-velox/build/velox_ep/velox/exec/HashProbe.cpp:277, Function:maybeSetupSpillInputReader, Expression: iter != spillPartitionSet_.end() , Source: RUNTIME, ErrorCode: INVALID_STATE
2024-04-01T07:03:45.4701984Z 24/04/01 07:03:45 ERROR TaskResources: Task 48 failed by error: 
2024-04-01T07:03:45.4703353Z org.apache.gluten.exception.GlutenException: java.lang.RuntimeException: Exception: VeloxRuntimeError
2024-04-01T07:03:45.4704939Z Error Source: RUNTIME
2024-04-01T07:03:45.4705549Z Error Code: INVALID_STATE
2024-04-01T07:03:45.4706104Z Retriable: False
2024-04-01T07:03:45.4706630Z Expression: iter != spillPartitionSet_.end()
2024-04-01T07:03:45.4707290Z Function: maybeSetupSpillInputReader
2024-04-01T07:03:45.4708655Z File: /__w/incubator-gluten/incubator-gluten/ep/build-velox/build/velox_ep/velox/exec/HashProbe.cpp
2024-04-01T07:03:45.4709643Z Line: 277
2024-04-01T07:03:45.4709980Z Stack trace:
2024-04-01T07:03:45.4710459Z # 0  _ZN8facebook5velox7process10StackTraceC1Ei
2024-04-01T07:03:45.4711627Z # 1  _ZN8facebook5velox14VeloxExceptionC1EPKcmS3_St17basic_string_viewIcSt11char_traitsIcEES7_S7_S7_bNS1_4TypeES7_
2024-04-01T07:03:45.4713439Z # 2  _ZN8facebook5velox6detail14veloxCheckFailINS0_17VeloxRuntimeErrorENS1_22CompileTimeEmptyStringEEEvRKNS1_18VeloxCheckFailArgsET0_
2024-04-01T07:03:45.4715426Z # 3  _ZN8facebook5velox4exec9HashProbe26maybeSetupSpillInputReaderERKSt8optionalINS1_16SpillPartitionIdEE
2024-04-01T07:03:45.4716807Z # 4  _ZN8facebook5velox4exec9HashProbe21asyncWaitForHashTableEv
2024-04-01T07:03:45.4717764Z # 5  _ZN8facebook5velox4exec9HashProbe17getOutputInternalEb
2024-04-01T07:03:45.4718595Z # 6  _ZN8facebook5velox4exec9HashProbe9getOutputEv
2024-04-01T07:03:45.4719774Z # 7  _ZN8facebook5velox4exec6Driver11runInternalERSt10shared_ptrIS2_ERS3_INS1_13BlockingStateEERS3_INS0_9RowVectorEE
2024-04-01T07:03:45.4721212Z # 8  _ZN8facebook5velox4exec6Driver4nextERSt10shared_ptrINS1_13BlockingStateEE
2024-04-01T07:03:45.4722319Z # 9  _ZN8facebook5velox4exec4Task4nextEPN5folly10SemiFutureINS3_4UnitEEE
2024-04-01T07:03:45.4723273Z # 10 _ZN6gluten24WholeStageResultIterator4nextEv

Perhaps it's related to probe-side spilling support? Since the backtrace includes the call to function maybeSetupSpillInputReader which is added by the probe-side spilling PR.

@xiaoxmeng If you can get some information from the backtrace. Thanks.

System information

Velox System Info v0.0.2 Commit: eb2077d36b8dcf8380ef1e740cf6f51edb167744 CMake Version: 3.16.3 System: Linux-5.4.0-156-generic Arch: x86_64 C++ Compiler: /usr/bin/c++ C++ Compiler Version: 9.4.0 C Compiler: /usr/bin/cc C Compiler Version: 9.4.0 CMake Prefix Path: /usr/local;/usr;/;/usr;/usr/local;/usr/X11R6;/usr/pkg;/opt

Full log at https://github.com/apache/incubator-gluten/actions/runs/8504258236/job/23292003386?pr=5224 (Gluten)

zhztheplayer avatar Apr 01 '24 07:04 zhztheplayer

Log with https://github.com/facebookincubator/velox/pull/9333

https://gist.github.com/zhztheplayer/580190d9078e895676b4c8454b7eadf5

zhztheplayer avatar Apr 02 '24 05:04 zhztheplayer

@xiaoxmeng We started observing this error in meta internal queries as well (e.g. 20240401_015934_01820_gbew7)

amitkdutta avatar Apr 02 '24 08:04 amitkdutta

Will be fixed by https://github.com/facebookincubator/velox/pull/9334

zhztheplayer avatar Apr 03 '24 08:04 zhztheplayer

Verified by @zhztheplayer

xiaoxmeng avatar Apr 05 '24 06:04 xiaoxmeng