
[Bug] Ascend v0.7.2.post1: intermittent hang when benchmarking the serving API

Open llan-ml opened this issue 9 months ago • 15 comments

Checklist

  • [x] 1. I have searched related issues but cannot get the expected help.
  • [x] 2. The bug has not been fixed in the latest version.
  • [x] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.

Describe the bug

QwQ-32B is deployed on Ascend NPUs, using the officially built Ascend docker image.

The server is first warmed up with some short prompts, then benchmarked with 1k longer prompts. The 1k-prompt run is repeated several times; during these runs the server intermittently hangs and then reports errors.

Reproduction

Start the server:

lmdeploy serve api_server \
    --backend pytorch \
    --device ascend \
    --server-port 23333 \
    --tp 8 \
    --dtype bfloat16 \
    --chat-template qwen2d5 \
    --model-name QwQ-32B \
    /cache/hf_models/QwQ-32B/ 
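
For reference (an illustrative snippet, not part of the original report), the served endpoint can be sanity-checked with a single request before benchmarking, using the OpenAI-compatible route exposed by api_server:

# Illustrative sanity check: one request against the running api_server.
import requests

resp = requests.post(
    "http://localhost:23333/v1/chat/completions",
    json={
        "model": "QwQ-32B",
        "messages": [{"role": "user", "content": "Hello"}],
        "max_tokens": 32,
    },
    timeout=60,
)
print(resp.json())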

First, warm up:

MODEL_NAME="QwQ-32B"
TOKENIZER_PATH=/cache/hf_models/QwQ-32B/
SEED=42

for N in 1 64 256 512
do
    echo "warmup with N=${N}"
    python profile_restful_api.py \
        --host localhost \
        --port 23333 \
        --backend lmdeploy \
        --dataset-name random \
        --dataset-path /cache/ShareGPT_V3_unfiltered_cleaned_split.json \
        --random-input-len 256 \
        --random-output-len 128 \
        --random-range-ratio 0.5 \
        --model ${MODEL_NAME} \
        --tokenizer ${TOKENIZER_PATH} \
        --seed ${SEED} \
        --num-prompts ${N}
done

Then benchmark with 1k prompts, running it multiple times:

MODEL_NAME="QwQ-32B"
TOKENIZER_PATH=/cache/hf_models/QwQ-32B/
SEED=42
NUM_PROMPTS=1000

for i in `seq 1 10`
do
    python profile_restful_api.py \
        --host localhost \
        --port 23333 \
        --backend lmdeploy \
        --dataset-name random \
        --dataset-path /cache/ShareGPT_V3_unfiltered_cleaned_split.json \
        --random-input-len 1024 \
        --random-output-len 1024 \
        --random-range-ratio 0.5 \
        --model ${MODEL_NAME} \
        --tokenizer ${TOKENIZER_PATH} \
        --seed ${SEED} \
        --num-prompts ${NUM_PROMPTS}
done

Environment

sys.platform: linux
Python: 3.10.5 (main, Mar 24 2025, 07:28:13) [GCC 9.4.0]
CUDA available: False
MUSA available: False
numpy_random_seed: 2147483648
GCC: gcc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
PyTorch: 2.3.1
PyTorch compiling details: PyTorch built with:
  - GCC 10.2
  - C++ Version: 201703
  - Intel(R) MKL-DNN v3.3.6 (Git Hash 86e6af5974177e513fd3fee58425e1063e7f1361)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: NO AVX
  - Build settings: BLAS_INFO=open, BUILD_TYPE=Release, CXX_COMPILER=/opt/rh/devtoolset-10/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOCUPTI -DLIBKINETO_NOROCTRACER -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=open, TORCH_VERSION=2.3.1, USE_CUDA=OFF, USE_CUDNN=OFF, USE_CUSPARSELT=OFF, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_GLOO=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=OFF, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, USE_ROCM_KERNEL_ASSERT=OFF, 

TorchVision: 0.18.1
LMDeploy: 0.7.2.post1+
transformers: 4.50.0
gradio: Not Found
fastapi: 0.115.12
pydantic: 2.10.6
triton: Not Found

Error traceback

The server log reports the following errors; they only appear after quite a long wait, once the timeout is reached.

EI0002: [PID: 198418] 2025-03-26-13:18:05.447.186 The wait execution of the Notify register times out. Reason: The Notify register has not received the Notify record from remote rank: [unknown]. base information: [streamID:[125802984], taskID[9865], tag[Broadcast_172.16.38.219%eth0_60000_0_1742963775856197], AlgType(level 0-1-2):[fullmesh-H-D-ring].]. task information: [
there are(is) 1 abnormal device(s):
        Cluster Exception Location[IP/ID]:[172.16.38.219/6], Arrival Time:[Wed Mar 26 13:07:05 2025], ExceptionType:[Stuck Occurred], Possible Reason:1.Host process is stuck, 2.Device task is stuck
]. group information: [group:[172.16.38.219%eth0_60000_0_1742963775856197], user define information[], rankSize[8], rankId[5].]
        Possible Cause: 1. An exception occurs during the execution on some NPUs in the cluster. As a result, collective communication operation failed.2. The execution speed on some NPU in the cluster is too slow to complete a communication operation within the timeout interval. (default 1800s, You can set the interval by using HCCL_EXEC_TIMEOUT.)3. The number of training samples of each NPU is inconsistent.4. Packet loss or other connectivity problems occur on the communication link.
        Solution: 1. If this error is reported on part of these ranks, check other ranks to see whether other errors have been reported earlier.2. If this error is reported for all ranks, check whether the error reporting time is consistent (the maximum difference must not exceed 1800s). If not, locate the cause or adjust the locate the cause or set the HCCL_EXEC_TIMEOUT environment variable to a larger value.3. Check whether the completion queue element (CQE) of the error exists in the plog(grep -rn 'error cqe'). If so, check the network connection status. (For details, see the TLS command and HCCN connectivity check examples.)4. Ensure that the number of training samples of each NPU is consistent. For details:https://www.hiascend.com/document
        TraceBack (most recent call last):
        The error from device(chipId:5, dieId:0), serial number is 17, hccl fftsplus task timeout occurred during task execution, stream_id:7, sq_id:7, task_id:9865, stuck notify num:3, timeout:1836.[FUNC:ProcessStarsHcclFftsPlusTimeoutErrorInfo][FILE:device_error_proc.cc][LINE:1657]
        The 0 stuck notify wait context info:(context_id=13, notify_id=17).[FUNC:ProcessStarsHcclFftsPlusTimeoutErrorInfo][FILE:device_error_proc.cc][LINE:1664]
        The 1 stuck notify wait context info:(context_id=15, notify_id=19).[FUNC:ProcessStarsHcclFftsPlusTimeoutErrorInfo][FILE:device_error_proc.cc][LINE:1664]
        The 2 stuck notify wait context info:(context_id=17, notify_id=27).[FUNC:ProcessStarsHcclFftsPlusTimeoutErrorInfo][FILE:device_error_proc.cc][LINE:1664]
        The wait execution of the Notify register times out. Reason: The Notify register has not received the Notify record from remote rank: [unknown]. base information: [streamID:[125802984], taskID[9865], tag[Broadcast_172.16.38.219%eth0_60000_0_1742963775856197], AlgType(level 0-1-2):[fullmesh-H-D-ring].]. task information: [
there are(is) 1 abnormal device(s):
        Cluster Exception Location[IP/ID]:[172.16.38.219/6], Arrival Time:[Wed Mar 26 13:07:05 2025], ExceptionType:[Stuck Occurred], Possible Reason:1.Host process is stuck, 2.Device task is stuck
]. group information: [group:[172.16.38.219%eth0_60000_0_1742963775856197], user define information[], rankSize[8], rankId[5].]
        rtStreamSynchronize execute failed, reason=[fftsplus timeout][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:53]
        synchronize stream failed, runtime result = 507048[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161]

EI0002: [PID: 199085] 2025-03-26-13:18:05.474.973 The wait execution of the Notify register times out. Reason: The Notify register has not received the Notify record from remote rank: [unknown]. base information: [streamID:[882633192], taskID[9865], tag[Broadcast_172.16.38.219%eth0_60000_0_1742963775856197], AlgType(level 0-1-2):[fullmesh-H-D-ring].]. task information: [
there are(is) 1 abnormal device(s):
        Cluster Exception Location[IP/ID]:[172.16.38.219/6], Arrival Time:[Wed Mar 26 13:07:05 2025], ExceptionType:[Stuck Occurred], Possible Reason:1.Host process is stuck, 2.Device task is stuck
]. group information: [group:[172.16.38.219%eth0_60000_0_1742963775856197], user define information[], rankSize[8], rankId[7].]
        Possible Cause: 1. An exception occurs during the execution on some NPUs in the cluster. As a result, collective communication operation failed.2. The execution speed on some NPU in the cluster is too slow to complete a communication operation within the timeout interval. (default 1800s, You can set the interval by using HCCL_EXEC_TIMEOUT.)3. The number of training samples of each NPU is inconsistent.4. Packet loss or other connectivity problems occur on the communication link.
        Solution: 1. If this error is reported on part of these ranks, check other ranks to see whether other errors have been reported earlier.2. If this error is reported for all ranks, check whether the error reporting time is consistent (the maximum difference must not exceed 1800s). If not, locate the cause or adjust the locate the cause or set the HCCL_EXEC_TIMEOUT environment variable to a larger value.3. Check whether the completion queue element (CQE) of the error exists in the plog(grep -rn 'error cqe'). If so, check the network connection status. (For details, see the TLS command and HCCN connectivity check examples.)4. Ensure that the number of training samples of each NPU is consistent. For details:https://www.hiascend.com/document
        TraceBack (most recent call last):
        The error from device(chipId:7, dieId:0), serial number is 17, hccl fftsplus task timeout occurred during task execution, stream_id:7, sq_id:7, task_id:9865, stuck notify num:1, timeout:1836.[FUNC:ProcessStarsHcclFftsPlusTimeoutErrorInfo][FILE:device_error_proc.cc][LINE:1657]
        The 0 stuck notify wait context info:(context_id=2, notify_id=8).[FUNC:ProcessStarsHcclFftsPlusTimeoutErrorInfo][FILE:device_error_proc.cc][LINE:1664]
        The wait execution of the Notify register times out. Reason: The Notify register has not received the Notify record from remote rank: [unknown]. base information: [streamID:[882633192], taskID[9865], tag[Broadcast_172.16.38.219%eth0_60000_0_1742963775856197], AlgType(level 0-1-2):[fullmesh-H-D-ring].]. task information: [
there are(is) 1 abnormal device(s):
        Cluster Exception Location[IP/ID]:[172.16.38.219/6], Arrival Time:[Wed Mar 26 13:07:05 2025], ExceptionType:[Stuck Occurred], Possible Reason:1.Host process is stuck, 2.Device task is stuck
]. group information: [group:[172.16.38.219%eth0_60000_0_1742963775856197], user define information[], rankSize[8], rankId[7].]
        rtStreamSynchronize execute failed, reason=[fftsplus timeout][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:53]
        synchronize stream failed, runtime result = 507048[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161]
EI0002: [PID: 196972] 2025-03-26-13:18:05.501.624 The wait execution of the Notify register times out. Reason: The Notify register has not received the Notify record from remote rank: [unknown]. base information: [streamID:[763329000], taskID[9865], tag[Broadcast_172.16.38.219%eth0_60000_0_1742963775856197], AlgType(level 0-1-2):[fullmesh-H-D-ring].]. task information: [
there are(is) 1 abnormal device(s):
        Cluster Exception Location[IP/ID]:[172.16.38.219/6], Arrival Time:[Wed Mar 26 13:07:06 2025], ExceptionType:[Stuck Occurred], Possible Reason:1.Host process is stuck, 2.Device task is stuck
]. group information: [group:[172.16.38.219%eth0_60000_0_1742963775856197], user define information[], rankSize[8], rankId[0].]
        Possible Cause: 1. An exception occurs during the execution on some NPUs in the cluster. As a result, collective communication operation failed.2. The execution speed on some NPU in the cluster is too slow to complete a communication operation within the timeout interval. (default 1800s, You can set the interval by using HCCL_EXEC_TIMEOUT.)3. The number of training samples of each NPU is inconsistent.4. Packet loss or other connectivity problems occur on the communication link.
        Solution: 1. If this error is reported on part of these ranks, check other ranks to see whether other errors have been reported earlier.2. If this error is reported for all ranks, check whether the error reporting time is consistent (the maximum difference must not exceed 1800s). If not, locate the cause or adjust the locate the cause or set the HCCL_EXEC_TIMEOUT environment variable to a larger value.3. Check whether the completion queue element (CQE) of the error exists in the plog(grep -rn 'error cqe'). If so, check the network connection status. (For details, see the TLS command and HCCN connectivity check examples.)4. Ensure that the number of training samples of each NPU is consistent. For details:https://www.hiascend.com/document
        TraceBack (most recent call last):
        The error from device(chipId:0, dieId:0), serial number is 13, hccl fftsplus task timeout occurred during task execution, stream_id:11, sq_id:11, task_id:9865, stuck notify num:1, timeout:1836.[FUNC:ProcessStarsHcclFftsPlusTimeoutErrorInfo][FILE:device_error_proc.cc][LINE:1657]
        The 0 stuck notify wait context info:(context_id=12, notify_id=11).[FUNC:ProcessStarsHcclFftsPlusTimeoutErrorInfo][FILE:device_error_proc.cc][LINE:1664]
        The wait execution of the Notify register times out. Reason: The Notify register has not received the Notify record from remote rank: [unknown]. base information: [streamID:[763329000], taskID[9865], tag[Broadcast_172.16.38.219%eth0_60000_0_1742963775856197], AlgType(level 0-1-2):[fullmesh-H-D-ring].]. task information: [
there are(is) 1 abnormal device(s):
        Cluster Exception Location[IP/ID]:[172.16.38.219/6], Arrival Time:[Wed Mar 26 13:07:06 2025], ExceptionType:[Stuck Occurred], Possible Reason:1.Host process is stuck, 2.Device task is stuck
]. group information: [group:[172.16.38.219%eth0_60000_0_1742963775856197], user define information[], rankSize[8], rankId[0].]
        rtStreamSynchronize execute failed, reason=[fftsplus timeout][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:53]
        synchronize stream failed, runtime result = 507048[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161]

EI0002: [PID: 197625] 2025-03-26-13:18:05.532.566 The wait execution of the Notify register times out. Reason: The Notify register has not received the Notify record from remote rank: [unknown]. base information: [streamID:[587172328], taskID[9865], tag[Broadcast_172.16.38.219%eth0_60000_0_1742963775856197], AlgType(level 0-1-2):[fullmesh-H-D-ring].]. task information: [
there are(is) 1 abnormal device(s):
        Cluster Exception Location[IP/ID]:[172.16.38.219/6], Arrival Time:[Wed Mar 26 13:07:07 2025], ExceptionType:[Stuck Occurred], Possible Reason:1.Host process is stuck, 2.Device task is stuck
]. group information: [group:[172.16.38.219%eth0_60000_0_1742963775856197], user define information[], rankSize[8], rankId[3].]
        Possible Cause: 1. An exception occurs during the execution on some NPUs in the cluster. As a result, collective communication operation failed.2. The execution speed on some NPU in the cluster is too slow to complete a communication operation within the timeout interval. (default 1800s, You can set the interval by using HCCL_EXEC_TIMEOUT.)3. The number of training samples of each NPU is inconsistent.4. Packet loss or other connectivity problems occur on the communication link.
        Solution: 1. If this error is reported on part of these ranks, check other ranks to see whether other errors have been reported earlier.2. If this error is reported for all ranks, check whether the error reporting time is consistent (the maximum difference must not exceed 1800s). If not, locate the cause or adjust the locate the cause or set the HCCL_EXEC_TIMEOUT environment variable to a larger value.3. Check whether the completion queue element (CQE) of the error exists in the plog(grep -rn 'error cqe'). If so, check the network connection status. (For details, see the TLS command and HCCN connectivity check examples.)4. Ensure that the number of training samples of each NPU is consistent. For details:https://www.hiascend.com/document
        TraceBack (most recent call last):
        The error from device(chipId:3, dieId:0), serial number is 30, hccl fftsplus task timeout occurred during task execution, stream_id:7, sq_id:7, task_id:9865, stuck notify num:3, timeout:1836.[FUNC:ProcessStarsHcclFftsPlusTimeoutErrorInfo][FILE:device_error_proc.cc][LINE:1657]
        The 0 stuck notify wait context info:(context_id=9, notify_id=11).[FUNC:ProcessStarsHcclFftsPlusTimeoutErrorInfo][FILE:device_error_proc.cc][LINE:1664]
        The 1 stuck notify wait context info:(context_id=11, notify_id=23).[FUNC:ProcessStarsHcclFftsPlusTimeoutErrorInfo][FILE:device_error_proc.cc][LINE:1664]
        The 2 stuck notify wait context info:(context_id=13, notify_id=16).[FUNC:ProcessStarsHcclFftsPlusTimeoutErrorInfo][FILE:device_error_proc.cc][LINE:1664]
        The wait execution of the Notify register times out. Reason: The Notify register has not received the Notify record from remote rank: [unknown]. base information: [streamID:[587172328], taskID[9865], tag[Broadcast_172.16.38.219%eth0_60000_0_1742963775856197], AlgType(level 0-1-2):[fullmesh-H-D-ring].]. task information: [
there are(is) 1 abnormal device(s):
        Cluster Exception Location[IP/ID]:[172.16.38.219/6], Arrival Time:[Wed Mar 26 13:07:07 2025], ExceptionType:[Stuck Occurred], Possible Reason:1.Host process is stuck, 2.Device task is stuck
]. group information: [group:[172.16.38.219%eth0_60000_0_1742963775856197], user define information[], rankSize[8], rankId[3].]
        rtStreamSynchronize execute failed, reason=[fftsplus timeout][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:53]
        synchronize stream failed, runtime result = 507048[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161]

EI0002: [PID: 197144] 2025-03-26-13:18:06.069.316 The wait execution of the Notify register times out. Reason: The Notify register has not received the Notify record from remote rank: [unknown]. base information: [streamID:[4278159848], taskID[9865], tag[Broadcast_172.16.38.219%eth0_60000_0_1742963775856197], AlgType(level 0-1-2):[fullmesh-H-D-ring].]. task information: [
there are(is) 1 abnormal device(s):
        Cluster Exception Location[IP/ID]:[172.16.38.219/6], Arrival Time:[Wed Mar 26 13:07:06 2025], ExceptionType:[Stuck Occurred], Possible Reason:1.Host process is stuck, 2.Device task is stuck
]. group information: [group:[172.16.38.219%eth0_60000_0_1742963775856197], user define information[], rankSize[8], rankId[1].]
        Possible Cause: 1. An exception occurs during the execution on some NPUs in the cluster. As a result, collective communication operation failed.2. The execution speed on some NPU in the cluster is too slow to complete a communication operation within the timeout interval. (default 1800s, You can set the interval by using HCCL_EXEC_TIMEOUT.)3. The number of training samples of each NPU is inconsistent.4. Packet loss or other connectivity problems occur on the communication link.
        Solution: 1. If this error is reported on part of these ranks, check other ranks to see whether other errors have been reported earlier.2. If this error is reported for all ranks, check whether the error reporting time is consistent (the maximum difference must not exceed 1800s). If not, locate the cause or adjust the locate the cause or set the HCCL_EXEC_TIMEOUT environment variable to a larger value.3. Check whether the completion queue element (CQE) of the error exists in the plog(grep -rn 'error cqe'). If so, check the network connection status. (For details, see the TLS command and HCCN connectivity check examples.)4. Ensure that the number of training samples of each NPU is consistent. For details:https://www.hiascend.com/document
        TraceBack (most recent call last):
        The error from device(chipId:1, dieId:0), serial number is 29, hccl fftsplus task timeout occurred during task execution, stream_id:7, sq_id:7, task_id:9865, stuck notify num:3, timeout:1836.[FUNC:ProcessStarsHcclFftsPlusTimeoutErrorInfo][FILE:device_error_proc.cc][LINE:1657]
        The 0 stuck notify wait context info:(context_id=5, notify_id=1).[FUNC:ProcessStarsHcclFftsPlusTimeoutErrorInfo][FILE:device_error_proc.cc][LINE:1664]
        The 1 stuck notify wait context info:(context_id=7, notify_id=7).[FUNC:ProcessStarsHcclFftsPlusTimeoutErrorInfo][FILE:device_error_proc.cc][LINE:1664]
        The 2 stuck notify wait context info:(context_id=9, notify_id=13).[FUNC:ProcessStarsHcclFftsPlusTimeoutErrorInfo][FILE:device_error_proc.cc][LINE:1664]
        The wait execution of the Notify register times out. Reason: The Notify register has not received the Notify record from remote rank: [unknown]. base information: [streamID:[4278159848], taskID[9865], tag[Broadcast_172.16.38.219%eth0_60000_0_1742963775856197], AlgType(level 0-1-2):[fullmesh-H-D-ring].]. task information: [
there are(is) 1 abnormal device(s):
        Cluster Exception Location[IP/ID]:[172.16.38.219/6], Arrival Time:[Wed Mar 26 13:07:06 2025], ExceptionType:[Stuck Occurred], Possible Reason:1.Host process is stuck, 2.Device task is stuck
]. group information: [group:[172.16.38.219%eth0_60000_0_1742963775856197], user define information[], rankSize[8], rankId[1].]
        rtStreamSynchronize execute failed, reason=[fftsplus timeout][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:53]
        synchronize stream failed, runtime result = 507048[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161]

EI0002: [PID: 198083] 2025-03-26-13:18:06.072.050 The wait execution of the Notify register times out. Reason: The Notify register has not received the Notify record from remote rank: [unknown]. base information: [streamID:[511670760], taskID[9865], tag[Broadcast_172.16.38.219%eth0_60000_0_1742963775856197], AlgType(level 0-1-2):[fullmesh-H-D-ring].]. task information: [
there are(is) 1 abnormal device(s):
        Cluster Exception Location[IP/ID]:[172.16.38.219/6], Arrival Time:[Wed Mar 26 13:07:06 2025], ExceptionType:[Stuck Occurred], Possible Reason:1.Host process is stuck, 2.Device task is stuck
]. group information: [group:[172.16.38.219%eth0_60000_0_1742963775856197], user define information[], rankSize[8], rankId[4].]
        Possible Cause: 1. An exception occurs during the execution on some NPUs in the cluster. As a result, collective communication operation failed.2. The execution speed on some NPU in the cluster is too slow to complete a communication operation within the timeout interval. (default 1800s, You can set the interval by using HCCL_EXEC_TIMEOUT.)3. The number of training samples of each NPU is inconsistent.4. Packet loss or other connectivity problems occur on the communication link.
        Solution: 1. If this error is reported on part of these ranks, check other ranks to see whether other errors have been reported earlier.2. If this error is reported for all ranks, check whether the error reporting time is consistent (the maximum difference must not exceed 1800s). If not, locate the cause or adjust the locate the cause or set the HCCL_EXEC_TIMEOUT environment variable to a larger value.3. Check whether the completion queue element (CQE) of the error exists in the plog(grep -rn 'error cqe'). If so, check the network connection status. (For details, see the TLS command and HCCN connectivity check examples.)4. Ensure that the number of training samples of each NPU is consistent. For details:https://www.hiascend.com/document
        TraceBack (most recent call last):
        The error from device(chipId:4, dieId:0), serial number is 14, hccl fftsplus task timeout occurred during task execution, stream_id:7, sq_id:7, task_id:9865, stuck notify num:3, timeout:1836.[FUNC:ProcessStarsHcclFftsPlusTimeoutErrorInfo][FILE:device_error_proc.cc][LINE:1657]
        The 0 stuck notify wait context info:(context_id=11, notify_id=1).[FUNC:ProcessStarsHcclFftsPlusTimeoutErrorInfo][FILE:device_error_proc.cc][LINE:1664]
        The 1 stuck notify wait context info:(context_id=13, notify_id=7).[FUNC:ProcessStarsHcclFftsPlusTimeoutErrorInfo][FILE:device_error_proc.cc][LINE:1664]
        The 2 stuck notify wait context info:(context_id=15, notify_id=11).[FUNC:ProcessStarsHcclFftsPlusTimeoutErrorInfo][FILE:device_error_proc.cc][LINE:1664]
        The wait execution of the Notify register times out. Reason: The Notify register has not received the Notify record from remote rank: [unknown]. base information: [streamID:[511670760], taskID[9865], tag[Broadcast_172.16.38.219%eth0_60000_0_1742963775856197], AlgType(level 0-1-2):[fullmesh-H-D-ring].]. task information: [
there are(is) 1 abnormal device(s):
        Cluster Exception Location[IP/ID]:[172.16.38.219/6], Arrival Time:[Wed Mar 26 13:07:06 2025], ExceptionType:[Stuck Occurred], Possible Reason:1.Host process is stuck, 2.Device task is stuck
]. group information: [group:[172.16.38.219%eth0_60000_0_1742963775856197], user define information[], rankSize[8], rankId[4].]
        rtStreamSynchronize execute failed, reason=[fftsplus timeout][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:53]
        synchronize stream failed, runtime result = 507048[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161]

EI0002: [PID: 197247] 2025-03-26-13:18:06.126.352 The wait execution of the Notify register times out. Reason: The Notify register has not received the Notify record from remote rank: [unknown]. base information: [streamID:[614197736], taskID[9865], tag[Broadcast_172.16.38.219%eth0_60000_0_1742963775856197], AlgType(level 0-1-2):[fullmesh-H-D-ring].]. task information: [
there are(is) 1 abnormal device(s):
        Cluster Exception Location[IP/ID]:[172.16.38.219/6], Arrival Time:[Wed Mar 26 13:07:07 2025], ExceptionType:[Stuck Occurred], Possible Reason:1.Host process is stuck, 2.Device task is stuck
]. group information: [group:[172.16.38.219%eth0_60000_0_1742963775856197], user define information[], rankSize[8], rankId[2].]
        Possible Cause: 1. An exception occurs during the execution on some NPUs in the cluster. As a result, collective communication operation failed.2. The execution speed on some NPU in the cluster is too slow to complete a communication operation within the timeout interval. (default 1800s, You can set the interval by using HCCL_EXEC_TIMEOUT.)3. The number of training samples of each NPU is inconsistent.4. Packet loss or other connectivity problems occur on the communication link.
        Solution: 1. If this error is reported on part of these ranks, check other ranks to see whether other errors have been reported earlier.2. If this error is reported for all ranks, check whether the error reporting time is consistent (the maximum difference must not exceed 1800s). If not, locate the cause or adjust the locate the cause or set the HCCL_EXEC_TIMEOUT environment variable to a larger value.3. Check whether the completion queue element (CQE) of the error exists in the plog(grep -rn 'error cqe'). If so, check the network connection status. (For details, see the TLS command and HCCN connectivity check examples.)4. Ensure that the number of training samples of each NPU is consistent. For details:https://www.hiascend.com/document
        TraceBack (most recent call last):
        The error from device(chipId:2, dieId:0), serial number is 30, hccl fftsplus task timeout occurred during task execution, stream_id:7, sq_id:7, task_id:9865, stuck notify num:3, timeout:1836.[FUNC:ProcessStarsHcclFftsPlusTimeoutErrorInfo][FILE:device_error_proc.cc][LINE:1657]
        The 0 stuck notify wait context info:(context_id=7, notify_id=9).[FUNC:ProcessStarsHcclFftsPlusTimeoutErrorInfo][FILE:device_error_proc.cc][LINE:1664]
        The 1 stuck notify wait context info:(context_id=9, notify_id=18).[FUNC:ProcessStarsHcclFftsPlusTimeoutErrorInfo][FILE:device_error_proc.cc][LINE:1664]
        The 2 stuck notify wait context info:(context_id=11, notify_id=23).[FUNC:ProcessStarsHcclFftsPlusTimeoutErrorInfo][FILE:device_error_proc.cc][LINE:1664]
        The wait execution of the Notify register times out. Reason: The Notify register has not received the Notify record from remote rank: [unknown]. base information: [streamID:[614197736], taskID[9865], tag[Broadcast_172.16.38.219%eth0_60000_0_1742963775856197], AlgType(level 0-1-2):[fullmesh-H-D-ring].]. task information: [
there are(is) 1 abnormal device(s):
        Cluster Exception Location[IP/ID]:[172.16.38.219/6], Arrival Time:[Wed Mar 26 13:07:07 2025], ExceptionType:[Stuck Occurred], Possible Reason:1.Host process is stuck, 2.Device task is stuck
]. group information: [group:[172.16.38.219%eth0_60000_0_1742963775856197], user define information[], rankSize[8], rankId[2].]
        rtStreamSynchronize execute failed, reason=[fftsplus timeout][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:53]
        synchronize stream failed, runtime result = 507048[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161]

2025-03-26 13:18:06,621 - lmdeploy - ERROR - model_agent.py:470 - Task <ModelAgentLoop> failed
Traceback (most recent call last):
  File "/opt/lmdeploy/lmdeploy/pytorch/engine/model_agent.py", line 465, in _on_finish_callback
    task.result()
  File "/opt/lmdeploy/lmdeploy/pytorch/engine/model_agent.py", line 456, in _async_loop_background
    await self._async_step_background(**forward_inputs, )
  File "/opt/lmdeploy/lmdeploy/pytorch/engine/model_agent.py", line 396, in _async_step_background
    output = await self._async_model_forward(
  File "/opt/lmdeploy/lmdeploy/pytorch/engine/model_agent.py", line 289, in _async_model_forward
    ret = await __forward(inputs)
  File "/opt/lmdeploy/lmdeploy/pytorch/engine/model_agent.py", line 249, in __forward
    return await self.async_forward(inputs, swap_in_map=swap_in_map, swap_out_map=swap_out_map)
  File "/opt/lmdeploy/lmdeploy/pytorch/engine/model_agent.py", line 632, in async_forward
    output = self._forward_impl(inputs, swap_in_map=swap_in_map, swap_out_map=swap_out_map)
  File "/opt/lmdeploy/lmdeploy/pytorch/engine/model_agent.py", line 616, in _forward_impl
    output = model_forward(
  File "/usr/local/python3.10.5/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/opt/lmdeploy/lmdeploy/pytorch/engine/model_agent.py", line 57, in model_forward
    context = ctx_mgr.build_context(
  File "/opt/lmdeploy/lmdeploy/pytorch/model_inputs.py", line 438, in build_context
    return StepContext.new(
  File "/opt/lmdeploy/lmdeploy/pytorch/model_inputs.py", line 410, in new
    ret = get_backend().update_step_context(ret)
  File "/opt/lmdeploy/lmdeploy/pytorch/backends/dlinfer/ascend/op_backend.py", line 117, in update_step_context
    q_seqlens_list = step_context.q_seqlens.tolist()
RuntimeError: ACL stream synchronize failed, error code:507048
[the identical <ModelAgentLoop> traceback is repeated for the remaining worker processes]
2025-03-26 13:18:06,623 - lmdeploy - ERROR - mp_executor.py:276 - Received custom termination signal from sub processing, exiting...
Process ExecutorProc-0:
Traceback (most recent call last):
  File "/usr/local/python3.10.5/lib/python3.10/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/usr/local/python3.10.5/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/lmdeploy/lmdeploy/pytorch/engine/executor/mp_executor.py", line 538, in _main_loop
    worker.release()
  File "/opt/lmdeploy/lmdeploy/pytorch/engine/executor/base_worker.py", line 161, in release
    self.model_agent.release()
  File "/opt/lmdeploy/lmdeploy/pytorch/engine/model_agent.py", line 654, in release
    torch.cuda.empty_cache()
  File "/usr/local/python3.10.5/lib/python3.10/site-packages/torch_npu/npu/memory.py", line 144, in empty_cache
    torch_npu._C._npu_emptyCache()
RuntimeError: npuSynchronizeDevice:build/CMakeFiles/torch_npu.dir/compiler_depend.ts:380 NPU function error: aclrtSynchronizeDevice, error code is 507048
[ERROR] 2025-03-26-13:18:07 (PID:196972, Device:0, RankID:0) ERR00100 PTA call acl api failed
[Error]: The execution of the internal task times out. 
        Rectify the fault based on the error information in the ascend log.
EH9999: Inner Error!
        rtDeviceSynchronize execute failed, reason=[fftsplus timeout][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:53]
EH9999: [PID: 196972] 2025-03-26-13:18:07.157.948 wait for compute device to finish failed, runtime result = 507048.[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161]
        TraceBack (most recent call last):

Process ExecutorProc-6:
Traceback (most recent call last):
  File "/usr/local/python3.10.5/lib/python3.10/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/usr/local/python3.10.5/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/lmdeploy/lmdeploy/pytorch/engine/executor/mp_executor.py", line 538, in _main_loop
    worker.release()
  File "/opt/lmdeploy/lmdeploy/pytorch/engine/executor/base_worker.py", line 161, in release
    self.model_agent.release()
  File "/opt/lmdeploy/lmdeploy/pytorch/engine/model_agent.py", line 654, in release
    torch.cuda.empty_cache()
  File "/usr/local/python3.10.5/lib/python3.10/site-packages/torch_npu/npu/memory.py", line 144, in empty_cache
    torch_npu._C._npu_emptyCache()
RuntimeError: npuSynchronizeDevice:build/CMakeFiles/torch_npu.dir/compiler_depend.ts:380 NPU function error: aclrtSynchronizeDevice, error code is 507048
[ERROR] 2025-03-26-13:49:01 (PID:198748, Device:6, RankID:6) ERR00100 PTA call acl api failed
[Error]: The execution of the internal task times out.
        Rectify the fault based on the error information in the ascend log.
EI0002: [PID: 198748] 2025-03-26-13:49:01.670.614 The wait execution of the Notify register times out. Reason: The Notify register has not received the Notify record from remote rank: [unknown]. base information: [streamID:[293579240], taskID[9865], tag[Broadcast_172.16.38.219%eth0_60000_0_1742963775856197], AlgType(level 0-1-2):[fullmesh-H-D-ring].]. task information: []. group information: [group:[172.16.38.219%eth0_60000_0_1742963775856197], user define information[], rankSize[8], rankId[6].]
        Possible Cause: 1. An exception occurs during the execution on some NPUs in the cluster. As a result, collective communication operation failed.2. The execution speed on some NPU in the cluster is too slow to complete a communication operation within the timeout interval. (default 1800s, You can set the interval by using HCCL_EXEC_TIMEOUT.)3. The number of training samples of each NPU is inconsistent.4. Packet loss or other connectivity problems occur on the communication link.
        Solution: 1. If this error is reported on part of these ranks, check other ranks to see whether other errors have been reported earlier.2. If this error is reported for all ranks, check whether the error reporting time is consistent (the maximum difference must not exceed 1800s). If not, locate the cause or adjust the locate the cause or set the HCCL_EXEC_TIMEOUT environment variable to a larger value.3. Check whether the completion queue element (CQE) of the error exists in the plog(grep -rn 'error cqe'). If so, check the network connection status. (For details, see the TLS command and HCCN connectivity check examples.)4. Ensure that the number of training samples of each NPU is consistent. For details:https://www.hiascend.com/document
        TraceBack (most recent call last):
        The error from device(chipId:6, dieId:0), serial number is 16, hccl fftsplus task timeout occurred during task execution, stream_id:7, sq_id:7, task_id:9865, stuck notify num:1, timeout:1836.[FUNC:ProcessStarsHcclFftsPlusTimeoutErrorInfo][FILE:device_error_proc.cc][LINE:1657]
        The 0 stuck notify wait context info:(context_id=2, notify_id=9).[FUNC:ProcessStarsHcclFftsPlusTimeoutErrorInfo][FILE:device_error_proc.cc][LINE:1664]
        The wait execution of the Notify register times out. Reason: The Notify register has not received the Notify record from remote rank: [unknown]. base information: [streamID:[293579240], taskID[9865], tag[Broadcast_172.16.38.219%eth0_60000_0_1742963775856197], AlgType(level 0-1-2):[fullmesh-H-D-ring].]. task information: []. group information: [group:[172.16.38.219%eth0_60000_0_1742963775856197], user define information[], rankSize[8], rankId[6].]
        rtDeviceSynchronize execute failed, reason=[fftsplus timeout][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:53]
        wait for compute device to finish failed, runtime result = 507048.[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161]

llan-ml avatar Mar 27 '25 03:03 llan-ml

Got it, working on reproducing it...

jinminxi104 avatar Mar 27 '25 06:03 jinminxi104

I met the same problem with Qwen2.5-VL 72B on H800.

Minorant avatar Mar 30 '25 06:03 Minorant

Same here: deploying InternVL3_38B on 910B2 also hangs intermittently.

beardog6 avatar Apr 22 '25 07:04 beardog6

Reproduction is done... working on a fix; it got delayed recently.

jinminxi104 avatar Apr 23 '25 11:04 jinminxi104

fixed by https://github.com/InternLM/lmdeploy/pull/3513

jinminxi104 avatar Apr 30 '25 16:04 jinminxi104

After upgrading past this PR, the offline multi-card Ascend inference pipeline still hangs. Has anyone else run into hangs with offline inference?

I'm using lmdeploy 0.8.0 (commit 13b2b5c74ec1d80ec26ee4b8bbcdaec87f406f6c) and dlinfer 0.1.8 (commit cf7b6e362c7d13f26be42708fb690cb4354b2eef).

The offline setup: create one pipeline (internlm2.5-7b-chat, tp=2, Ascend 910B; both eager mode and graph mode hang), then feed 200 prompts through it one by one, each prompt twice. During the 200-prompt run it hangs with fairly high probability, and after HCCL_EXEC_TIMEOUT it reports the same ACL stream synchronize error 507048.
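
A minimal sketch of the offline reproduction described above (illustrative only; the model path and prompts are placeholders, and it assumes lmdeploy's pipeline API with PytorchEngineConfig on the Ascend backend):

# Sketch of the offline multi-card reproduction (placeholder model path and prompts).
from lmdeploy import pipeline, PytorchEngineConfig

pipe = pipeline(
    "/path/to/internlm2_5-7b-chat",            # placeholder local model path
    backend_config=PytorchEngineConfig(
        device_type="ascend",
        tp=2,
        eager_mode=True,                        # both eager and graph mode reportedly hang
    ),
)

prompts = [f"test prompt {i}" for i in range(200)]   # placeholder prompts
for prompt in prompts:
    for _ in range(2):                          # each prompt is sent twice
        print(pipe([prompt]), flush=True)       # flush so progress is visible before a hang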

poorpool avatar May 07 '25 08:05 poorpool

After upgrading past this PR, the offline multi-card Ascend inference pipeline still hangs. Has anyone else run into hangs with offline inference?

I'm using lmdeploy 0.8.0 (commit 13b2b5c74ec1d80ec26ee4b8bbcdaec87f406f6c) and dlinfer 0.1.8 (commit cf7b6e362c7d13f26be42708fb690cb4354b2eef).

The offline setup: create one pipeline (internlm2.5-7b-chat, tp=2, Ascend 910B; both eager mode and graph mode hang), then feed 200 prompts through it one by one, each prompt twice. During the 200-prompt run it hangs with fairly high probability, and after HCCL_EXEC_TIMEOUT it reports the same ACL stream synchronize error 507048.

Try upgrading CANN to 8.1.beta1 or later, including the kernel packages, and mainly use graph mode. If you still have problems, please provide a complete reproduction and we will take another look.

jinminxi104 avatar May 08 '25 15:05 jinminxi104

Try upgrading CANN to 8.1.beta1 or later, including the kernel packages, and mainly use graph mode. If you still have problems, please provide a complete reproduction and we will take another look.

Understood, thanks. I'm currently on the company's shared CANN 8.0.0; I'll find a way to upgrade in a few days.

poorpool avatar May 08 '25 15:05 poorpool

After upgrading past this PR, the offline multi-card Ascend inference pipeline still hangs. Has anyone else run into hangs with offline inference? I'm using lmdeploy 0.8.0 (commit 13b2b5c74ec1d80ec26ee4b8bbcdaec87f406f6c) and dlinfer 0.1.8 (commit cf7b6e362c7d13f26be42708fb690cb4354b2eef). The offline setup: create one pipeline (internlm2.5-7b-chat, tp=2, Ascend 910B; both eager mode and graph mode hang), then feed 200 prompts through it one by one, each prompt twice. During the 200-prompt run it hangs with fairly high probability, and after HCCL_EXEC_TIMEOUT it reports the same ACL stream synchronize error 507048.

Try upgrading CANN to 8.1.beta1 or later, including the kernel packages, and mainly use graph mode. If you still have problems, please provide a complete reproduction and we will take another look.

On a 910B1 with driver version 24.1.rc2.b030, using the official lmdeploy docker image crpi-4crprmm5baj1v8iv.cn-hangzhou.personal.cr.aliyuncs.com/lmdeploy_dlinfer/ascend:910b-latest, deploying internvl_38b hangs right away.

linuxmi avatar May 08 '25 15:05 linuxmi

After upgrading past this PR, the offline multi-card Ascend inference pipeline still hangs. Has anyone else run into hangs with offline inference? I'm using lmdeploy 0.8.0 (commit 13b2b5c74ec1d80ec26ee4b8bbcdaec87f406f6c) and dlinfer 0.1.8 (commit cf7b6e362c7d13f26be42708fb690cb4354b2eef). The offline setup: create one pipeline (internlm2.5-7b-chat, tp=2, Ascend 910B; both eager mode and graph mode hang), then feed 200 prompts through it one by one, each prompt twice. During the 200-prompt run it hangs with fairly high probability, and after HCCL_EXEC_TIMEOUT it reports the same ACL stream synchronize error 507048.

Try upgrading CANN to 8.1.beta1 or later, including the kernel packages, and mainly use graph mode. If you still have problems, please provide a complete reproduction and we will take another look.

On a 910B1 with driver version 24.1.rc2.b030, using the official lmdeploy docker image crpi-4crprmm5baj1v8iv.cn-hangzhou.personal.cr.aliyuncs.com/lmdeploy_dlinfer/ascend:910b-latest, deploying internvl_38b hangs right away.

Do you have concrete reproduction code? Multi-card runs on Ascend cannot exit cleanly; for the offline case, add flush=True when printing results. Failing to exit afterwards, or exiting with an error, is expected behavior.

jinminxi104 avatar May 08 '25 15:05 jinminxi104

Try upgrading CANN to 8.1.beta1 or later, including the kernel packages, and mainly use graph mode. If you still have problems, please provide a complete reproduction and we will take another look.

Understood, thanks. I'm currently on the company's shared CANN 8.0.0; I'll find a way to upgrade in a few days.

Try this:

[screenshot]

I tried it:

[screenshot]

It ran to completion.

jinminxi104 avatar May 09 '25 06:05 jinminxi104

Try upgrading CANN to 8.1.beta1 or later, including the kernel packages, and mainly use graph mode. If you still have problems, please provide a complete reproduction and we will take another look.

Understood, thanks. I'm currently on the company's shared CANN 8.0.0; I'll find a way to upgrade in a few days.

Try this:

[screenshot]

I tried it:

[screenshot]

It ran to completion.

Was that run inside docker? I ran it with the docker image crpi-4crprmm5baj1v8iv.cn-hangzhou.personal.cr.aliyuncs.com/lmdeploy_dlinfer/ascend:910b-latest, driver version 24.1.rc2.b030, on a 910B1, and it hangs after about 8 requests. The docker launch command is:

docker run -it -u root --entrypoint=/bin/bash \
    --shm-size 32g \
    --privileged \
    --device=/dev/davinci_manager \
    --device=/dev/devmm_svm \
    --device=/dev/hisi_hdc \
    -v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
    -v /usr/local/dcmi:/usr/local/dcmi \
    -v /usr/local/Ascend/toolbox:/usr/local/Ascend/toolbox \
    -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
    -v /usr/local/Ascend/firmware:/usr/local/Ascend/firmware \
    -v /usr/local/sbin/npu-smi:/usr/local/sbin/npu-smi \
    -v /nvme2:/nvme2 \
    -v /nvme3:/nvme3 \
    --env ASCEND_RT_VISIBLE_DEVICES=2,3 \
    --env lmdy__cache_max_entry_count=0.8 \
    --env lmdy__server_port=23333 \
    --env lmdy__tp=2 \
    --env lmdy__device=ascend \
    --env lmdy__backend=pytorch \
    --env lmdy__model_name=taichuvl \
    --env lmdy__reasoning_parser=deepseek-r1 \
    --name lmd_v72_no_secret_online \
    crpi-4crprmm5baj1v8iv.cn-hangzhou.personal.cr.aliyuncs.com/lmdeploy_dlinfer/ascend:910b-latest

[screenshot]

[screenshot]

linuxmi avatar May 09 '25 10:05 linuxmi

Try upgrading CANN to 8.1.beta1 or later, including the kernel packages, and mainly use graph mode. If you still have problems, please provide a complete reproduction and we will take another look.

Understood, thanks. I'm currently on the company's shared CANN 8.0.0; I'll find a way to upgrade in a few days.

Try this: [screenshot] I tried it: [screenshot] It ran to completion.

Was that run inside docker? I ran it with the docker image crpi-4crprmm5baj1v8iv.cn-hangzhou.personal.cr.aliyuncs.com/lmdeploy_dlinfer/ascend:910b-latest, driver version 24.1.rc2.b030, on a 910B1, and it hangs after about 8 requests. The docker launch command is: docker run -it -u root --entrypoint=/bin/bash --shm-size 32g --privileged --device=/dev/davinci_manager --device=/dev/devmm_svm --device=/dev/hisi_hdc -v /usr/local/Ascend/driver:/usr/local/Ascend/driver -v /usr/local/dcmi:/usr/local/dcmi -v /usr/local/Ascend/toolbox:/usr/local/Ascend/toolbox -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi -v /usr/local/Ascend/firmware:/usr/local/Ascend/firmware -v /usr/local/sbin/npu-smi:/usr/local/sbin/npu-smi -v /nvme2:/nvme2 -v /nvme3:/nvme3 --env ASCEND_RT_VISIBLE_DEVICES=2,3 --env lmdy__cache_max_entry_count=0.8 --env lmdy__server_port=23333 --env lmdy__tp=2 --env lmdy__device=ascend --env lmdy__backend=pytorch --env lmdy__model_name=taichuvl --env lmdy__reasoning_parser=deepseek-r1 --name lmd_v72_no_secret_online crpi-4crprmm5baj1v8iv.cn-hangzhou.personal.cr.aliyuncs.com/lmdeploy_dlinfer/ascend:910b-latest

[screenshot]

[screenshot]

Yes, I ran it inside docker. If yours is a B1, is it an Intel CPU machine? Mine runs on a Kunpeng CPU. For Intel you may need to adapt the Dockerfile yourself; we don't have any Intel CPU machines...

jinminxi104 avatar May 09 '25 11:05 jinminxi104

After upgrading past this PR, the offline multi-card Ascend inference pipeline still hangs. Has anyone else run into hangs with offline inference? I'm using lmdeploy 0.8.0 (commit 13b2b5c74ec1d80ec26ee4b8bbcdaec87f406f6c) and dlinfer 0.1.8 (commit cf7b6e362c7d13f26be42708fb690cb4354b2eef). The offline setup: create one pipeline (internlm2.5-7b-chat, tp=2, Ascend 910B; both eager mode and graph mode hang), then feed 200 prompts through it one by one, each prompt twice. During the 200-prompt run it hangs with fairly high probability, and after HCCL_EXEC_TIMEOUT it reports the same ACL stream synchronize error 507048.

Try upgrading CANN to 8.1.beta1 or later, including the kernel packages, and mainly use graph mode. If you still have problems, please provide a complete reproduction and we will take another look.

On a 910B1 with driver version 24.1.rc2.b030, using the official lmdeploy docker image crpi-4crprmm5baj1v8iv.cn-hangzhou.personal.cr.aliyuncs.com/lmdeploy_dlinfer/ascend:910b-latest, deploying internvl_38b hangs right away.

[screenshot]

My guess is that your machine has an Intel CPU, so the image does not apply.

jinminxi104 avatar May 09 '25 16:05 jinminxi104

The multi-card hang on Ascend is still not fully fixed. #3513 fixed the graph-mode bug, but the multi-card hang persists in both eager mode and graph mode. In a CANN 8.1.beta1 environment I tested qwen2.5-3b: both eager mode and graph mode hang with high probability. With a single card, both eager and graph mode work fine.

python -m lmdeploy serve api_server qwen2.5-3b --backend pytorch --device ascend --tp 2

zzhaowendao avatar May 16 '25 03:05 zzhaowendao

The multi-card hang on Ascend is still not fully fixed. #3513 fixed the graph-mode bug, but the multi-card hang persists in both eager mode and graph mode. In a CANN 8.1.beta1 environment I tested qwen2.5-3b: both eager mode and graph mode hang with high probability. With a single card, both eager and graph mode work fine.

python -m lmdeploy serve api_server qwen2.5-3b --backend pytorch --device ascend --tp 2

Our preliminary finding is that launching with ray works around this problem. We are still stress-testing it, but feel free to try. For single-node multi-card:

  1. Start ray: export RAY_EXPERIMENTAL_NOSET_ASCEND_RT_VISIBLE_DEVICES=1 && ray start --head --port=8989
  2. Launch the multi-card Ascend run through ray: export ASCEND_RANK_TABLE_FILE_PATH=xxx && export LMDEPLOY_EXECUTOR_BACKEND=ray

For the rank table format, refer to https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/82RC1alpha002/hccl/hcclug/hcclug_000014.html

{ "status":"completed", "version":"1.0", "server_count":"1", "server_list": [ { "server_id":"10.20.30.40", "device":[ { "device_id":"0", "device_ip":"100.1.2.3", "rank_id":"0" }, { "device_id":"1", "device_ip":"100.1.2.4", "rank_id":"1" } ] } ] }

jinminxi104 avatar May 30 '25 02:05 jinminxi104

Sometimes we also see a card dropping out: in multi-card runs, the memory usage on one card suddenly drops to nothing...

jinminxi104 avatar May 30 '25 02:05 jinminxi104

The multi-card hang on Ascend is still not fully fixed. #3513 fixed the graph-mode bug, but the multi-card hang persists in both eager mode and graph mode. In a CANN 8.1.beta1 environment I tested qwen2.5-3b: both eager mode and graph mode hang with high probability. With a single card, both eager and graph mode work fine. python -m lmdeploy serve api_server qwen2.5-3b --backend pytorch --device ascend --tp 2

Our preliminary finding is that launching with ray works around this problem. We are still stress-testing it, but feel free to try. For single-node multi-card:

  1. Start ray: export RAY_EXPERIMENTAL_NOSET_ASCEND_RT_VISIBLE_DEVICES=1 && ray start --head --port=8989
  2. Launch the multi-card Ascend run through ray: export ASCEND_RANK_TABLE_FILE_PATH=xxx && export LMDEPLOY_EXECUTOR_BACKEND=ray

For the rank table format, refer to https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/82RC1alpha002/hccl/hcclug/hcclug_000014.html

{ "status":"completed", "version":"1.0", "server_count":"1", "server_list": [ { "server_id":"10.20.30.40", "device":[ { "device_id":"0", "device_ip":"100.1.2.3", "rank_id":"0" }, { "device_id":"1", "device_ip":"100.1.2.4", "rank_id":"1" } ] } ] }

This approach also works on 310P devices. ⚠️ For single-node multi-card on a 310P machine, the rank table can omit device_ip. Multi-node 310P has not been tested yet; in theory it should work, so feel free to try.

JackWeiw avatar Jun 03 '25 07:06 JackWeiw

@JackWeiw Following your method, I tested in a 310P single-node multi-card environment, with the results below:

Environment: Ascend 300V Pro, dual card; CANN 8.1.RC1; dlinfer after https://github.com/DeepLink-org/dlinfer/pull/219, with the patches from https://github.com/DeepLink-org/dlinfer/pull/225 and #227 applied; latest lmdeploy.

export LMDEPLOY_EXECUTOR_BACKEND=ray
export ASCEND_RANK_TABLE_FILE_PATH=ranktable.json
python -m lmdeploy serve api_server qwen3-8b --backend pytorch --device ascend --tp 2

The server starts with no errors. Once a chat starts, the terminal reports: (RayWorkerWrapper pid=1458839) [2025-06-05 17:19:04.152] [dicp] [error] [model.cpp:302] op command execute node[9] fail, error code: 0 (RayWorkerWrapper pid=1458839) please set DICP_USE_TORCH_NPU_LAUNCHER=0 to avoid this error. In addition, every output token in the chat is garbled.

In the same environment, tp=1 chats normally: python -m lmdeploy serve api_server qwen3-8b --backend pytorch --device ascend --tp 1

zzhaowendao avatar Jun 05 '25 09:06 zzhaowendao

@JackWeiw Following your method, I tested in a 310P single-node multi-card environment, with the results below:

Environment: Ascend 300V Pro, dual card; CANN 8.1.RC1; dlinfer after DeepLink-org/dlinfer#219, with the patches from DeepLink-org/dlinfer#225 and #227 applied; latest lmdeploy.

export LMDEPLOY_EXECUTOR_BACKEND=ray
export ASCEND_RANK_TABLE_FILE_PATH=ranktable.json
python -m lmdeploy serve api_server qwen3-8b --backend pytorch --device ascend --tp 2

The server starts with no errors. Once a chat starts, the terminal reports: (RayWorkerWrapper pid=1458839) [2025-06-05 17:19:04.152] [dicp] [error] [model.cpp:302] op command execute node[9] fail, error code: 0 (RayWorkerWrapper pid=1458839) please set DICP_USE_TORCH_NPU_LAUNCHER=0 to avoid this error. In addition, every output token in the chat is garbled.

In the same environment, tp=1 chats normally: python -m lmdeploy serve api_server qwen3-8b --backend pytorch --device ascend --tp 1

Our test environment is a 300I Duo; I'm not sure whether this is a chip difference. Your problem is most likely in allreduce. Try enabling the three logs below to see which operator is failing:
export ATB_LOG_TO_STDOUT=1
export ATB_LOG_LEVEL=INFO
export DICP_LOG_LEVEL=INFO

jinminxi104 avatar Jun 06 '25 05:06 jinminxi104

@JackWeiw Following your method, I tested in a 310P single-node multi-card environment, with the results below:

Environment: Ascend 300V Pro, dual card; CANN 8.1.RC1; dlinfer after DeepLink-org/dlinfer#219, with the patches from DeepLink-org/dlinfer#225 and #227 applied; latest lmdeploy.

export LMDEPLOY_EXECUTOR_BACKEND=ray
export ASCEND_RANK_TABLE_FILE_PATH=ranktable.json
python -m lmdeploy serve api_server qwen3-8b --backend pytorch --device ascend --tp 2

The server starts with no errors. Once a chat starts, the terminal reports: (RayWorkerWrapper pid=1458839) [2025-06-05 17:19:04.152] [dicp] [error] [model.cpp:302] op command execute node[9] fail, error code: 0 (RayWorkerWrapper pid=1458839) please set DICP_USE_TORCH_NPU_LAUNCHER=0 to avoid this error. In addition, every output token in the chat is garbled.

In the same environment, tp=1 chats normally: python -m lmdeploy serve api_server qwen3-8b --backend pytorch --device ascend --tp 1

https://github.com/DeepLink-org/dlinfer/pull/227 https://github.com/InternLM/lmdeploy/commit/1384a12dddf00794c86916ab144a3331bd723cc8 We recommend applying the dlinfer patch and the LMDeploy commit above and using the CANN software stack shown below; with that setup, our tests enable Ray multi-card inference smoothly. The Qwen and InternLM language-model series have been tested and are stable; VL models have not been tested yet.

[screenshot]

JackWeiw avatar Jun 06 '25 06:06 JackWeiw

@jinminxi104 I enabled the atb and dicp logs; they print a pile of output, and the error is the same as before. Most likely it is a chip issue; I'll switch to an I Duo to test. @JackWeiw My CANN version is 8.1.RC1, with the matching kernel and nnal packages installed; that should be fine, right?

zzhaowendao avatar Jun 06 '25 08:06 zzhaowendao

The multi-card hang on Ascend is still not fully fixed. #3513 fixed the graph-mode bug, but the multi-card hang persists in both eager mode and graph mode. In a CANN 8.1.beta1 environment I tested qwen2.5-3b: both eager mode and graph mode hang with high probability. With a single card, both eager and graph mode work fine. python -m lmdeploy serve api_server qwen2.5-3b --backend pytorch --device ascend --tp 2

Our preliminary finding is that launching with ray works around this problem. We are still stress-testing it, but feel free to try. For single-node multi-card:

  1. Start ray: export RAY_EXPERIMENTAL_NOSET_ASCEND_RT_VISIBLE_DEVICES=1 && ray start --head --port=8989
  2. Launch the multi-card Ascend run through ray: export ASCEND_RANK_TABLE_FILE_PATH=xxx && export LMDEPLOY_EXECUTOR_BACKEND=ray

For the rank table format, refer to https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/82RC1alpha002/hccl/hcclug/hcclug_000014.html

{ "status":"completed", "version":"1.0", "server_count":"1", "server_list": [ { "server_id":"10.20.30.40", "device":[ { "device_id":"0", "device_ip":"100.1.2.3", "rank_id":"0" }, { "device_id":"1", "device_ip":"100.1.2.4", "rank_id":"1" } ] } ] }

On a dual-card 910B, graph mode reports the same error as @zzhaowendao; switching to eager mode works fine.

sec-an avatar Jun 10 '25 02:06 sec-an

@zzhaowendao @sec-an If you are using our old image, please double-check that the rank table is correct (i.e. that the IPs really belong to your machine).

Because we are waiting for the next lmdeploy release, the image has not been updated yet, but you can upgrade inside the old image yourselves. After upgrading and rebuilding the code, single-node runs no longer need a rank table or the environment variables listed above; upgrade both lmdeploy and dlinfer to the latest versions. On 910B, use CANN 8.2 with torch and torch_npu 2.3.1; you can directly docker pull crpi-4crprmm5baj1v8iv.cn-hangzhou.personal.cr.aliyuncs.com/lmdeploy_dlinfer/ascend:910b-latest. On 300I Duo, use CANN 8.1.RC1 with torch and torch_npu 2.3.1; network problems keep causing that image upload to fail.

jinminxi104 avatar Jun 11 '25 01:06 jinminxi104