
[Bug]: Error during load testing of Qwen2.5-72B-Instruct deployed with vLLM

Open · WangJianQ-0118 opened this issue 4 months ago · 1 comment

Model Series

Qwen2.5

What are the models used?

Qwen2.5-72B-Instruct

What is the scenario where the problem happened?

AsyncLLMEngine has failed, terminating server process

Is this a known issue?

  • [X] I have followed the GitHub README.
  • [X] I have checked the Qwen documentation and cannot find an answer there.
  • [X] I have checked the documentation of the related framework and cannot find useful information.
  • [X] I have searched the issues and there is not a similar one.

Information about environment

None

Log output

Model: Qwen2.5-72B-Instruct
vLLM version: 0.5.5
Machine: 8× L40 GPUs
Input: 15000 tokens
Output: 15000 tokens
Concurrency: 5
Error observed:
INFO:     Shutting down
INFO:     Waiting for connections to close. (CTRL+C to force quit)
ERROR 10-10 01:53:56 client.py:412] TimeoutError("Server didn't reply within 5000 ms")
ERROR 10-10 01:53:56 client.py:412] Traceback (most recent call last):
ERROR 10-10 01:53:56 client.py:412]   File "/root/miniforge3/envs/Qwen2/lib/python3.10/site-packages/vllm/entrypoints/openai/rpc/client.py", line 409, in generate
ERROR 10-10 01:53:56 client.py:412]     await self.check_health(socket=socket)
ERROR 10-10 01:53:56 client.py:412]   File "/root/miniforge3/envs/Qwen2/lib/python3.10/site-packages/vllm/entrypoints/openai/rpc/client.py", line 431, in check_health
ERROR 10-10 01:53:56 client.py:412]     await self._send_one_way_rpc_request(
ERROR 10-10 01:53:56 client.py:412]   File "/root/miniforge3/envs/Qwen2/lib/python3.10/site-packages/vllm/entrypoints/openai/rpc/client.py", line 261, in _send_one_way_rpc_request
ERROR 10-10 01:53:56 client.py:412]     response = await do_rpc_call(socket, request)
ERROR 10-10 01:53:56 client.py:412]   File "/root/miniforge3/envs/Qwen2/lib/python3.10/site-packages/vllm/entrypoints/openai/rpc/client.py", line 249, in do_rpc_call
ERROR 10-10 01:53:56 client.py:412]     raise TimeoutError("Server didn't reply within "
ERROR 10-10 01:53:56 client.py:412] TimeoutError: Server didn't reply within 5000 ms
CRITICAL 10-10 01:53:56 launcher.py:82] AsyncLLMEngine has failed, terminating server process
INFO:     127.0.0.1:39870 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
[rank4]:[E1010 02:03:51.109239396 ProcessGroupNCCL.cpp:607] [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=22066046, OpType=ALLREDUCE, NumelIn=40960, NumelOut=40960, Timeout(ms)=600000) ran for 600077 milliseconds before timing out.
[rank4]:[E1010 02:03:51.109546096 ProcessGroupNCCL.cpp:1664] [PG 3 Rank 4] Exception (either an error or timeout) detected by watchdog at work: 22066046, last enqueued NCCL work: 22066203, last completed NCCL work: 22066045.
[rank4]:[E1010 02:03:51.109584339 ProcessGroupNCCL.cpp:1709] [PG 3 Rank 4] Timeout at NCCL work: 22066046, last enqueued NCCL work: 22066203, last completed NCCL work: 22066045.
[rank4]:[E1010 02:03:51.109617505 ProcessGroupNCCL.cpp:621] [Rank 4] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank4]:[E1010 02:03:51.109641551 ProcessGroupNCCL.cpp:627] [Rank 4] To avoid data inconsistency, we are taking the entire process down.
[rank4]:[E1010 02:03:51.115357418 ProcessGroupNCCL.cpp:1515] [PG 3 Rank 4] Process group watchdog thread terminated with exception: [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=22066046, OpType=ALLREDUCE, NumelIn=40960, NumelOut=40960, Timeout(ms)=600000) ran for 600077 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f85cbabef86 in /root/miniforge3/envs/Qwen2/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f85ccdbb8d2 in /root/miniforge3/envs/Qwen2/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f85ccdc2313 in /root/miniforge3/envs/Qwen2/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f85ccdc46fc in /root/miniforge3/envs/Qwen2/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x1c240 (0x7f8619068240 in /root/miniforge3/envs/Qwen2/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x7ea5 (0x7f8622784ea5 in /lib64/libpthread.so.0)
frame #6: clone + 0x6d (0x7f8621da4b0d in /lib64/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
  what():  [PG 3 Rank 4] Process group watchdog thread terminated with exception: [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=22066046, OpType=ALLREDUCE, NumelIn=40960, NumelOut=40960, Timeout(ms)=600000) ran for 600077 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f85cbabef86 in /root/miniforge3/envs/Qwen2/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f85ccdbb8d2 in /root/miniforge3/envs/Qwen2/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f85ccdc2313 in /root/miniforge3/envs/Qwen2/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f85ccdc46fc in /root/miniforge3/envs/Qwen2/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x1c240 (0x7f8619068240 in /root/miniforge3/envs/Qwen2/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x7ea5 (0x7f8622784ea5 in /lib64/libpthread.so.0)
frame #6: clone + 0x6d (0x7f8621da4b0d in /lib64/libc.so.6)

Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1521 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f85cbabef86 in /root/miniforge3/envs/Qwen2/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe5aa84 (0x7f85cca4da84 in /root/miniforge3/envs/Qwen2/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0x1c240 (0x7f8619068240 in /root/miniforge3/envs/Qwen2/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #3: <unknown function> + 0x7ea5 (0x7f8622784ea5 in /lib64/libpthread.so.0)
frame #4: clone + 0x6d (0x7f8621da4b0d in /lib64/libc.so.6)

Description

Steps to reproduce

This happens to Qwen2.5-xB-Instruct-xxx and xxx. The problem can be reproduced with the following steps:

  1. ...
  2. ...

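The reported setup (5 concurrent requests, ~15000-token input and output against vLLM's OpenAI-compatible endpoint) can be sketched as a minimal load-test client. This is an illustration, not the reporter's actual script: the endpoint URL, port, and prompt text are assumptions; the model name, token counts, and concurrency come from the report.

```python
import asyncio
import json
import urllib.request

API_URL = "http://127.0.0.1:8000/v1/chat/completions"  # assumed default vLLM port
MODEL = "Qwen2.5-72B-Instruct"  # from the report
CONCURRENCY = 5  # from the report

def build_payload(prompt: str, max_tokens: int = 15000) -> dict:
    """OpenAI-compatible chat-completion payload, as vLLM's server expects."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def post_one(prompt: str) -> int:
    """Send one blocking request; returns the HTTP status code."""
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(build_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=1800) as resp:
        return resp.status

async def run_load_test() -> list[int]:
    prompt = "hello " * 15000  # rough stand-in for a long (~15000-token) input
    return await asyncio.gather(
        *(asyncio.to_thread(post_one, prompt) for _ in range(CONCURRENCY))
    )

# Against a running server: asyncio.run(run_load_test())
```

Under this pattern, each long request keeps all 8 tensor-parallel ranks busy in collectives for minutes at a time, which is consistent with the 5000 ms RPC health check in the log expiring before the 600000 ms NCCL watchdog fires.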
Expected results

The results are expected to be ...

Attempts to fix

I have tried several ways to fix this, including:

  1. ...
  2. ...

Anything else helpful for investigation

None

WangJianQ-0118 · Oct 10 '24 07:10