[Bug]: Error during load testing of Qwen2.5-72B-Instruct deployed with vLLM
Model Series
Qwen2.5
What are the models used?
Qwen2.5-72B-Instruct
What is the scenario where the problem happened?
AsyncLLMEngine has failed, terminating server process
Is this a known issue?
- [X] I have followed the GitHub README.
- [X] I have checked the Qwen documentation and cannot find an answer there.
- [X] I have checked the documentation of the related framework and cannot find useful information.
- [X] I have searched the issues and there is not a similar one.
Information about environment
None
Log output
Model: Qwen2.5-72B-Instruct
vLLM version: 0.5.5
Machine: 8x L40 GPUs
Input: 15000 tokens
Output: 15000 tokens
Concurrency: 5
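For reference, a minimal sketch of how the load test was driven (this is not the exact benchmark script; the launch command, served model name, base URL, and prompt content below are assumptions):

```python
# Hypothetical load-test sketch: 5 concurrent requests against the
# OpenAI-compatible endpoint, each asking for up to 15000 output tokens.
# The server is assumed to have been started along the lines of:
#   python -m vllm.entrypoints.openai.api_server \
#       --model Qwen/Qwen2.5-72B-Instruct --tensor-parallel-size 8
import asyncio

from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://127.0.0.1:8000/v1", api_key="EMPTY")

async def one_request(i: int) -> None:
    # Placeholder standing in for the real ~15000-token prompt.
    prompt = "请总结以下内容。" + "测试 " * 15000
    resp = await client.chat.completions.create(
        # Whatever name --model / --served-model-name resolves to.
        model="Qwen2.5-72B-Instruct",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=15000,
        temperature=0.7,
    )
    print(i, resp.usage)

async def main() -> None:
    # Keep 5 requests in flight per round, repeated for many rounds.
    for _ in range(100):
        await asyncio.gather(*(one_request(i) for i in range(5)))

asyncio.run(main())
```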
The following error occurred:
INFO: Shutting down
INFO: Waiting for connections to close. (CTRL+C to force quit)
ERROR 10-10 01:53:56 client.py:412] TimeoutError("Server didn't reply within 5000 ms")
ERROR 10-10 01:53:56 client.py:412] Traceback (most recent call last):
ERROR 10-10 01:53:56 client.py:412] File "/root/miniforge3/envs/Qwen2/lib/python3.10/site-packages/vllm/entrypoints/openai/rpc/client.py", line 409, in generate
ERROR 10-10 01:53:56 client.py:412] await self.check_health(socket=socket)
ERROR 10-10 01:53:56 client.py:412] File "/root/miniforge3/envs/Qwen2/lib/python3.10/site-packages/vllm/entrypoints/openai/rpc/client.py", line 431, in check_health
ERROR 10-10 01:53:56 client.py:412] await self._send_one_way_rpc_request(
ERROR 10-10 01:53:56 client.py:412] File "/root/miniforge3/envs/Qwen2/lib/python3.10/site-packages/vllm/entrypoints/openai/rpc/client.py", line 261, in _send_one_way_rpc_request
ERROR 10-10 01:53:56 client.py:412] response = await do_rpc_call(socket, request)
ERROR 10-10 01:53:56 client.py:412] File "/root/miniforge3/envs/Qwen2/lib/python3.10/site-packages/vllm/entrypoints/openai/rpc/client.py", line 249, in do_rpc_call
ERROR 10-10 01:53:56 client.py:412] raise TimeoutError("Server didn't reply within "
ERROR 10-10 01:53:56 client.py:412] TimeoutError: Server didn't reply within 5000 ms
CRITICAL 10-10 01:53:56 launcher.py:82] AsyncLLMEngine has failed, terminating server process
INFO: 127.0.0.1:39870 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
[rank4]:[E1010 02:03:51.109239396 ProcessGroupNCCL.cpp:607] [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=22066046, OpType=ALLREDUCE, NumelIn=40960, NumelOut=40960, Timeout(ms)=600000) ran for 600077 milliseconds before timing out.
[rank4]:[E1010 02:03:51.109546096 ProcessGroupNCCL.cpp:1664] [PG 3 Rank 4] Exception (either an error or timeout) detected by watchdog at work: 22066046, last enqueued NCCL work: 22066203, last completed NCCL work: 22066045.
[rank4]:[E1010 02:03:51.109584339 ProcessGroupNCCL.cpp:1709] [PG 3 Rank 4] Timeout at NCCL work: 22066046, last enqueued NCCL work: 22066203, last completed NCCL work: 22066045.
[rank4]:[E1010 02:03:51.109617505 ProcessGroupNCCL.cpp:621] [Rank 4] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank4]:[E1010 02:03:51.109641551 ProcessGroupNCCL.cpp:627] [Rank 4] To avoid data inconsistency, we are taking the entire process down.
[rank4]:[E1010 02:03:51.115357418 ProcessGroupNCCL.cpp:1515] [PG 3 Rank 4] Process group watchdog thread terminated with exception: [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=22066046, OpType=ALLREDUCE, NumelIn=40960, NumelOut=40960, Timeout(ms)=600000) ran for 600077 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f85cbabef86 in /root/miniforge3/envs/Qwen2/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f85ccdbb8d2 in /root/miniforge3/envs/Qwen2/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f85ccdc2313 in /root/miniforge3/envs/Qwen2/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f85ccdc46fc in /root/miniforge3/envs/Qwen2/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x1c240 (0x7f8619068240 in /root/miniforge3/envs/Qwen2/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x7ea5 (0x7f8622784ea5 in /lib64/libpthread.so.0)
frame #6: clone + 0x6d (0x7f8621da4b0d in /lib64/libc.so.6)
terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG 3 Rank 4] Process group watchdog thread terminated with exception: [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=22066046, OpType=ALLREDUCE, NumelIn=40960, NumelOut=40960, Timeout(ms)=600000) ran for 600077 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f85cbabef86 in /root/miniforge3/envs/Qwen2/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f85ccdbb8d2 in /root/miniforge3/envs/Qwen2/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f85ccdc2313 in /root/miniforge3/envs/Qwen2/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f85ccdc46fc in /root/miniforge3/envs/Qwen2/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x1c240 (0x7f8619068240 in /root/miniforge3/envs/Qwen2/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x7ea5 (0x7f8622784ea5 in /lib64/libpthread.so.0)
frame #6: clone + 0x6d (0x7f8621da4b0d in /lib64/libc.so.6)
Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1521 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f85cbabef86 in /root/miniforge3/envs/Qwen2/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe5aa84 (0x7f85cca4da84 in /root/miniforge3/envs/Qwen2/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0x1c240 (0x7f8619068240 in /root/miniforge3/envs/Qwen2/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #3: <unknown function> + 0x7ea5 (0x7f8622784ea5 in /lib64/libpthread.so.0)
frame #4: clone + 0x6d (0x7f8621da4b0d in /lib64/libc.so.6)
Description
None