[Bug]: Gloo Connection reset by peer
Your current environment
Collecting environment information...
PyTorch version: 2.3.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A
OS: Ubuntu 22.04.1 LTS (x86_64)
GCC version: (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0
Clang version: Could not collect
CMake version: version 3.30.0
Libc version: glibc-2.35
Python version: 3.10.6 (main, Nov 14 2022, 16:10:14) [GCC 11.3.0] (64-bit runtime)
Python platform: Linux-5.15.0-58-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA L4
GPU 1: NVIDIA L4
GPU 2: NVIDIA L4
GPU 3: NVIDIA L4
GPU 4: NVIDIA L4
GPU 5: NVIDIA L4
GPU 6: NVIDIA L4
GPU 7: NVIDIA L4
Nvidia driver version: 535.86.10
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] torch==2.3.0
[pip3] torchvision==0.18.0
[pip3] transformers==4.42.3
[pip3] triton==2.3.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.5.1
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
🐛 Describe the bug
I'm running Llama3-70B on two nodes with 8 GPUs each, using TP=16. I tried adding the eager-mode and disable-custom-all-reduce options, without success. The first ~100 queries always run fine, but after a while I get this RuntimeError (a rough sketch of my launch configuration follows the traceback):
(RayWorkerWrapper pid=191565) ERROR 07-10 17:40:09 worker_base.py:348] Error executing method start_worker_execution_loop. This might cause deadlock in distributed execution.
(RayWorkerWrapper pid=191565) ERROR 07-10 17:40:09 worker_base.py:348] Traceback (most recent call last):
(RayWorkerWrapper pid=191565) ERROR 07-10 17:40:09 worker_base.py:348] File "/secondary/thies/.virtualenvs/vllm/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 340, in execute_method
(RayWorkerWrapper pid=191565) ERROR 07-10 17:40:09 worker_base.py:348] return executor(*args, **kwargs)
(RayWorkerWrapper pid=191565) ERROR 07-10 17:40:09 worker_base.py:348] File "/secondary/thies/.virtualenvs/vllm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
(RayWorkerWrapper pid=191565) ERROR 07-10 17:40:09 worker_base.py:348] return func(*args, **kwargs)
(RayWorkerWrapper pid=191565) ERROR 07-10 17:40:09 worker_base.py:348] File "/secondary/thies/.virtualenvs/vllm/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 64, in start_worker_execution_loop
(RayWorkerWrapper pid=191565) ERROR 07-10 17:40:09 worker_base.py:348] output = self.execute_model(execute_model_req=None)
(RayWorkerWrapper pid=191565) ERROR 07-10 17:40:09 worker_base.py:348] File "/secondary/thies/.virtualenvs/vllm/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 249, in execute_model
(RayWorkerWrapper pid=191565) ERROR 07-10 17:40:09 worker_base.py:348] broadcast_data = broadcast_tensor_dict(src=0)
(RayWorkerWrapper pid=191565) ERROR 07-10 17:40:09 worker_base.py:348] File "/secondary/thies/.virtualenvs/vllm/lib/python3.10/site-packages/vllm/distributed/communication_op.py", line 32, in broadcast_tensor_dict
(RayWorkerWrapper pid=191565) ERROR 07-10 17:40:09 worker_base.py:348] return get_tp_group().broadcast_tensor_dict(tensor_dict, src)
(RayWorkerWrapper pid=191565) ERROR 07-10 17:40:09 worker_base.py:348] File "/secondary/thies/.virtualenvs/vllm/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 528, in broadcast_tensor_dict
(RayWorkerWrapper pid=191565) ERROR 07-10 17:40:09 worker_base.py:348] metadata_list = self.broadcast_object(None, src=src)
(RayWorkerWrapper pid=191565) ERROR 07-10 17:40:09 worker_base.py:348] File "/secondary/thies/.virtualenvs/vllm/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 390, in broadcast_object
(RayWorkerWrapper pid=191565) ERROR 07-10 17:40:09 worker_base.py:348] torch.distributed.broadcast_object_list(recv,
(RayWorkerWrapper pid=191565) ERROR 07-10 17:40:09 worker_base.py:348] File "/secondary/thies/.virtualenvs/vllm/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
(RayWorkerWrapper pid=191565) ERROR 07-10 17:40:09 worker_base.py:348] return func(*args, **kwargs)
(RayWorkerWrapper pid=191565) ERROR 07-10 17:40:09 worker_base.py:348] File "/secondary/thies/.virtualenvs/vllm/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2649, in broadcast_object_list
(RayWorkerWrapper pid=191565) ERROR 07-10 17:40:09 worker_base.py:348] broadcast(object_sizes_tensor, src=src, group=group)
(RayWorkerWrapper pid=191565) ERROR 07-10 17:40:09 worker_base.py:348] File "/secondary/thies/.virtualenvs/vllm/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
(RayWorkerWrapper pid=191565) ERROR 07-10 17:40:09 worker_base.py:348] return func(*args, **kwargs)
(RayWorkerWrapper pid=191565) ERROR 07-10 17:40:09 worker_base.py:348] File "/secondary/thies/.virtualenvs/vllm/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2144, in broadcast
(RayWorkerWrapper pid=191565) ERROR 07-10 17:40:09 worker_base.py:348] work.wait()
(RayWorkerWrapper pid=191565) ERROR 07-10 17:40:09 worker_base.py:348] RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:525] Read error [172.26.161.177]:50407: Connection reset by peer
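For context, this is roughly how the engine is configured. It is a minimal sketch, not my exact launch command: I actually serve the model through the OpenAI-compatible API server, the model path here is a placeholder, and it assumes a Ray cluster spanning both nodes has already been started with `ray start`.

```python
# Minimal sketch of the setup described above (illustrative only).
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-70B-Instruct",  # placeholder model path
    tensor_parallel_size=16,                       # 2 nodes x 8 L4 GPUs
    enforce_eager=True,                            # the "eager-mode" attempt
    disable_custom_all_reduce=True,                # the second attempt
)

outputs = llm.generate(["Hello"], SamplingParams(max_tokens=16))
```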
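The traceback ends inside torch.distributed's Gloo (TCP) backend while broadcasting the tensor-dict metadata between the nodes. A hypothetical minimal script like the one below (my assumption, not from vLLM) exercises the same `broadcast_object_list` call over a Gloo process group in isolation; if the TCP link between the two nodes is flaky, I would expect it to eventually fail with the same "Connection reset by peer" from gloo/transport/tcp/pair.cc.

```python
# Hypothetical standalone repro of the failing call path:
# run one copy per node with MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE set.
import torch.distributed as dist

dist.init_process_group(backend="gloo")  # reads RANK/WORLD_SIZE/MASTER_* from the environment

for step in range(100_000):
    # Rank 0 broadcasts a small Python object, mirroring the metadata
    # exchange in vllm's broadcast_tensor_dict.
    obj = [{"step": step}] if dist.get_rank() == 0 else [None]
    dist.broadcast_object_list(obj, src=0)

dist.destroy_process_group()
```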