
[Bug]: Gloo Connection reset by peer

Open thies1006 opened this issue 7 months ago • 1 comment

Your current environment

Collecting environment information...
PyTorch version: 2.3.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.1 LTS (x86_64)
GCC version: (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0
Clang version: Could not collect
CMake version: version 3.30.0
Libc version: glibc-2.35

Python version: 3.10.6 (main, Nov 14 2022, 16:10:14) [GCC 11.3.0] (64-bit runtime)
Python platform: Linux-5.15.0-58-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: 
GPU 0: NVIDIA L4
GPU 1: NVIDIA L4
GPU 2: NVIDIA L4
GPU 3: NVIDIA L4
GPU 4: NVIDIA L4
GPU 5: NVIDIA L4
GPU 6: NVIDIA L4
GPU 7: NVIDIA L4

Nvidia driver version: 535.86.10
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True


Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] torch==2.3.0
[pip3] torchvision==0.18.0
[pip3] transformers==4.42.3
[pip3] triton==2.3.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.5.1
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled

🐛 Describe the bug

I'm running Llama3-70B on two nodes with 8 GPUs each using TP=16. I tried adding the eager-mode and disable-custom-all-reduce options, without success. The first ~100 queries always run fine, but after a while I get the following RuntimeError (a sketch of my launch configuration is included after the traceback):

(RayWorkerWrapper pid=191565) ERROR 07-10 17:40:09 worker_base.py:348] Error executing method start_worker_execution_loop. This might cause deadlock in distributed execution.
(RayWorkerWrapper pid=191565) ERROR 07-10 17:40:09 worker_base.py:348] Traceback (most recent call last):
(RayWorkerWrapper pid=191565) ERROR 07-10 17:40:09 worker_base.py:348]   File "/secondary/thies/.virtualenvs/vllm/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 340, in execute_method
(RayWorkerWrapper pid=191565) ERROR 07-10 17:40:09 worker_base.py:348]     return executor(*args, **kwargs)
(RayWorkerWrapper pid=191565) ERROR 07-10 17:40:09 worker_base.py:348]   File "/secondary/thies/.virtualenvs/vllm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
(RayWorkerWrapper pid=191565) ERROR 07-10 17:40:09 worker_base.py:348]     return func(*args, **kwargs)
(RayWorkerWrapper pid=191565) ERROR 07-10 17:40:09 worker_base.py:348]   File "/secondary/thies/.virtualenvs/vllm/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 64, in start_worker_execution_loop
(RayWorkerWrapper pid=191565) ERROR 07-10 17:40:09 worker_base.py:348]     output = self.execute_model(execute_model_req=None)
(RayWorkerWrapper pid=191565) ERROR 07-10 17:40:09 worker_base.py:348]   File "/secondary/thies/.virtualenvs/vllm/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 249, in execute_model
(RayWorkerWrapper pid=191565) ERROR 07-10 17:40:09 worker_base.py:348]     broadcast_data = broadcast_tensor_dict(src=0)
(RayWorkerWrapper pid=191565) ERROR 07-10 17:40:09 worker_base.py:348]   File "/secondary/thies/.virtualenvs/vllm/lib/python3.10/site-packages/vllm/distributed/communication_op.py", line 32, in broadcast_tensor_dict
(RayWorkerWrapper pid=191565) ERROR 07-10 17:40:09 worker_base.py:348]     return get_tp_group().broadcast_tensor_dict(tensor_dict, src)
(RayWorkerWrapper pid=191565) ERROR 07-10 17:40:09 worker_base.py:348]   File "/secondary/thies/.virtualenvs/vllm/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 528, in broadcast_tensor_dict
(RayWorkerWrapper pid=191565) ERROR 07-10 17:40:09 worker_base.py:348]     metadata_list = self.broadcast_object(None, src=src)
(RayWorkerWrapper pid=191565) ERROR 07-10 17:40:09 worker_base.py:348]   File "/secondary/thies/.virtualenvs/vllm/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 390, in broadcast_object
(RayWorkerWrapper pid=191565) ERROR 07-10 17:40:09 worker_base.py:348]     torch.distributed.broadcast_object_list(recv,
(RayWorkerWrapper pid=191565) ERROR 07-10 17:40:09 worker_base.py:348]   File "/secondary/thies/.virtualenvs/vllm/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
(RayWorkerWrapper pid=191565) ERROR 07-10 17:40:09 worker_base.py:348]     return func(*args, **kwargs)
(RayWorkerWrapper pid=191565) ERROR 07-10 17:40:09 worker_base.py:348]   File "/secondary/thies/.virtualenvs/vllm/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2649, in broadcast_object_list
(RayWorkerWrapper pid=191565) ERROR 07-10 17:40:09 worker_base.py:348]     broadcast(object_sizes_tensor, src=src, group=group)
(RayWorkerWrapper pid=191565) ERROR 07-10 17:40:09 worker_base.py:348]   File "/secondary/thies/.virtualenvs/vllm/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
(RayWorkerWrapper pid=191565) ERROR 07-10 17:40:09 worker_base.py:348]     return func(*args, **kwargs)
(RayWorkerWrapper pid=191565) ERROR 07-10 17:40:09 worker_base.py:348]   File "/secondary/thies/.virtualenvs/vllm/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2144, in broadcast
(RayWorkerWrapper pid=191565) ERROR 07-10 17:40:09 worker_base.py:348]     work.wait()
(RayWorkerWrapper pid=191565) ERROR 07-10 17:40:09 worker_base.py:348] RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:525] Read error [172.26.161.177]:50407: Connection reset by peer
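
For reference, this is roughly how I start the engine; the model id and sampling settings below are placeholders rather than my exact values:

from vllm import LLM, SamplingParams

# Ray cluster is already started across both nodes (8x L4 per node);
# the engine shards the model over all 16 GPUs via tensor parallelism.
llm = LLM(
    model="meta-llama/Meta-Llama-3-70B-Instruct",  # placeholder model id
    tensor_parallel_size=16,
    enforce_eager=True,              # the "eager-mode" option mentioned above
    disable_custom_all_reduce=True,  # the "disable-custom-all-reduce" option
)

# Requests like this succeed for roughly the first 100 queries,
# then the Gloo "Connection reset by peer" error above appears.
outputs = llm.generate(["Hello"], SamplingParams(max_tokens=32))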

thies1006 · Jul 10 '24 14:07