
[Bug]: torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1970, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.20.5

White-Friday opened this issue on May 6, 2024 · 2 comments

Your current environment

Run inside the Docker container via docker exec:
python3 api_server.py  --served-model-name qwen-7b-chat --model /data/models/qwen1.5-110B-Chat-GPTQ-Int4/ --quantization gptq  --max-model-len 16384 --tensor-parallel-size 2

🐛 Describe the bug

INFO 05-06 08:55:09 llm_engine.py:100] Initializing an LLM engine (v0.4.2) with config: model='/data/models/qwen1.5-110B-Chat-GPTQ-Int4/', speculative_config=None, tokenizer='/data/models/qwen1.5-110B-Chat-GPTQ-Int4/', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=16384, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=2, disable_custom_all_reduce=False, quantization=gptq, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=qwen-7b-chat)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 05-06 08:55:14 utils.py:660] Found nccl from library /root/.config/vllm/nccl/cu12/libnccl.so.2.18.1
(RayWorkerWrapper pid=5486) INFO 05-06 08:55:14 utils.py:660] Found nccl from library /root/.config/vllm/nccl/cu12/libnccl.so.2.18.1
INFO 05-06 08:55:15 selector.py:27] Using FlashAttention-2 backend.
(RayWorkerWrapper pid=5486) INFO 05-06 08:55:15 selector.py:27] Using FlashAttention-2 backend.
INFO 05-06 08:55:17 pynccl_utils.py:43] vLLM is using nccl==2.18.1
(RayWorkerWrapper pid=5486) INFO 05-06 08:55:17 pynccl_utils.py:43] vLLM is using nccl==2.18.1
INFO 05-06 08:55:18 utils.py:118] generating GPU P2P access cache for in /root/.config/vllm/gpu_p2p_access_cache_for_5,7.json
INFO 05-06 08:55:19 utils.py:132] reading GPU P2P access cache from /root/.config/vllm/gpu_p2p_access_cache_for_5,7.json
(RayWorkerWrapper pid=5486) INFO 05-06 08:55:19 utils.py:132] reading GPU P2P access cache from /root/.config/vllm/gpu_p2p_access_cache_for_5,7.json
ERROR 05-06 08:55:19 worker_base.py:145] Error executing method init_device. This might cause deadlock in distributed execution.
ERROR 05-06 08:55:19 worker_base.py:145] Traceback (most recent call last):
ERROR 05-06 08:55:19 worker_base.py:145]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 137, in execute_method
ERROR 05-06 08:55:19 worker_base.py:145]     return executor(*args, **kwargs)
ERROR 05-06 08:55:19 worker_base.py:145]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 111, in init_device
ERROR 05-06 08:55:19 worker_base.py:145]     init_worker_distributed_environment(self.parallel_config, self.rank,
ERROR 05-06 08:55:19 worker_base.py:145]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 310, in init_worker_distributed_environment
ERROR 05-06 08:55:19 worker_base.py:145]     init_custom_ar()
ERROR 05-06 08:55:19 worker_base.py:145]   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/device_communicators/custom_all_reduce.py", line 79, in init_custom_ar
ERROR 05-06 08:55:19 worker_base.py:145]     _CA_HANDLE = CustomAllreduce(rank, world_size, full_nvlink)
ERROR 05-06 08:55:19 worker_base.py:145]   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/device_communicators/custom_all_reduce.py", line 213, in __init__
ERROR 05-06 08:55:19 worker_base.py:145]     handles, offsets = self._get_ipc_meta(self.meta)
ERROR 05-06 08:55:19 worker_base.py:145]   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/device_communicators/custom_all_reduce.py", line 226, in _get_ipc_meta
ERROR 05-06 08:55:19 worker_base.py:145]     return self._gather_ipc_meta(shard_data)
ERROR 05-06 08:55:19 worker_base.py:145]   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/device_communicators/custom_all_reduce.py", line 230, in _gather_ipc_meta
ERROR 05-06 08:55:19 worker_base.py:145]     dist.all_gather_object(all_data, shard_data)
ERROR 05-06 08:55:19 worker_base.py:145]   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
ERROR 05-06 08:55:19 worker_base.py:145]     return func(*args, **kwargs)
ERROR 05-06 08:55:19 worker_base.py:145]   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 2436, in all_gather_object
ERROR 05-06 08:55:19 worker_base.py:145]     all_gather(object_size_list, local_size, group=group)
ERROR 05-06 08:55:19 worker_base.py:145]   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
ERROR 05-06 08:55:19 worker_base.py:145]     return func(*args, **kwargs)
ERROR 05-06 08:55:19 worker_base.py:145]   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 2854, in all_gather
ERROR 05-06 08:55:19 worker_base.py:145]     work = default_pg.allgather([tensor_list], [tensor])
ERROR 05-06 08:55:19 worker_base.py:145] torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1970, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.20.5
ERROR 05-06 08:55:19 worker_base.py:145] ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error. 
ERROR 05-06 08:55:19 worker_base.py:145] Last error:
ERROR 05-06 08:55:19 worker_base.py:145] Error while creating shared memory segment /dev/shm/nccl-7zXYBo (size 9637888)
[rank0]: Traceback (most recent call last):
[rank0]:   File "/data/zhaoyin/project/qwen/api_server.py", line 164, in <module>
[rank0]:     engine = AsyncLLMEngine.from_engine_args(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 366, in from_engine_args
[rank0]:     engine = cls(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 324, in __init__
[rank0]:     self.engine = self._init_engine(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 442, in _init_engine
[rank0]:     return engine_class(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 160, in __init__
[rank0]:     self.model_executor = executor_class(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/ray_gpu_executor.py", line 300, in __init__
[rank0]:     super().__init__(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/executor_base.py", line 41, in __init__
[rank0]:     self._init_executor()
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/ray_gpu_executor.py", line 43, in _init_executor
[rank0]:     self._init_workers_ray(placement_group)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/ray_gpu_executor.py", line 164, in _init_workers_ray
[rank0]:     self._run_workers("init_device")
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/ray_gpu_executor.py", line 234, in _run_workers
[rank0]:     driver_worker_output = self.driver_worker.execute_method(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 146, in execute_method
[rank0]:     raise e
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 137, in execute_method
[rank0]:     return executor(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 111, in init_device
[rank0]:     init_worker_distributed_environment(self.parallel_config, self.rank,
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 310, in init_worker_distributed_environment
[rank0]:     init_custom_ar()
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/device_communicators/custom_all_reduce.py", line 79, in init_custom_ar
[rank0]:     _CA_HANDLE = CustomAllreduce(rank, world_size, full_nvlink)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/device_communicators/custom_all_reduce.py", line 213, in __init__
[rank0]:     handles, offsets = self._get_ipc_meta(self.meta)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/device_communicators/custom_all_reduce.py", line 226, in _get_ipc_meta
[rank0]:     return self._gather_ipc_meta(shard_data)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/device_communicators/custom_all_reduce.py", line 230, in _gather_ipc_meta
[rank0]:     dist.all_gather_object(all_data, shard_data)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 2436, in all_gather_object
[rank0]:     all_gather(object_size_list, local_size, group=group)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 2854, in all_gather
[rank0]:     work = default_pg.allgather([tensor_list], [tensor])
[rank0]: torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1970, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.20.5
[rank0]: ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error. 
[rank0]: Last error:
[rank0]: Error while creating shared memory segment /dev/shm/nccl-7zXYBo (size 9637888)
(RayWorkerWrapper pid=5486) ERROR 05-06 08:55:19 worker_base.py:145] Error executing method init_device. This might cause deadlock in distributed execution.
(RayWorkerWrapper pid=5486) ERROR 05-06 08:55:19 worker_base.py:145] Traceback (most recent call last):
(RayWorkerWrapper pid=5486) ERROR 05-06 08:55:19 worker_base.py:145]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 137, in execute_method
(RayWorkerWrapper pid=5486) ERROR 05-06 08:55:19 worker_base.py:145]     return executor(*args, **kwargs)
(RayWorkerWrapper pid=5486) ERROR 05-06 08:55:19 worker_base.py:145]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 111, in init_device
(RayWorkerWrapper pid=5486) ERROR 05-06 08:55:19 worker_base.py:145]     init_worker_distributed_environment(self.parallel_config, self.rank,
(RayWorkerWrapper pid=5486) ERROR 05-06 08:55:19 worker_base.py:145]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 310, in init_worker_distributed_environment
(RayWorkerWrapper pid=5486) ERROR 05-06 08:55:19 worker_base.py:145]     init_custom_ar()
(RayWorkerWrapper pid=5486) ERROR 05-06 08:55:19 worker_base.py:145]   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/device_communicators/custom_all_reduce.py", line 79, in init_custom_ar
(RayWorkerWrapper pid=5486) ERROR 05-06 08:55:19 worker_base.py:145]     _CA_HANDLE = CustomAllreduce(rank, world_size, full_nvlink)
(RayWorkerWrapper pid=5486) ERROR 05-06 08:55:19 worker_base.py:145]   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/device_communicators/custom_all_reduce.py", line 213, in __init__
(RayWorkerWrapper pid=5486) ERROR 05-06 08:55:19 worker_base.py:145]     handles, offsets = self._get_ipc_meta(self.meta)
(RayWorkerWrapper pid=5486) ERROR 05-06 08:55:19 worker_base.py:145]   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/device_communicators/custom_all_reduce.py", line 226, in _get_ipc_meta
(RayWorkerWrapper pid=5486) ERROR 05-06 08:55:19 worker_base.py:145]     return self._gather_ipc_meta(shard_data)
(RayWorkerWrapper pid=5486) ERROR 05-06 08:55:19 worker_base.py:145]   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/device_communicators/custom_all_reduce.py", line 230, in _gather_ipc_meta
(RayWorkerWrapper pid=5486) ERROR 05-06 08:55:19 worker_base.py:145]     dist.all_gather_object(all_data, shard_data)
(RayWorkerWrapper pid=5486) ERROR 05-06 08:55:19 worker_base.py:145]   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
(RayWorkerWrapper pid=5486) ERROR 05-06 08:55:19 worker_base.py:145]     return func(*args, **kwargs)
(RayWorkerWrapper pid=5486) ERROR 05-06 08:55:19 worker_base.py:145]   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 2436, in all_gather_object
(RayWorkerWrapper pid=5486) ERROR 05-06 08:55:19 worker_base.py:145]     all_gather(object_size_list, local_size, group=group)
(RayWorkerWrapper pid=5486) ERROR 05-06 08:55:19 worker_base.py:145]   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
(RayWorkerWrapper pid=5486) ERROR 05-06 08:55:19 worker_base.py:145]     return func(*args, **kwargs)
(RayWorkerWrapper pid=5486) ERROR 05-06 08:55:19 worker_base.py:145]   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 2854, in all_gather
(RayWorkerWrapper pid=5486) ERROR 05-06 08:55:19 worker_base.py:145]     work = default_pg.allgather([tensor_list], [tensor])
(RayWorkerWrapper pid=5486) ERROR 05-06 08:55:19 worker_base.py:145] torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1970, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.20.5
(RayWorkerWrapper pid=5486) ERROR 05-06 08:55:19 worker_base.py:145] ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error. 
(RayWorkerWrapper pid=5486) ERROR 05-06 08:55:19 worker_base.py:145] Last error:
(RayWorkerWrapper pid=5486) ERROR 05-06 08:55:19 worker_base.py:145] Error while creating shared memory segment /dev/shm/nccl-bhqmcv (size 9637888)
(RayWorkerWrapper pid=5486) Exception ignored in: <function CustomAllreduce.__del__ at 0x7f29a6dc5000>
(RayWorkerWrapper pid=5486) Traceback (most recent call last):
(RayWorkerWrapper pid=5486)   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/device_communicators/custom_all_reduce.py", line 274, in __del__
(RayWorkerWrapper pid=5486)     self.close()
(RayWorkerWrapper pid=5486)   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/device_communicators/custom_all_reduce.py", line 269, in close
(RayWorkerWrapper pid=5486)     if self._ptr:
(RayWorkerWrapper pid=5486) AttributeError: 'CustomAllreduce' object has no attribute '_ptr'
Exception ignored in: <function CustomAllreduce.__del__ at 0x7fa7fc6d7be0>
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/device_communicators/custom_all_reduce.py", line 274, in __del__
    self.close()
  File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/device_communicators/custom_all_reduce.py", line 269, in close
    if self._ptr:
AttributeError: 'CustomAllreduce' object has no attribute '_ptr'
[rank0]:[W CudaIPCTypes.cpp:16] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]

White-Friday · May 6, 2024

How can the problem above be solved?

White-Friday · May 6, 2024

The error is:

Error while creating shared memory segment /dev/shm/nccl-bhqmcv (size 9637888)

You can try reaching out to the NCCL project at https://github.com/NVIDIA/nccl .
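As a first diagnostic step, you could check how much shared memory the container actually has and re-run with NCCL debug logging enabled, as the error message itself suggests. A minimal sketch, assuming the same launch command as in the report above:

# Check the shared-memory size available inside the container
df -h /dev/shm

# Re-run with NCCL debug logging to get more detail on the failure
NCCL_DEBUG=INFO python3 api_server.py --served-model-name qwen-7b-chat \
    --model /data/models/qwen1.5-110B-Chat-GPTQ-Int4/ --quantization gptq \
    --max-model-len 16384 --tensor-parallel-size 2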

Or, I guess, this might be caused by insufficient shared-memory space, which is documented at https://docs.vllm.ai/en/latest/serving/deploying_with_docker.html :

You can either use the ipc=host flag or --shm-size flag to allow the container to access the host’s shared memory. vLLM uses PyTorch, which uses shared memory to share data between processes under the hood, particularly for tensor parallel inference.
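For example, a container could be started with either of these flags (the image name "your-vllm-image", the mount path, and the 16g size are illustrative placeholders; adjust them to your setup):

# Option 1: share the host's IPC namespace (and its /dev/shm) with the container
docker run --gpus all --ipc=host \
    -v /data/models:/data/models \
    your-vllm-image \
    python3 api_server.py --served-model-name qwen-7b-chat \
    --model /data/models/qwen1.5-110B-Chat-GPTQ-Int4/ --quantization gptq \
    --max-model-len 16384 --tensor-parallel-size 2

# Option 2: keep a private IPC namespace but give /dev/shm an explicit, larger size
docker run --gpus all --shm-size=16g \
    -v /data/models:/data/models \
    your-vllm-image \
    python3 api_server.py --served-model-name qwen-7b-chat \
    --model /data/models/qwen1.5-110B-Chat-GPTQ-Int4/ --quantization gptq \
    --max-model-len 16384 --tensor-parallel-size 2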

youkaichao · May 6, 2024

@youkaichao Thanks a lot

White-Friday · May 7, 2024