[Bug]: NCCL gets stuck when instantiating the LLM class.
Your current environment
The output of `python collect_env.py`
Collecting environment information...
PyTorch version: 2.4.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A
OS: Ubuntu 22.04.3 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.35
Python version: 3.10.14 (main, May 6 2024, 19:42:50) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-6.8.0-40-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 12.4.131
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA A800 80GB PCIe
GPU 1: NVIDIA A800 80GB PCIe
GPU 2: NVIDIA A800 80GB PCIe
GPU 3: NVIDIA A800 80GB PCIe
GPU 4: NVIDIA A800 80GB PCIe
GPU 5: NVIDIA A800 80GB PCIe
GPU 6: NVIDIA A800 80GB PCIe
GPU 7: NVIDIA A800 80GB PCIe
Nvidia driver version: 560.35.03
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 52 bits physical, 57 bits virtual
Byte Order: Little Endian
CPU(s): 144
On-line CPU(s) list: 0-143
Vendor ID: GenuineIntel
Model name: Intel(R) Xeon(R) Platinum 8352V CPU @ 2.10GHz
CPU family: 6
Model: 106
Thread(s) per core: 2
Core(s) per socket: 36
Socket(s): 2
Stepping: 6
CPU max MHz: 3500.0000
CPU min MHz: 800.0000
BogoMIPS: 4200.00
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local split_lock_detect wbnoinvd dtherm ida arat pln pts vnmi avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcntdq la57 rdpid fsrm md_clear pconfig flush_l1d arch_capabilities
Virtualization: VT-x
L1d cache: 3.4 MiB (72 instances)
L1i cache: 2.3 MiB (72 instances)
L2 cache: 90 MiB (72 instances)
L3 cache: 108 MiB (2 instances)
NUMA node(s): 2
NUMA node0 CPU(s): 0-35,72-107
NUMA node1 CPU(s): 36-71,108-143
Vulnerability Gather data sampling: Mitigation; Microcode
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Mitigation; Clear CPU buffers; SMT vulnerable
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Enhanced / Automatic IBRS; IBPB conditional; RSB filling; PBRSB-eIBRS SW sequence; BHI SW loop, KVM SW loop
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-cublas-cu12==12.1.3.1
[pip3] nvidia-cuda-cupti-cu12==12.1.105
[pip3] nvidia-cuda-nvrtc-cu12==12.1.105
[pip3] nvidia-cuda-runtime-cu12==12.1.105
[pip3] nvidia-cudnn-cu12==9.1.0.70
[pip3] nvidia-cufft-cu12==11.0.2.54
[pip3] nvidia-curand-cu12==10.3.2.106
[pip3] nvidia-cusolver-cu12==11.4.5.107
[pip3] nvidia-cusparse-cu12==12.1.0.106
[pip3] nvidia-ml-py==12.560.30
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] nvidia-nvjitlink-cu12==12.6.68
[pip3] nvidia-nvtx-cu12==12.1.105
[pip3] pyzmq==26.2.0
[pip3] torch==2.4.0
[pip3] torchaudio==2.4.1
[pip3] torchvision==0.19.0
[pip3] transformers==4.44.2
[pip3] triton==3.0.0
[conda] _anaconda_depends 2024.06 py312_mkl_2 https://repo.anaconda.com/pkgs/main
[conda] blas 1.0 mkl https://repo.anaconda.com/pkgs/main
[conda] mkl 2023.1.0 h213fc3f_46344 https://repo.anaconda.com/pkgs/main
[conda] mkl-service 2.4.0 py312h5eee18b_1 https://repo.anaconda.com/pkgs/main
[conda] mkl_fft 1.3.8 py312h5eee18b_0 https://repo.anaconda.com/pkgs/main
[conda] mkl_random 1.2.4 py312hdb19cb5_0 https://repo.anaconda.com/pkgs/main
[conda] numpy 1.26.4 py312hc5e2394_0 https://repo.anaconda.com/pkgs/main
[conda] numpy-base 1.26.4 py312h0da6c21_0 https://repo.anaconda.com/pkgs/main
[conda] numpydoc 1.7.0 py312h06a4308_0 https://repo.anaconda.com/pkgs/main
[conda] pyzmq 25.1.2 py312h6a678d5_0 https://repo.anaconda.com/pkgs/main
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.6.0@32e7db25365415841ebc7c4215851743fbb1bad1
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X PXB PXB PXB SYS SYS SYS SYS 0-35,72-107 0 N/A
GPU1 PXB X PXB PXB SYS SYS SYS SYS 0-35,72-107 0 N/A
GPU2 PXB PXB X PIX SYS SYS SYS SYS 0-35,72-107 0 N/A
GPU3 PXB PXB PIX X SYS SYS SYS SYS 0-35,72-107 0 N/A
GPU4 SYS SYS SYS SYS X PXB PXB PXB 36-71,108-143 1 N/A
GPU5 SYS SYS SYS SYS PXB X PXB PXB 36-71,108-143 1 N/A
GPU6 SYS SYS SYS SYS PXB PXB X PIX 36-71,108-143 1 N/A
GPU7 SYS SYS SYS SYS PXB PXB PIX X 36-71,108-143 1 N/A
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
Model Input Dumps
No response
🐛 Describe the bug
NCCL gets stuck when instantiating the LLM class. I can't even press CTRL+C to stop it. I enabled all the debug logging, but I don't know what to do next.
from vllm import LLM


def init_model() -> LLM:
    llm = LLM(
        model="Qwen/Qwen2-7B-Instruct",
        tokenizer_mode="auto",
        trust_remote_code=True,
        download_dir="./.cache",
        tensor_parallel_size=2,  # How many GPUs to use
        gpu_memory_utilization=0.85,
        pipeline_parallel_size=1,
        dtype="bfloat16",
        # max_model_len=20480,  # Model context length
        enable_prefix_caching=True,
        enable_chunked_prefill=False,
        num_scheduler_steps=8,
    )
    return llm


if __name__ == "__main__":
    llm = init_model()
    print(llm.generate("Hello, world!"))
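The debug output below was turned on with environment variables along these lines before launching (a sketch, not my exact shell history; `NCCL_DEBUG` and `VLLM_TRACE_FUNCTION` are the variables the log refers to):

```python
# A sketch (not the exact commands I ran) of how the debug output below was enabled.
# These must be set in the environment before vLLM spawns its worker processes.
import os

os.environ["NCCL_DEBUG"] = "INFO"        # produces the "NCCL INFO ..." lines below
os.environ["VLLM_TRACE_FUNCTION"] = "1"  # produces the /tmp/vllm trace-frame logs below
```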
% ~/anaconda3/envs/psp/bin/python ~/psp/Reasoning-Carefully/main.py
2024-09-13 00:35:48 - INFO - main - The save_path is: ./results/Qwen2-7B/MATH/test/planthensteprefinewoinfo/0913
2024-09-13 00:35:48 - INFO - main - Global variables have been initialized
2024-09-13 00:35:48 - INFO - main - Start logging to huggingface
Token is valid (permission: read).
Your token has been saved in your configured git credential helpers (store).
Your token has been saved to ~/.cache/huggingface/token
Login successful
2024-09-13 00:35:48 - INFO - main - Login to huggingface successfully
[INFO|configuration_utils.py:733] 2024-09-13 00:35:48,990 >> loading configuration file config.json from cache at ~/.cache/huggingface/hub/models--Qwen--Qwen2-7B-Instruct/snapshots/f2826a00ceef68f0f2b946d945ecc0477ce4450c/config.json
[INFO|configuration_utils.py:800] 2024-09-13 00:35:48,991 >> Model config Qwen2Config {
"_name_or_path": "Qwen/Qwen2-7B-Instruct",
"architectures": [
"Qwen2ForCausalLM"
],
"attention_dropout": 0.0,
"bos_token_id": 151643,
"eos_token_id": 151645,
"hidden_act": "silu",
"hidden_size": 3584,
"initializer_range": 0.02,
"intermediate_size": 18944,
"max_position_embeddings": 32768,
"max_window_layers": 28,
"model_type": "qwen2",
"num_attention_heads": 28,
"num_hidden_layers": 28,
"num_key_value_heads": 4,
"rms_norm_eps": 1e-06,
"rope_theta": 1000000.0,
"sliding_window": null,
"tie_word_embeddings": false,
"torch_dtype": "bfloat16",
"transformers_version": "4.44.2",
"use_cache": true,
"use_sliding_window": false,
"vocab_size": 152064
}
[INFO|image_processing_auto.py:286] 2024-09-13 00:35:49,235 >> Could not locate the image processor configuration file, will try to use the model config instead.
INFO 09-13 00:35:49 config.py:890] Defaulting to use mp for distributed inference
WARNING 09-13 00:35:49 arg_utils.py:880] Enabled BlockSpaceManagerV2 because it is required for multi-step (--num-scheduler-steps > 1)
INFO 09-13 00:35:49 llm_engine.py:213] Initializing an LLM engine (v0.6.0) with config: model='Qwen/Qwen2-7B-Instruct', speculative_config=None, tokenizer='Qwen/Qwen2-7B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=32768, download_dir='./.cache', load_format=LoadFormat.AUTO, tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=47, served_model_name=Qwen/Qwen2-7B-Instruct, use_v2_block_manager=True, num_scheduler_steps=8, enable_prefix_caching=True, use_async_output_proc=True)
[INFO|tokenization_utils_base.py:2269] 2024-09-13 00:35:49,541 >> loading file vocab.json from cache at ~/.cache/huggingface/hub/models--Qwen--Qwen2-7B-Instruct/snapshots/f2826a00ceef68f0f2b946d945ecc0477ce4450c/vocab.json
[INFO|tokenization_utils_base.py:2269] 2024-09-13 00:35:49,541 >> loading file merges.txt from cache at ~/.cache/huggingface/hub/models--Qwen--Qwen2-7B-Instruct/snapshots/f2826a00ceef68f0f2b946d945ecc0477ce4450c/merges.txt
[INFO|tokenization_utils_base.py:2269] 2024-09-13 00:35:49,541 >> loading file tokenizer.json from cache at ~/.cache/huggingface/hub/models--Qwen--Qwen2-7B-Instruct/snapshots/f2826a00ceef68f0f2b946d945ecc0477ce4450c/tokenizer.json
[INFO|tokenization_utils_base.py:2269] 2024-09-13 00:35:49,541 >> loading file added_tokens.json from cache at None
[INFO|tokenization_utils_base.py:2269] 2024-09-13 00:35:49,541 >> loading file special_tokens_map.json from cache at None
[INFO|tokenization_utils_base.py:2269] 2024-09-13 00:35:49,541 >> loading file tokenizer_config.json from cache at ~/.cache/huggingface/hub/models--Qwen--Qwen2-7B-Instruct/snapshots/f2826a00ceef68f0f2b946d945ecc0477ce4450c/tokenizer_config.json
[INFO|tokenization_utils_base.py:2513] 2024-09-13 00:35:49,706 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[INFO|configuration_utils.py:993] 2024-09-13 00:35:49,961 >> loading configuration file generation_config.json from cache at ~/.cache/huggingface/hub/models--Qwen--Qwen2-7B-Instruct/snapshots/f2826a00ceef68f0f2b946d945ecc0477ce4450c/generation_config.json
[INFO|configuration_utils.py:1038] 2024-09-13 00:35:49,961 >> Generate config GenerationConfig {
"bos_token_id": 151643,
"do_sample": true,
"eos_token_id": [
151645,
151643
],
"pad_token_id": 151643,
"repetition_penalty": 1.05,
"temperature": 0.7,
"top_k": 20,
"top_p": 0.8
}
WARNING 09-13 00:35:49 multiproc_gpu_executor.py:56] Reducing Torch parallelism from 72 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO 09-13 00:35:49 custom_cache_manager.py:17] Setting Triton cache manager to: vllm.triton_utils.custom_cache_manager:CustomCacheManager
WARNING 09-13 00:35:49 logger.py:147] VLLM_TRACE_FUNCTION is enabled. It will record every function executed by Python. This will slow down the code. It is suggested to be used for debugging hang or crashes only.
INFO 09-13 00:35:49 logger.py:151] Trace frame log is saved to /tmp/vllm/vllm-instance-17ffcfe40d29405d9a707d59e9534c70/VLLM_TRACE_FUNCTION_for_process_3740804_thread_127505140479808_at_2024-09-13_00:35:49.983688.log
(VllmWorkerProcess pid=3740967) WARNING 09-13 00:35:49 logger.py:147] VLLM_TRACE_FUNCTION is enabled. It will record every function executed by Python. This will slow down the code. It is suggested to be used for debugging hang or crashes only.
(VllmWorkerProcess pid=3740967) INFO 09-13 00:35:49 logger.py:151] Trace frame log is saved to /tmp/vllm/vllm-instance-17ffcfe40d29405d9a707d59e9534c70/VLLM_TRACE_FUNCTION_for_process_3740967_thread_127505140479808_at_2024-09-13_00:35:49.984074.log
WARNING 09-13 00:35:50 registry.py:190] `mm_limits` has already been set for model=Qwen/Qwen2-7B-Instruct, and will be overwritten by the new values.
(VllmWorkerProcess pid=3740967) WARNING 09-13 00:35:50 registry.py:190] `mm_limits` has already been set for model=Qwen/Qwen2-7B-Instruct, and will be overwritten by the new values.
(VllmWorkerProcess pid=3740967) INFO 09-13 00:35:51 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
INFO 09-13 00:35:51 utils.py:977] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=3740967) INFO 09-13 00:35:51 utils.py:977] Found nccl from library libnccl.so.2
INFO 09-13 00:35:51 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=3740967) INFO 09-13 00:35:51 pynccl.py:63] vLLM is using nccl==2.20.5
ubuntu-SYS-420GP-TNR:3740804:3740804 [0] NCCL INFO Bootstrap : Using eno2:10.249.46.87<0>
ubuntu-SYS-420GP-TNR:3740804:3740804 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
ubuntu-SYS-420GP-TNR:3740804:3740804 [0] NCCL INFO cudaDriverVersion 12060
NCCL version 2.20.5+cuda12.4
ubuntu-SYS-420GP-TNR:3740967:3740967 [1] NCCL INFO cudaDriverVersion 12060
ubuntu-SYS-420GP-TNR:3740967:3740967 [1] NCCL INFO Bootstrap : Using eno2:10.249.46.87<0>
ubuntu-SYS-420GP-TNR:3740967:3740967 [1] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
ubuntu-SYS-420GP-TNR:3740804:3740804 [0] NCCL INFO Failed to open libibverbs.so[.1]
ubuntu-SYS-420GP-TNR:3740804:3740804 [0] NCCL INFO NET/Socket : Using [0]eno2:10.249.46.87<0> [1]enxbe3af2b6059f:169.254.3.1<0>
ubuntu-SYS-420GP-TNR:3740804:3740804 [0] NCCL INFO Using non-device net plugin version 0
ubuntu-SYS-420GP-TNR:3740804:3740804 [0] NCCL INFO Using network Socket
ubuntu-SYS-420GP-TNR:3740967:3740967 [1] NCCL INFO Failed to open libibverbs.so[.1]
ubuntu-SYS-420GP-TNR:3740967:3740967 [1] NCCL INFO NET/Socket : Using [0]eno2:10.249.46.87<0> [1]enxbe3af2b6059f:169.254.3.1<0>
ubuntu-SYS-420GP-TNR:3740967:3740967 [1] NCCL INFO Using non-device net plugin version 0
ubuntu-SYS-420GP-TNR:3740967:3740967 [1] NCCL INFO Using network Socket
ubuntu-SYS-420GP-TNR:3740967:3740967 [1] NCCL INFO comm 0xc8f13b0 rank 1 nranks 2 cudaDev 1 nvmlDev 3 busId 57000 commId 0xb587bdc561ea636b - Init START
ubuntu-SYS-420GP-TNR:3740804:3740804 [0] NCCL INFO comm 0xb923230 rank 0 nranks 2 cudaDev 0 nvmlDev 2 busId 56000 commId 0xb587bdc561ea636b - Init START
ubuntu-SYS-420GP-TNR:3740804:3740804 [0] NCCL INFO NCCL_CUMEM_ENABLE set by environment to 0.
ubuntu-SYS-420GP-TNR:3740804:3740804 [0] NCCL INFO Setting affinity for GPU 2 to 0fff,ffffff00,0000000f,ffffffff
ubuntu-SYS-420GP-TNR:3740967:3740967 [1] NCCL INFO NCCL_CUMEM_ENABLE set by environment to 0.
ubuntu-SYS-420GP-TNR:3740967:3740967 [1] NCCL INFO Setting affinity for GPU 3 to 0fff,ffffff00,0000000f,ffffffff
ubuntu-SYS-420GP-TNR:3740967:3740967 [1] NCCL INFO comm 0xc8f13b0 rank 1 nRanks 2 nNodes 1 localRanks 2 localRank 1 MNNVL 0
ubuntu-SYS-420GP-TNR:3740967:3740967 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] 0/-1/-1->1->-1 [2] -1/-1/-1->1->0 [3] 0/-1/-1->1->-1
ubuntu-SYS-420GP-TNR:3740967:3740967 [1] NCCL INFO P2P Chunksize set to 131072
ubuntu-SYS-420GP-TNR:3740967:3740967 [1] NCCL INFO Channel 00/0 : 1[3] -> 0[2] via P2P/IPC
ubuntu-SYS-420GP-TNR:3740804:3740804 [0] NCCL INFO comm 0xb923230 rank 0 nRanks 2 nNodes 1 localRanks 2 localRank 0 MNNVL 0
ubuntu-SYS-420GP-TNR:3740804:3740804 [0] NCCL INFO Channel 00/04 : 0 1
ubuntu-SYS-420GP-TNR:3740804:3740804 [0] NCCL INFO Channel 01/04 : 0 1
ubuntu-SYS-420GP-TNR:3740804:3740804 [0] NCCL INFO Channel 02/04 : 0 1
ubuntu-SYS-420GP-TNR:3740804:3740804 [0] NCCL INFO Channel 03/04 : 0 1
ubuntu-SYS-420GP-TNR:3740804:3740804 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] -1/-1/-1->0->1 [2] 1/-1/-1->0->-1 [3] -1/-1/-1->0->1
ubuntu-SYS-420GP-TNR:3740804:3740804 [0] NCCL INFO P2P Chunksize set to 131072
ubuntu-SYS-420GP-TNR:3740967:3740967 [1] NCCL INFO Channel 01/0 : 1[3] -> 0[2] via P2P/IPC
ubuntu-SYS-420GP-TNR:3740967:3740967 [1] NCCL INFO Channel 02/0 : 1[3] -> 0[2] via P2P/IPC
ubuntu-SYS-420GP-TNR:3740967:3740967 [1] NCCL INFO Channel 03/0 : 1[3] -> 0[2] via P2P/IPC
ubuntu-SYS-420GP-TNR:3740804:3740804 [0] NCCL INFO Channel 00/0 : 0[2] -> 1[3] via P2P/IPC
ubuntu-SYS-420GP-TNR:3740804:3740804 [0] NCCL INFO Channel 01/0 : 0[2] -> 1[3] via P2P/IPC
ubuntu-SYS-420GP-TNR:3740804:3740804 [0] NCCL INFO Channel 02/0 : 0[2] -> 1[3] via P2P/IPC
ubuntu-SYS-420GP-TNR:3740804:3740804 [0] NCCL INFO Channel 03/0 : 0[2] -> 1[3] via P2P/IPC
ubuntu-SYS-420GP-TNR:3740804:3740804 [0] NCCL INFO Connected all rings
ubuntu-SYS-420GP-TNR:3740804:3740804 [0] NCCL INFO Connected all trees
ubuntu-SYS-420GP-TNR:3740967:3740967 [1] NCCL INFO Connected all rings
ubuntu-SYS-420GP-TNR:3740967:3740967 [1] NCCL INFO Connected all trees
ubuntu-SYS-420GP-TNR:3740967:3740967 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
ubuntu-SYS-420GP-TNR:3740967:3740967 [1] NCCL INFO 4 coll channels, 0 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer
ubuntu-SYS-420GP-TNR:3740804:3740804 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
ubuntu-SYS-420GP-TNR:3740804:3740804 [0] NCCL INFO 4 coll channels, 0 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer
ubuntu-SYS-420GP-TNR:3740804:3740804 [0] NCCL INFO comm 0xb923230 rank 0 nranks 2 cudaDev 0 nvmlDev 2 busId 56000 commId 0xb587bdc561ea636b - Init COMPLETE
ubuntu-SYS-420GP-TNR:3740967:3740967 [1] NCCL INFO comm 0xc8f13b0 rank 1 nranks 2 cudaDev 1 nvmlDev 3 busId 57000 commId 0xb587bdc561ea636b - Init COMPLETE
Before submitting a new issue...
- [X] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
This may be related to #5484; please take a look at that thread.
I see you are using multi-step, so this could also be related to https://github.com/vllm-project/vllm/pull/8403, which is now merged.
> I see you are using multi-step, so could also be related to #8403, now merged.
I regret to say that this solution did not work for my problem.
> I see you are using multi-step, so could also be related to #8403, now merged.
I ran the test script: it seems to hang at the same place, but this time I was able to terminate it with CTRL+C and see where it was running. Do you have any thoughts on this?
The test.py:
# Test PyTorch NCCL
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)
data = torch.FloatTensor([
    1,
] * 128).to("cuda")
dist.all_reduce(data, op=dist.ReduceOp.SUM)
torch.cuda.synchronize()
value = data.mean().item()
world_size = dist.get_world_size()
assert value == world_size, f"Expected {world_size}, got {value}"

print("PyTorch NCCL is successful!")

# Test PyTorch GLOO
gloo_group = dist.new_group(ranks=list(range(world_size)), backend="gloo")
cpu_data = torch.FloatTensor([
    1,
] * 128)
dist.all_reduce(cpu_data, op=dist.ReduceOp.SUM, group=gloo_group)
value = cpu_data.mean().item()
assert value == world_size, f"Expected {world_size}, got {value}"

print("PyTorch GLOO is successful!")

# Test vLLM NCCL, with cuda graph
from vllm.distributed.device_communicators.pynccl import PyNcclCommunicator

pynccl = PyNcclCommunicator(group=gloo_group, device=local_rank)
pynccl.disabled = False

s = torch.cuda.Stream()
with torch.cuda.stream(s):
    data.fill_(1)
    pynccl.all_reduce(data, stream=s)
    value = data.mean().item()
    assert value == world_size, f"Expected {world_size}, got {value}"

print("vLLM NCCL is successful!")

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(cuda_graph=g, stream=s):
    pynccl.all_reduce(data, stream=torch.cuda.current_stream())

data.fill_(1)
g.replay()
torch.cuda.current_stream().synchronize()
value = data.mean().item()
assert value == world_size, f"Expected {world_size}, got {value}"

print("vLLM NCCL with cuda graph is successful!")

dist.destroy_process_group(gloo_group)
dist.destroy_process_group()
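One way to see where the script is stuck without relying on CTRL+C is to let the standard-library faulthandler dump every thread's Python stack on a timer; a minimal sketch (this was not part of the run below):

```python
# Optional debugging aid: add near the top of test.py to dump all thread stacks
# to stderr every 30 seconds, so the hanging call is visible even if the
# process does not respond to CTRL+C.
import faulthandler

faulthandler.dump_traceback_later(timeout=30, repeat=True)
```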
The result (note that I had to press CTRL+C twice before the process finally stopped):
WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
ubuntu-SYS-420GP-TNR:3888119:3888119 [0] NCCL INFO Bootstrap : Using eno2:10.249.46.87<0>
ubuntu-SYS-420GP-TNR:3888119:3888119 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
ubuntu-SYS-420GP-TNR:3888119:3888119 [0] NCCL INFO cudaDriverVersion 12060
NCCL version 2.20.5+cuda12.4
ubuntu-SYS-420GP-TNR:3888120:3888120 [1] NCCL INFO cudaDriverVersion 12060
ubuntu-SYS-420GP-TNR:3888120:3888120 [1] NCCL INFO Bootstrap : Using eno2:10.249.46.87<0>
ubuntu-SYS-420GP-TNR:3888120:3888120 [1] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
ubuntu-SYS-420GP-TNR:3888119:3888171 [0] NCCL INFO Failed to open libibverbs.so[.1]
ubuntu-SYS-420GP-TNR:3888119:3888171 [0] NCCL INFO NET/Socket : Using [0]eno2:10.249.46.87<0> [1]enxbe3af2b6059f:169.254.3.1<0>
ubuntu-SYS-420GP-TNR:3888119:3888171 [0] NCCL INFO Using non-device net plugin version 0
ubuntu-SYS-420GP-TNR:3888119:3888171 [0] NCCL INFO Using network Socket
ubuntu-SYS-420GP-TNR:3888120:3888172 [1] NCCL INFO Failed to open libibverbs.so[.1]
ubuntu-SYS-420GP-TNR:3888120:3888172 [1] NCCL INFO NET/Socket : Using [0]eno2:10.249.46.87<0> [1]enxbe3af2b6059f:169.254.3.1<0>
ubuntu-SYS-420GP-TNR:3888120:3888172 [1] NCCL INFO Using non-device net plugin version 0
ubuntu-SYS-420GP-TNR:3888120:3888172 [1] NCCL INFO Using network Socket
ubuntu-SYS-420GP-TNR:3888119:3888171 [0] NCCL INFO comm 0x88d5b20 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 4f000 commId 0x6b1dd3dc47dfd2cb - Init START
ubuntu-SYS-420GP-TNR:3888120:3888172 [1] NCCL INFO comm 0x6f70c30 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId 52000 commId 0x6b1dd3dc47dfd2cb - Init START
ubuntu-SYS-420GP-TNR:3888119:3888171 [0] NCCL INFO Setting affinity for GPU 0 to 0fff,ffffff00,0000000f,ffffffff
ubuntu-SYS-420GP-TNR:3888120:3888172 [1] NCCL INFO Setting affinity for GPU 1 to 0fff,ffffff00,0000000f,ffffffff
ubuntu-SYS-420GP-TNR:3888119:3888171 [0] NCCL INFO comm 0x88d5b20 rank 0 nRanks 2 nNodes 1 localRanks 2 localRank 0 MNNVL 0
ubuntu-SYS-420GP-TNR:3888120:3888172 [1] NCCL INFO comm 0x6f70c30 rank 1 nRanks 2 nNodes 1 localRanks 2 localRank 1 MNNVL 0
ubuntu-SYS-420GP-TNR:3888119:3888171 [0] NCCL INFO Channel 00/04 : 0 1
ubuntu-SYS-420GP-TNR:3888120:3888172 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] 0/-1/-1->1->-1 [2] -1/-1/-1->1->0 [3] 0/-1/-1->1->-1
ubuntu-SYS-420GP-TNR:3888119:3888171 [0] NCCL INFO Channel 01/04 : 0 1
ubuntu-SYS-420GP-TNR:3888120:3888172 [1] NCCL INFO P2P Chunksize set to 131072
ubuntu-SYS-420GP-TNR:3888119:3888171 [0] NCCL INFO Channel 02/04 : 0 1
ubuntu-SYS-420GP-TNR:3888119:3888171 [0] NCCL INFO Channel 03/04 : 0 1
ubuntu-SYS-420GP-TNR:3888119:3888171 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] -1/-1/-1->0->1 [2] 1/-1/-1->0->-1 [3] -1/-1/-1->0->1
ubuntu-SYS-420GP-TNR:3888119:3888171 [0] NCCL INFO P2P Chunksize set to 131072
ubuntu-SYS-420GP-TNR:3888120:3888172 [1] NCCL INFO Channel 00/0 : 1[1] -> 0[0] via P2P/CUMEM
ubuntu-SYS-420GP-TNR:3888119:3888171 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/CUMEM
ubuntu-SYS-420GP-TNR:3888120:3888172 [1] NCCL INFO Channel 01/0 : 1[1] -> 0[0] via P2P/CUMEM
ubuntu-SYS-420GP-TNR:3888119:3888171 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[1] via P2P/CUMEM
ubuntu-SYS-420GP-TNR:3888120:3888172 [1] NCCL INFO Channel 02/0 : 1[1] -> 0[0] via P2P/CUMEM
ubuntu-SYS-420GP-TNR:3888119:3888171 [0] NCCL INFO Channel 02/0 : 0[0] -> 1[1] via P2P/CUMEM
ubuntu-SYS-420GP-TNR:3888120:3888172 [1] NCCL INFO Channel 03/0 : 1[1] -> 0[0] via P2P/CUMEM
ubuntu-SYS-420GP-TNR:3888119:3888171 [0] NCCL INFO Channel 03/0 : 0[0] -> 1[1] via P2P/CUMEM
ubuntu-SYS-420GP-TNR:3888119:3888171 [0] NCCL INFO Connected all rings
ubuntu-SYS-420GP-TNR:3888119:3888171 [0] NCCL INFO Connected all trees
ubuntu-SYS-420GP-TNR:3888120:3888172 [1] NCCL INFO Connected all rings
ubuntu-SYS-420GP-TNR:3888120:3888172 [1] NCCL INFO Connected all trees
ubuntu-SYS-420GP-TNR:3888120:3888172 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
ubuntu-SYS-420GP-TNR:3888120:3888172 [1] NCCL INFO 4 coll channels, 0 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer
ubuntu-SYS-420GP-TNR:3888119:3888171 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
ubuntu-SYS-420GP-TNR:3888119:3888171 [0] NCCL INFO 4 coll channels, 0 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer
ubuntu-SYS-420GP-TNR:3888120:3888172 [1] NCCL INFO comm 0x6f70c30 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId 52000 commId 0x6b1dd3dc47dfd2cb - Init COMPLETE
ubuntu-SYS-420GP-TNR:3888119:3888171 [0] NCCL INFO comm 0x88d5b20 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 4f000 commId 0x6b1dd3dc47dfd2cb - Init COMPLETE
^CW0913 09:54:23.519000 130705762707264 torch/distributed/elastic/agent/server/api.py:688] Received Signals.SIGINT death signal, shutting down workers
W0913 09:54:23.519000 130705762707264 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 3888119 closing signal SIGINT
W0913 09:54:23.519000 130705762707264 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 3888120 closing signal SIGINT
^CW0913 09:54:23.699000 130705762707264 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 3888119 closing signal SIGTERM
W0913 09:54:23.700000 130705762707264 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 3888120 closing signal SIGTERM
Traceback (most recent call last):
File "~/anaconda3/envs/psp/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 680, in run
result = self._invoke_run(role)
File "~/anaconda3/envs/psp/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 835, in _invoke_run
time.sleep(monitor_interval)
File "~/anaconda3/envs/psp/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 79, in _terminate_process_handler
raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval)
torch.distributed.elastic.multiprocessing.api.SignalException: Process 3888043 got signal: 2
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "~/anaconda3/envs/psp/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "~/anaconda3/envs/psp/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "~/anaconda3/envs/psp/lib/python3.10/site-packages/torch/distributed/run.py", line 905, in <module>
main()
File "~/anaconda3/envs/psp/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
return f(*args, **kwargs)
File "~/anaconda3/envs/psp/lib/python3.10/site-packages/torch/distributed/run.py", line 901, in main
run(args)
File "~/anaconda3/envs/psp/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
elastic_launch(
File "~/anaconda3/envs/psp/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "~/anaconda3/envs/psp/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 255, in launch_agent
result = agent.run()
File "~/anaconda3/envs/psp/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 124, in wrapper
result = f(*args, **kwargs)
File "~/anaconda3/envs/psp/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 689, in run
self._shutdown(e.sigval)
File "~/anaconda3/envs/psp/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/local_elastic_agent.py", line 347, in _shutdown
self._pcontext.close(death_sig)
File "~/anaconda3/envs/psp/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 544, in close
self._close(death_sig=death_sig, timeout=timeout)
File "~/anaconda3/envs/psp/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 868, in _close
handler.proc.wait(time_to_wait)
File "~/anaconda3/envs/psp/lib/python3.10/subprocess.py", line 1209, in wait
return self._wait(timeout=timeout)
File "~/anaconda3/envs/psp/lib/python3.10/subprocess.py", line 1953, in _wait
time.sleep(delay)
File "~/anaconda3/envs/psp/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 79, in _terminate_process_handler
raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval)
torch.distributed.elastic.multiprocessing.api.SignalException: Process 3888043 got signal: 2
More info:
- I tried uninstalling the PyTorch that was installed via conda and reinstalling it with pip, but this made no difference.
- I updated vLLM from v0.6.0 to v0.6.1.post1; nothing changed.
- The issue only happens when I use tensor parallelism: with tensor_parallel_size=1 it does not occur (a minimal comparison is sketched below).
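A minimal comparison of the two cases, assuming the same model and environment (a sketch; run one case at a time):

```python
from vllm import LLM

# tensor_parallel_size=1: initializes fine; no cross-GPU NCCL communicator is needed.
llm = LLM(model="Qwen/Qwen2-7B-Instruct", dtype="bfloat16", tensor_parallel_size=1)

# tensor_parallel_size=2: hangs while the NCCL communicator is being initialized.
# llm = LLM(model="Qwen/Qwen2-7B-Instruct", dtype="bfloat16", tensor_parallel_size=2)
```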
This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!
This issue has been automatically closed due to inactivity. Please feel free to reopen if you feel it is still relevant. Thank you!