
[Bug]: Shutdown during Qwen2.5-VL-72B inference on 4 A800s

Open nku-zhichengzhang opened this issue 7 months ago • 2 comments

Your current environment

The output of `python collect_env.py`
PyTorch version: 2.5.0+cu124
Is debug build: False
CUDA used to build PyTorch: 12.4
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.6 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
Clang version: Could not collect
CMake version: version 3.29.1
Libc version: glibc-2.31

Python version: 3.11.11 (main, Dec 11 2024, 16:28:39) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-4.18.0-348.7.1.el8_5.x86_64-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: 12.3.107
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: 
GPU 0: NVIDIA A800-SXM4-80GB
GPU 1: NVIDIA A800-SXM4-80GB
GPU 2: NVIDIA A800-SXM4-80GB
GPU 3: NVIDIA A800-SXM4-80GB
GPU 4: NVIDIA A800-SXM4-80GB
GPU 5: NVIDIA A800-SXM4-80GB
GPU 6: NVIDIA A800-SXM4-80GB
GPU 7: NVIDIA A800-SXM4-80GB

Nvidia driver version: 535.161.08
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.7
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   46 bits physical, 57 bits virtual
CPU(s):                          116
On-line CPU(s) list:             0-115
Thread(s) per core:              2
Core(s) per socket:              29
Socket(s):                       2
NUMA node(s):                    2
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           106
Model name:                      Intel(R) Xeon(R) Platinum 8350C CPU @ 2.60GHz
Stepping:                        6
CPU MHz:                         2599.996
BogoMIPS:                        5199.99
Hypervisor vendor:               KVM
Virtualization type:             full
L1d cache:                       2.7 MiB
L1i cache:                       1.8 MiB
L2 cache:                        72.5 MiB
L3 cache:                        96 MiB
NUMA node0 CPU(s):               0-57
NUMA node1 CPU(s):               58-115
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Spec store bypass: Vulnerable
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Vulnerable, IBPB: disabled, STIBP: disabled
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves arat avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq la57 rdpid md_clear arch_capabilities

Versions of relevant libraries:
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.26.4
[pip3] nvidia-cublas-cu12==12.4.5.8
[pip3] nvidia-cuda-cupti-cu12==12.4.127
[pip3] nvidia-cuda-nvrtc-cu12==12.4.127
[pip3] nvidia-cuda-runtime-cu12==12.4.127
[pip3] nvidia-cudnn-cu12==9.1.0.70
[pip3] nvidia-cufft-cu12==11.2.1.3
[pip3] nvidia-curand-cu12==10.3.5.147
[pip3] nvidia-cusolver-cu12==11.6.1.9
[pip3] nvidia-cusparse-cu12==12.3.1.170
[pip3] nvidia-cusparselt-cu12==0.6.2
[pip3] nvidia-ml-py==12.570.86
[pip3] nvidia-nccl-cu12==2.21.5
[pip3] nvidia-nvjitlink-cu12==12.4.127
[pip3] nvidia-nvtx-cu12==12.4.127
[pip3] pynvml==12.0.0
[pip3] pytorchvideo==0.1.5
[pip3] pyzmq==26.2.1
[pip3] sentence-transformers==3.4.1
[pip3] torch==2.5.0+cu124
[pip3] torchaudio==2.5.0+cu124
[pip3] torchvision==0.20.0+cu124
[pip3] transformers==4.49.0
[pip3] transformers-stream-generator==0.0.5
[pip3] triton==3.1.0
[conda] numpy                     1.26.4                   pypi_0    pypi
[conda] nvidia-cublas-cu12        12.4.5.8                 pypi_0    pypi
[conda] nvidia-cuda-cupti-cu12    12.4.127                 pypi_0    pypi
[conda] nvidia-cuda-nvrtc-cu12    12.4.127                 pypi_0    pypi
[conda] nvidia-cuda-runtime-cu12  12.4.127                 pypi_0    pypi
[conda] nvidia-cudnn-cu12         9.1.0.70                 pypi_0    pypi
[conda] nvidia-cufft-cu12         11.2.1.3                 pypi_0    pypi
[conda] nvidia-curand-cu12        10.3.5.147               pypi_0    pypi
[conda] nvidia-cusolver-cu12      11.6.1.9                 pypi_0    pypi
[conda] nvidia-cusparse-cu12      12.3.1.170               pypi_0    pypi
[conda] nvidia-cusparselt-cu12    0.6.2                    pypi_0    pypi
[conda] nvidia-ml-py              12.570.86                pypi_0    pypi
[conda] nvidia-nccl-cu12          2.21.5                   pypi_0    pypi
[conda] nvidia-nvjitlink-cu12     12.4.127                 pypi_0    pypi
[conda] nvidia-nvtx-cu12          12.4.127                 pypi_0    pypi
[conda] pynvml                    12.0.0                   pypi_0    pypi
[conda] pytorchvideo              0.1.5                    pypi_0    pypi
[conda] pyzmq                     26.2.1                   pypi_0    pypi
[conda] sentence-transformers     3.4.1                    pypi_0    pypi
[conda] torch                     2.5.0+cu124              pypi_0    pypi
[conda] torchaudio                2.5.0+cu124              pypi_0    pypi
[conda] torchcodec                0.1.0                    pypi_0    pypi
[conda] torchvision               0.20.0+cu124             pypi_0    pypi
[conda] transformers              4.49.0                   pypi_0    pypi
[conda] transformers-stream-generator 0.0.5                    pypi_0    pypi
[conda] triton                    3.1.0                    pypi_0    pypi
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.7.3
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    NIC0    NIC1    NIC2    NIC3    NIC4    NIC5    NIC6    NIC7    NIC8    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NV8     NV8     NV8     NV8     NV8     NV8     NV8     SYS     PHB     PHB     PHB     PHB     SYS     SYS     SYS     SYS     0-57    0               N/A
GPU1    NV8      X      NV8     NV8     NV8     NV8     NV8     NV8     SYS     PHB     PHB     PHB     PHB     SYS     SYS     SYS     SYS     0-57    0               N/A
GPU2    NV8     NV8      X      NV8     NV8     NV8     NV8     NV8     SYS     PHB     PHB     PHB     PHB     SYS     SYS     SYS     SYS     0-57    0               N/A
GPU3    NV8     NV8     NV8      X      NV8     NV8     NV8     NV8     SYS     PHB     PHB     PHB     PHB     SYS     SYS     SYS     SYS     0-57    0               N/A
GPU4    NV8     NV8     NV8     NV8      X      NV8     NV8     NV8     SYS     SYS     SYS     SYS     SYS     PHB     PHB     PHB     PHB     58-115  1               N/A
GPU5    NV8     NV8     NV8     NV8     NV8      X      NV8     NV8     SYS     SYS     SYS     SYS     SYS     PHB     PHB     PHB     PHB     58-115  1               N/A
GPU6    NV8     NV8     NV8     NV8     NV8     NV8      X      NV8     SYS     SYS     SYS     SYS     SYS     PHB     PHB     PHB     PHB     58-115  1               N/A
GPU7    NV8     NV8     NV8     NV8     NV8     NV8     NV8      X      SYS     SYS     SYS     SYS     SYS     PHB     PHB     PHB     PHB     58-115  1               N/A
NIC0    SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS      X      SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS
NIC1    PHB     PHB     PHB     PHB     SYS     SYS     SYS     SYS     SYS      X      PHB     PHB     PHB     SYS     SYS     SYS     SYS
NIC2    PHB     PHB     PHB     PHB     SYS     SYS     SYS     SYS     SYS     PHB      X      PHB     PHB     SYS     SYS     SYS     SYS
NIC3    PHB     PHB     PHB     PHB     SYS     SYS     SYS     SYS     SYS     PHB     PHB      X      PHB     SYS     SYS     SYS     SYS
NIC4    PHB     PHB     PHB     PHB     SYS     SYS     SYS     SYS     SYS     PHB     PHB     PHB      X      SYS     SYS     SYS     SYS
NIC5    SYS     SYS     SYS     SYS     PHB     PHB     PHB     PHB     SYS     SYS     SYS     SYS     SYS      X      PHB     PHB     PHB
NIC6    SYS     SYS     SYS     SYS     PHB     PHB     PHB     PHB     SYS     SYS     SYS     SYS     SYS     PHB      X      PHB     PHB
NIC7    SYS     SYS     SYS     SYS     PHB     PHB     PHB     PHB     SYS     SYS     SYS     SYS     SYS     PHB     PHB      X      PHB
NIC8    SYS     SYS     SYS     SYS     PHB     PHB     PHB     PHB     SYS     SYS     SYS     SYS     SYS     PHB     PHB     PHB      X 

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0
  NIC1: mlx5_1
  NIC2: mlx5_2
  NIC3: mlx5_3
  NIC4: mlx5_4
  NIC5: mlx5_5
  NIC6: mlx5_6
  NIC7: mlx5_7
  NIC8: mlx5_8

NVIDIA_VISIBLE_DEVICES=GPU-461efaa0-904b-b9bc-1d9e-a8211ab74248,GPU-6b4eee32-4b9d-64d7-bc92-c7967a752cc5,GPU-40e681c0-37a8-6846-bc72-6da1ae62ea6d,GPU-2bb0e965-8e04-c6ce-4f84-e536b13d378c,GPU-50e841f5-96a3-a386-d38b-9332a4ffe2b4,GPU-eac870db-86c9-979b-8560-906489457607,GPU-03bce450-81a1-5f7a-cab2-07453226d367,GPU-e47c9312-5289-a447-efeb-888130c4c520
NVIDIA_REQUIRE_CUDA=cuda>=12.3 brand=tesla,driver>=470,driver<471 brand=unknown,driver>=470,driver<471 brand=nvidia,driver>=470,driver<471 brand=nvidiartx,driver>=470,driver<471 brand=geforce,driver>=470,driver<471 brand=geforcertx,driver>=470,driver<471 brand=quadro,driver>=470,driver<471 brand=quadrortx,driver>=470,driver<471 brand=titan,driver>=470,driver<471 brand=titanrtx,driver>=470,driver<471 brand=tesla,driver>=525,driver<526 brand=unknown,driver>=525,driver<526 brand=nvidia,driver>=525,driver<526 brand=nvidiartx,driver>=525,driver<526 brand=geforce,driver>=525,driver<526 brand=geforcertx,driver>=525,driver<526 brand=quadro,driver>=525,driver<526 brand=quadrortx,driver>=525,driver<526 brand=titan,driver>=525,driver<526 brand=titanrtx,driver>=525,driver<526 brand=tesla,driver>=535,driver<536 brand=unknown,driver>=535,driver<536 brand=nvidia,driver>=535,driver<536 brand=nvidiartx,driver>=535,driver<536 brand=geforce,driver>=535,driver<536 brand=geforcertx,driver>=535,driver<536 brand=quadro,driver>=535,driver<536 brand=quadrortx,driver>=535,driver<536 brand=titan,driver>=535,driver<536 brand=titanrtx,driver>=535,driver<536
NCCL_VERSION=2.20.5-1
NVIDIA_DRIVER_CAPABILITIES=compute,utility
NVIDIA_PRODUCT_NAME=CUDA
CUDA_VERSION=12.3.2
LD_LIBRARY_PATH=/home/zhangzhicheng03/anaconda3/envs/videva/lib/python3.11/site-packages/cv2/../../lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
NCCL_IB_DISABLE=1
NCCL_CUMEM_ENABLE=0
TORCHINDUCTOR_COMPILE_THREADS=1
CUDA_MODULE_LOADING=LAZY

🐛 Describe the bug

The engine shuts down immediately after printing NCCL INFO-level messages.

[Screenshot of the shutdown output]

Before submitting a new issue...

  • [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

nku-zhichengzhang avatar Apr 24 '25 09:04 nku-zhichengzhang

Can you follow https://docs.vllm.ai/en/latest/getting_started/troubleshooting.html to get more detailed logs? cc @youkaichao
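The troubleshooting guide's first step is to turn up logging so the failure point becomes visible. A minimal sketch of the environment variables it documents for debugging crashes and hangs (set these before launching the run):

```shell
export VLLM_LOGGING_LEVEL=DEBUG   # verbose vLLM logs
export CUDA_LAUNCH_BLOCKING=1     # surface CUDA errors at the failing call
export NCCL_DEBUG=TRACE           # detailed NCCL logs
export VLLM_TRACE_FUNCTION=1      # trace every Python function call (slow; debugging only)
```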

DarkLight1337 avatar Apr 24 '25 10:04 DarkLight1337

> Can you follow https://docs.vllm.ai/en/latest/getting_started/troubleshooting.html to get more detailed logs? cc @youkaichao

Thanks for the reply; here is the log.

run sh:

```
/home/zhangzhicheng03/anaconda3/envs/videva/bin/python /home/zhangzhicheng03/code/face-llm/ms-swift/swift/cli/infer.py --ckpt_dir /home/zhangzhicheng03/HuggingFace/VideoLLM/models--Qwen--Qwen2.5-VL-72B-Instruct --infer_backend vllm --val_dataset /home/zhangzhicheng03/code/face-llm/all_anno_clean_v2/QA_train_split/QA_training_21.json --gpu_memory_utilization 0.8 --torch_dtype bfloat16 --max_new_tokens 2048 --max-num-seqs 16 --streaming False --max_batch_size 8 --tensor_parallel_size 4 --result_path /home/zhangzhicheng03/code/face-llm/qwenvl/QA_ver_res_train/QA_training_21.json --attn_impl flash_attn --limit_mm_per_prompt {"image": 0, "video": 1} --max_model_len 32768 --model_type qwen2_5_vl
```

Log output:

```
[INFO:swift] Successfully registered /home/zhangzhicheng03/code/face-llm/ms-swift/swift/llm/dataset/data/dataset_info.json
[WARNING:swift] The --ckpt_dir parameter will be removed in ms-swift>=3.2. Please use --model, --adapters.
[INFO:swift] rank: -1, local_rank: -1, world_size: 1, local_world_size: 1
[INFO:swift] Loading the model using model_dir: /home/zhangzhicheng03/HuggingFace/VideoLLM/models--Qwen--Qwen2.5-VL-72B-Instruct
[INFO:swift] Because len(args.val_dataset) > 0, setting split_dataset_ratio: 0.0
[INFO:swift] Setting args.eval_human: False
[INFO:swift] Global seed set to 42
[INFO:swift] args: InferArguments(model='/home/zhangzhicheng03/HuggingFace/VideoLLM/models--Qwen--Qwen2.5-VL-72B-Instruct', model_type='qwen2_5_vl', model_revision=None, task_type='causal_lm', torch_dtype=torch.bfloat16, attn_impl='flash_attn', num_labels=None, rope_scaling=None, device_map=None, local_repo_path=None, template='qwen2_5_vl', system=None, max_length=None, truncation_strategy='delete', max_pixels=None, tools_prompt='react_en', norm_bbox=None, padding_side='right', loss_scale='default', sequence_parallel_size=1, use_chat_template=True, template_backend='swift', dataset=[], val_dataset=['/home/zhangzhicheng03/code/face-llm/all_anno_clean_v2/QA_train_split/QA_training_21.json'], split_dataset_ratio=0.0, data_seed=42, dataset_num_proc=1, streaming=False, enable_cache=False, download_mode='reuse_dataset_if_exists', columns={}, strict=False, remove_unused_columns=True, model_name=[None, None], model_author=[None, None], custom_dataset_info=[], quant_method=None, quant_bits=None, hqq_axis=None, bnb_4bit_compute_dtype=torch.bfloat16, bnb_4bit_quant_type='nf4', bnb_4bit_use_double_quant=True, bnb_4bit_quant_storage=None, max_new_tokens=2048, temperature=None, top_k=None, top_p=None, repetition_penalty=None, num_beams=1, stream=False, stop_words=[], logprobs=False, top_logprobs=None, ckpt_dir=None, load_dataset_config=None, lora_modules=[], tuner_backend='peft', train_type='lora', adapters=[], seed=42, model_kwargs={}, load_args=True, load_data_args=False, use_hf=False, hub_token=None, custom_register_path=[], ignore_args_error=False, use_swift_lora=False, tp=1, session_len=None, cache_max_entry_count=0.8, quant_policy=0, vision_batch_size=1, gpu_memory_utilization=0.8, tensor_parallel_size=4, pipeline_parallel_size=1, max_num_seqs=16, max_model_len=32768, disable_custom_all_reduce=False, enforce_eager=False, limit_mm_per_prompt={'image': 0, 'video': 1}, vllm_max_lora_rank=16, enable_prefix_caching=False, merge_lora=False, safe_serialization=True, max_shard_size='5GB', infer_backend='vllm', result_path='/home/zhangzhicheng03/code/face-llm/qwenvl/QA_ver_res_train/QA_training_21.json', metric=None, max_batch_size=8, ddp_backend=None, val_dataset_sample=None)
[INFO:swift] Loading the model using model_dir: /home/zhangzhicheng03/HuggingFace/VideoLLM/models--Qwen--Qwen2.5-VL-72B-Instruct
Using a slow image processor as use_fast is unset and a slow processor was saved with this model. use_fast=True will be the default behavior in v4.48, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with use_fast=False.
[INFO:swift] Setting image_factor: 28. You can adjust this hyperparameter through the environment variable: IMAGE_FACTOR.
[INFO:swift] Setting min_pixels: 3136. You can adjust this hyperparameter through the environment variable: MIN_PIXELS.
[INFO:swift] Setting max_pixels: 12845056. You can adjust this hyperparameter through the environment variable: MAX_PIXELS.
[INFO:swift] Setting max_ratio: 200. You can adjust this hyperparameter through the environment variable: MAX_RATIO.
[INFO:swift] Setting video_min_pixels: 100352. You can adjust this hyperparameter through the environment variable: VIDEO_MIN_PIXELS.
[INFO:swift] Using environment variable VIDEO_MAX_PIXELS, Setting video_max_pixels: 100352.
[INFO:swift] Setting video_total_pixels: 100352. You can adjust this hyperparameter through the environment variable: VIDEO_TOTAL_PIXELS.
[INFO:swift] Setting frame_factor: 2. You can adjust this hyperparameter through the environment variable: FRAME_FACTOR.
[INFO:swift] Setting fps: 2.0. You can adjust this hyperparameter through the environment variable: FPS.
[INFO:swift] Setting fps_min_frames: 4. You can adjust this hyperparameter through the environment variable: FPS_MIN_FRAMES.
[INFO:swift] Using environment variable FPS_MAX_FRAMES, Setting fps_max_frames: 16.
DEBUG 04-27 03:13:24 init.py:28] No plugins for group vllm.platform_plugins found.
INFO 04-27 03:13:24 init.py:207] Automatically detected platform cuda.
DEBUG 04-27 03:13:24 init.py:28] No plugins for group vllm.general_plugins found.
INFO 04-27 03:13:30 config.py:549] This model supports multiple tasks: {'generate', 'classify', 'score', 'embed', 'reward'}. Defaulting to 'generate'.
INFO 04-27 03:13:30 config.py:1382] Defaulting to use mp for distributed inference
INFO 04-27 03:13:30 llm_engine.py:234] Initializing a V0 LLM engine (v0.7.3) with config: model='/home/zhangzhicheng03/HuggingFace/VideoLLM/models--Qwen--Qwen2.5-VL-72B-Instruct', speculative_config=None, tokenizer='/home/zhangzhicheng03/HuggingFace/VideoLLM/models--Qwen--Qwen2.5-VL-72B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=4, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/home/zhangzhicheng03/HuggingFace/VideoLLM/models--Qwen--Qwen2.5-VL-72B-Instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[16,8,4,2,1],"max_capture_size":16}, use_cached_outputs=False,
WARNING 04-27 03:13:30 multiproc_worker_utils.py:300] Reducing Torch parallelism from 58 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO 04-27 03:13:30 custom_cache_manager.py:19] Setting Triton cache manager to: vllm.triton_utils.custom_cache_manager:CustomCacheManager
WARNING 04-27 03:13:30 logger.py:202] VLLM_TRACE_FUNCTION is enabled. It will record every function executed by Python. This will slow down the code. It is suggested to be used for debugging hang or crashes only.
INFO 04-27 03:13:30 logger.py:206] Trace frame log is saved to /tmp/root/vllm/vllm-instance-ed9df/VLLM_TRACE_FUNCTION_for_process_47298_thread_140669624403200_at_2025-04-27_03:13:30.791136.log
[INFO:swift] Successfully registered /home/zhangzhicheng03/code/face-llm/ms-swift/swift/llm/dataset/data/dataset_info.json
[INFO:swift] Successfully registered /home/zhangzhicheng03/code/face-llm/ms-swift/swift/llm/dataset/data/dataset_info.json
[INFO:swift] Successfully registered /home/zhangzhicheng03/code/face-llm/ms-swift/swift/llm/dataset/data/dataset_info.json
DEBUG 04-27 03:13:36 init.py:28] No plugins for group vllm.platform_plugins found.
INFO 04-27 03:13:36 init.py:207] Automatically detected platform cuda.
DEBUG 04-27 03:13:36 init.py:28] No plugins for group vllm.platform_plugins found.
INFO 04-27 03:13:36 init.py:207] Automatically detected platform cuda.
(VllmWorkerProcess pid=47808) INFO 04-27 03:13:36 multiproc_worker_utils.py:229] Worker ready; awaiting tasks
(VllmWorkerProcess pid=47808) WARNING 04-27 03:13:36 logger.py:202] VLLM_TRACE_FUNCTION is enabled. It will record every function executed by Python. This will slow down the code. It is suggested to be used for debugging hang or crashes only.
(VllmWorkerProcess pid=47808) INFO 04-27 03:13:36 logger.py:206] Trace frame log is saved to /tmp/root/vllm/vllm-instance-ed9df/VLLM_TRACE_FUNCTION_for_process_47808_thread_140652367746304_at_2025-04-27_03:13:36.855844.log
(VllmWorkerProcess pid=47806) INFO 04-27 03:13:36 multiproc_worker_utils.py:229] Worker ready; awaiting tasks
(VllmWorkerProcess pid=47806) WARNING 04-27 03:13:36 logger.py:202] VLLM_TRACE_FUNCTION is enabled. It will record every function executed by Python. This will slow down the code. It is suggested to be used for debugging hang or crashes only.
(VllmWorkerProcess pid=47806) INFO 04-27 03:13:36 logger.py:206] Trace frame log is saved to /tmp/root/vllm/vllm-instance-ed9df/VLLM_TRACE_FUNCTION_for_process_47806_thread_139679837152512_at_2025-04-27_03:13:36.912702.log
DEBUG 04-27 03:13:36 init.py:28] No plugins for group vllm.platform_plugins found.
INFO 04-27 03:13:36 init.py:207] Automatically detected platform cuda.
(VllmWorkerProcess pid=47807) INFO 04-27 03:13:37 multiproc_worker_utils.py:229] Worker ready; awaiting tasks
(VllmWorkerProcess pid=47807) WARNING 04-27 03:13:37 logger.py:202] VLLM_TRACE_FUNCTION is enabled. It will record every function executed by Python. This will slow down the code. It is suggested to be used for debugging hang or crashes only.
(VllmWorkerProcess pid=47807) INFO 04-27 03:13:37 logger.py:206] Trace frame log is saved to /tmp/root/vllm/vllm-instance-ed9df/VLLM_TRACE_FUNCTION_for_process_47807_thread_140082128848128_at_2025-04-27_03:13:37.043274.log
(VllmWorkerProcess pid=47808) DEBUG 04-27 03:13:37 init.py:28] No plugins for group vllm.general_plugins found.
(VllmWorkerProcess pid=47806) DEBUG 04-27 03:13:37 init.py:28] No plugins for group vllm.general_plugins found.
INFO 04-27 03:13:37 cuda.py:229] Using Flash Attention backend.
DEBUG 04-27 03:13:37 config.py:3461] enabled custom ops: Counter()
DEBUG 04-27 03:13:37 config.py:3463] disabled custom ops: Counter()
(VllmWorkerProcess pid=47807) DEBUG 04-27 03:13:37 init.py:28] No plugins for group vllm.general_plugins found.
(VllmWorkerProcess pid=47808) INFO 04-27 03:13:43 cuda.py:229] Using Flash Attention backend.
(VllmWorkerProcess pid=47808) DEBUG 04-27 03:13:43 config.py:3461] enabled custom ops: Counter()
(VllmWorkerProcess pid=47808) DEBUG 04-27 03:13:43 config.py:3463] disabled custom ops: Counter()
(VllmWorkerProcess pid=47806) INFO 04-27 03:13:43 cuda.py:229] Using Flash Attention backend.
(VllmWorkerProcess pid=47806) DEBUG 04-27 03:13:43 config.py:3461] enabled custom ops: Counter()
(VllmWorkerProcess pid=47806) DEBUG 04-27 03:13:43 config.py:3463] disabled custom ops: Counter()
(VllmWorkerProcess pid=47807) INFO 04-27 03:13:44 cuda.py:229] Using Flash Attention backend.
(VllmWorkerProcess pid=47807) DEBUG 04-27 03:13:44 config.py:3461] enabled custom ops: Counter()
(VllmWorkerProcess pid=47807) DEBUG 04-27 03:13:44 config.py:3463] disabled custom ops: Counter()
DEBUG 04-27 03:13:44 parallel_state.py:810] world_size=4 rank=0 local_rank=0 distributed_init_method=tcp://10.252.128.175:65209 backend=nccl
(VllmWorkerProcess pid=47808) DEBUG 04-27 03:13:44 parallel_state.py:810] world_size=4 rank=3 local_rank=3 distributed_init_method=tcp://10.252.128.175:65209 backend=nccl
(VllmWorkerProcess pid=47806) DEBUG 04-27 03:13:44 parallel_state.py:810] world_size=4 rank=1 local_rank=1 distributed_init_method=tcp://10.252.128.175:65209 backend=nccl
(VllmWorkerProcess pid=47807) DEBUG 04-27 03:13:44 parallel_state.py:810] world_size=4 rank=2 local_rank=2 distributed_init_method=tcp://10.252.128.175:65209 backend=nccl
(VllmWorkerProcess pid=47808) INFO 04-27 03:13:45 utils.py:916] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=47808) INFO 04-27 03:13:45 pynccl.py:69] vLLM is using nccl==2.21.5
(VllmWorkerProcess pid=47807) INFO 04-27 03:13:45 utils.py:916] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=47807) INFO 04-27 03:13:45 pynccl.py:69] vLLM is using nccl==2.21.5
INFO 04-27 03:13:45 utils.py:916] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=47806) INFO 04-27 03:13:45 utils.py:916] Found nccl from library libnccl.so.2
INFO 04-27 03:13:45 pynccl.py:69] vLLM is using nccl==2.21.5
(VllmWorkerProcess pid=47806) INFO 04-27 03:13:45 pynccl.py:69] vLLM is using nccl==2.21.5
a800bcctest0136-bd:47298:47298 [0] NCCL INFO Bootstrap : Using eth0:10.252.128.175<0>
a800bcctest0136-bd:47298:47298 [0] NCCL INFO NET/Plugin: No plugin found (libnccl-net.so)
a800bcctest0136-bd:47298:47298 [0] NCCL INFO NET/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-net.so
a800bcctest0136-bd:47298:47298 [0] NCCL INFO NET/Plugin: Using internal network plugin.
a800bcctest0136-bd:47298:47298 [0] NCCL INFO cudaDriverVersion 12020
NCCL version 2.21.5+cuda12.4
a800bcctest0136-bd:47808:47808 [3] NCCL INFO cudaDriverVersion 12020
a800bcctest0136-bd:47807:47807 [2] NCCL INFO cudaDriverVersion 12020
a800bcctest0136-bd:47806:47806 [1] NCCL INFO cudaDriverVersion 12020
a800bcctest0136-bd:47808:47808 [3] NCCL INFO Bootstrap : Using eth0:10.252.128.175<0>
a800bcctest0136-bd:47807:47807 [2] NCCL INFO Bootstrap : Using eth0:10.252.128.175<0>
a800bcctest0136-bd:47806:47806 [1] NCCL INFO Bootstrap : Using eth0:10.252.128.175<0>
a800bcctest0136-bd:47808:47808 [3] NCCL INFO NET/Plugin: No plugin found (libnccl-net.so)
a800bcctest0136-bd:47808:47808 [3] NCCL INFO NET/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-net.so
a800bcctest0136-bd:47808:47808 [3] NCCL INFO NET/Plugin: Using internal network plugin.
a800bcctest0136-bd:47807:47807 [2] NCCL INFO NET/Plugin: No plugin found (libnccl-net.so)
a800bcctest0136-bd:47807:47807 [2] NCCL INFO NET/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-net.so
a800bcctest0136-bd:47807:47807 [2] NCCL INFO NET/Plugin: Using internal network plugin.
a800bcctest0136-bd:47806:47806 [1] NCCL INFO NET/Plugin: No plugin found (libnccl-net.so)
a800bcctest0136-bd:47806:47806 [1] NCCL INFO NET/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-net.so
a800bcctest0136-bd:47806:47806 [1] NCCL INFO NET/Plugin: Using internal network plugin.
a800bcctest0136-bd:47808:47808 [3] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
a800bcctest0136-bd:47808:47808 [3] NCCL INFO NET/Socket : Using [0]eth0:10.252.128.175<0> [1]kflax:11.43.252.176<0> [2]kflax-vxlan:fe80::34a6:77ff:fef3:8a98%kflax-vxlan<0>
a800bcctest0136-bd:47808:47808 [3] NCCL INFO Using non-device net plugin version 0
a800bcctest0136-bd:47808:47808 [3] NCCL INFO Using network Socket
a800bcctest0136-bd:47807:47807 [2] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
a800bcctest0136-bd:47807:47807 [2] NCCL INFO NET/Socket : Using [0]eth0:10.252.128.175<0> [1]kflax:11.43.252.176<0> [2]kflax-vxlan:fe80::34a6:77ff:fef3:8a98%kflax-vxlan<0>
a800bcctest0136-bd:47807:47807 [2] NCCL INFO Using non-device net plugin version 0
a800bcctest0136-bd:47807:47807 [2] NCCL INFO Using network Socket
a800bcctest0136-bd:47298:47298 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
a800bcctest0136-bd:47298:47298 [0] NCCL INFO NET/Socket : Using [0]eth0:10.252.128.175<0> [1]kflax:11.43.252.176<0> [2]kflax-vxlan:fe80::34a6:77ff:fef3:8a98%kflax-vxlan<0>
a800bcctest0136-bd:47298:47298 [0] NCCL INFO Using non-device net plugin version 0
a800bcctest0136-bd:47298:47298 [0] NCCL INFO Using network Socket
a800bcctest0136-bd:47806:47806 [1] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
a800bcctest0136-bd:47806:47806 [1] NCCL INFO NET/Socket : Using [0]eth0:10.252.128.175<0> [1]kflax:11.43.252.176<0> [2]kflax-vxlan:fe80::34a6:77ff:fef3:8a98%kflax-vxlan<0>
a800bcctest0136-bd:47806:47806 [1] NCCL INFO Using non-device net plugin version 0
a800bcctest0136-bd:47806:47806 [1] NCCL INFO Using network Socket
a800bcctest0136-bd:47298:47298 [0] NCCL INFO ncclCommInitRank comm 0x13593af0 rank 0 nranks 4 cudaDev 0 nvmlDev 0 busId 61000 commId 0x45f3f6986cbbbd5 - Init START
a800bcctest0136-bd:47806:47806 [1] NCCL INFO ncclCommInitRank comm 0xe47a2f0 rank 1 nranks 4 cudaDev 1 nvmlDev 1 busId 62000 commId 0x45f3f6986cbbbd5 - Init START
a800bcctest0136-bd:47807:47807 [2] NCCL INFO ncclCommInitRank comm 0xf456a40 rank 2 nranks 4 cudaDev 2 nvmlDev 2 busId 6b000 commId 0x45f3f6986cbbbd5 - Init START
a800bcctest0136-bd:47808:47808 [3] NCCL INFO ncclCommInitRank comm 0xeeb36f0 rank 3 nranks 4 cudaDev 3 nvmlDev 3 busId 6c000 commId 0x45f3f6986cbbbd5 - Init START
a800bcctest0136-bd:47806:47806 [1] NCCL INFO NCCL_CUMEM_ENABLE set by environment to 0.
a800bcctest0136-bd:47808:47808 [3] NCCL INFO NCCL_CUMEM_ENABLE set by environment to 0.
a800bcctest0136-bd:47807:47807 [2] NCCL INFO NCCL_CUMEM_ENABLE set by environment to 0.
a800bcctest0136-bd:47298:47298 [0] NCCL INFO NCCL_CUMEM_ENABLE set by environment to 0.
/home/zhangzhicheng03/anaconda3/envs/videva/lib/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 12 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
```
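The log stops right after the NCCL communicators reach "Init START", followed only by the leaked-semaphore warning, which suggests the worker processes died while setting up collectives. As a first isolation step, the troubleshooting guide recommends running a small `torch.distributed` all-reduce check outside of vLLM. A minimal sketch (the single-process defaults and gloo/CPU fallback are additions so the script also runs standalone on a machine without GPUs; on the real node, launch it with `torchrun --nproc-per-node=4`):

```python
import os
import torch
import torch.distributed as dist

# Single-process defaults so the script also runs standalone;
# under torchrun the launcher provides these environment variables.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29501")
os.environ.setdefault("RANK", "0")
os.environ.setdefault("WORLD_SIZE", "1")

use_cuda = torch.cuda.is_available()
dist.init_process_group(backend="nccl" if use_cuda else "gloo")

rank = dist.get_rank()
if use_cuda:
    torch.cuda.set_device(rank % torch.cuda.device_count())

# All-reduce a vector of ones: each element should equal world_size.
data = torch.ones(128, device="cuda" if use_cuda else "cpu")
dist.all_reduce(data, op=dist.ReduceOp.SUM)
if use_cuda:
    torch.cuda.synchronize()

world_size = dist.get_world_size()
result = data.mean().item()
assert result == world_size, f"all_reduce gave {result}, expected {world_size}"
print("sanity check successful")
dist.destroy_process_group()
```

If this hangs or crashes the same way, the problem is in the NCCL/driver/network layer rather than in vLLM itself.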

nku-zhichengzhang avatar Apr 26 '25 19:04 nku-zhichengzhang

This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!

github-actions[bot] avatar Jul 27 '25 02:07 github-actions[bot]

This issue has been automatically closed due to inactivity. Please feel free to reopen if you feel it is still relevant. Thank you!

github-actions[bot] avatar Aug 27 '25 02:08 github-actions[bot]