Qwen3-Omni GRPO multi-node multi-GPU training fails with an NCCL communication error
vllm: 0.11.0
transformers: 4.57.1
gpu: 4 nodes × 8 H100 each
Watchdog caught collective operation timeout: WorkNCCL(...) ran for 600000+ milliseconds before timing out

[rank6]:[E1112 22:23:46.917850023 ProcessGroupNCCL.cpp:1870] [PG ID 0 PG GUID 0(default_pg) Rank 6] Received a dump signal due to a collective timeout from this local rank and we will try our best to dump the debug info. Last enqueued NCCL work: 23, last completed NCCL work: 23. This is most likely caused by incorrect usages of collectives, e.g., wrong sizes used across ranks, the order of collectives is not same for all ranks or the scheduled collective, for some reason, didn't run. Additionally, this can be caused by GIL deadlock or other reasons such as network errors or bugs in the communications library (e.g. NCCL), etc.
[rank1]:[E1112 22:23:46.917871206 ProcessGroupNCCL.cpp:1870] [PG ID 0 PG GUID 0(default_pg) Rank 1] Received a dump signal due to a collective timeout from this local rank and we will try our best to dump the debug info. Last enqueued NCCL work: 23, last completed NCCL work: 23. This is most likely caused by incorrect usages of collectives, e.g., wrong sizes used across ranks, the order of collectives is not same for all ranks or the scheduled collective, for some reason, didn't run. Additionally, this can be caused by GIL deadlock or other reasons such as network errors or bugs in the communications library (e.g. NCCL), etc.
[rank3]:[E1112 22:23:46.917882738 ProcessGroupNCCL.cpp:1806] [PG ID 0 PG GUID 0(default_pg) Rank 3] Observed flight recorder dump signal from another rank via TCPStore.
[rank2]:[E1112 22:23:46.917905688 ProcessGroupNCCL.cpp:1589] [PG ID 0 PG GUID 0(default_pg) Rank 2] ProcessGroupNCCL preparing to dump debug info. Include stack trace: 1
[rank0]:[E1112 22:23:46.917913820 ProcessGroupNCCL.cpp:1589] [PG ID 0 PG GUID 0(default_pg) Rank 0] ProcessGroupNCCL preparing to dump debug info. Include stack trace: 1
The training script is:

```shell
pip install transformers math_verify qwen_omni_utils trl -U
pip uninstall -y ms-swift
cd ms-swift
pip install -e .

export NNODES=${WORLD_SIZE}  # the platform's WORLD_SIZE is the number of nodes
export NODE_RANK=${RANK}     # node rank

LOG_FILE=20251113log_${NODE_RANK}.txt

MAX_PIXELS=1003520 \
NPROC_PER_NODE=8 \
ENABLE_AUDIO_OUTPUT=0 \
swift rlhf \
    --rlhf_type grpo \
    --model /mnt/afs/lujiefan/open_source_model/Qwen/Qwen3-Omni-30B-A3B-Instruct \
    --reward_funcs icassp_task3 \
    --reward_weights 1 \
    --train_type lora \
    --lora_rank 8 \
    --lora_alpha 32 \
    --target_modules all-linear \
    --torch_dtype bfloat16 \
    --dataset task1_3_GRPO_extra_info \
    --load_from_cache_file true \
    --external_plugins plugin.py \
    --max_completion_length 512 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --learning_rate 1e-5 \
    --gradient_accumulation_steps 1 \
    --eval_steps 100 \
    --save_steps 100 \
    --save_total_limit 2 \
    --logging_steps 5 \
    --max_length 1024 \
    --output_dir output \
    --warmup_ratio 0.05 \
    --dataloader_num_workers 8 \
    --dataset_num_proc 8 \
    --num_generations 8 \
    --temperature 1. \
    --top_p 0.99 \
    --top_k 50 \
    --system prompt.txt \
    --deepspeed zero2 \
    --log_completions true \
    2>&1 | tee ${LOG_FILE}
```
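Since the watchdog fires without a stack trace for the failed collective, it can help to turn on PyTorch/NCCL diagnostics before the next run so the timeout dump shows which collective each rank was stuck on. A minimal sketch of the relevant environment variables, set before launching the script above; the buffer size and the interface name are assumptions to adjust for your cluster:

```shell
# Flight recorder: keep a ring buffer of recent collectives so a timeout
# dump records what each rank had enqueued/completed.
export TORCH_NCCL_TRACE_BUFFER_SIZE=2000
export TORCH_NCCL_DUMP_ON_TIMEOUT=1

# Verbose NCCL logging, useful for spotting interface / IB selection
# mismatches across the 4 nodes.
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,NET

# Assumption: uncomment and set to your real NIC if NCCL auto-detection
# picks the wrong interface on this platform.
# export NCCL_SOCKET_IFNAME=eth0
```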
We also encountered a similar problem.
INFO: 127.0.0.1:46204 - "POST /reset_prefix_cache/ HTTP/1.1" 200 OK
0%| | 0/48 [00:00<?, ?it/s][rank0]:[E1121 09:56:33.026928129 ProcessGroupNCCL.cpp:685] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=40752, OpType=ALLREDUCE, NumelIn=2, NumelOut=2, Timeout(ms)=600000) ran for 600050 milliseconds before timing out.
[rank0]:[E1121 09:56:33.028415279 ProcessGroupNCCL.cpp:2252] [PG ID 5 PG GUID 15 Rank 0] failure detected by watchdog at work sequence id: 40752 PG status: last enqueued work: 40752, last completed work: 40751
[rank0]:[E1121 09:56:33.028450724 ProcessGroupNCCL.cpp:732] Stack trace of the failed collective not found, potentially because FlightRecorder is disabled. You can enable it by setting TORCH_NCCL_TRACE_BUFFER_SIZE to a non-zero value.
[rank0]:[E1121 09:56:33.028528273 ProcessGroupNCCL.cpp:2584] [PG ID 5 PG GUID 15 Rank 0] First PG on this rank to signal dumping.
[rank0]:[E1121 09:56:34.516023294 ProcessGroupNCCL.cpp:1870] [PG ID 0 PG GUID 0 Rank 0] Received a dump signal due to a collective timeout from this local rank and we will try our best to dump the debug info. Last enqueued NCCL work: -1, last completed NCCL work: -1.This is most likely caused by incorrect usages of collectives, e.g., wrong sizes used across ranks, the order of collectives is not same for all ranks or the scheduled collective, for some reason, didn't run. Additionally, this can be caused by GIL deadlock or other reasons such as network errors or bugs in the communications library (e.g. NCCL), etc.
[rank0]:[E1121 09:56:34.516487503 ProcessGroupNCCL.cpp:1589] [PG ID 0 PG GUID 0 Rank 0] ProcessGroupNCCL preparing to dump debug info. Include stack trace: 1
[rank1]:[E1121 09:56:34.907947305 ProcessGroupNCCL.cpp:1806] [PG ID 0 PG GUID 0 Rank 1] Observed flight recorder dump signal from another rank via TCPStore.
[rank1]:[E1121 09:56:34.908219769 ProcessGroupNCCL.cpp:1870] [PG ID 0 PG GUID 0 Rank 1] Received a dump signal due to a collective timeout from rank 0 and we will try our best to dump the debug info. Last enqueued NCCL work: -1, last completed NCCL work: -1.This is most likely caused by incorrect usages of collectives, e.g., wrong sizes used across ranks, the order of collectives is not same for all ranks or the scheduled collective, for some reason, didn't run. Additionally, this can be caused by GIL deadlock or other reasons such as network errors or bugs in the communications library (e.g. NCCL), etc.
[rank1]:[E1121 09:56:34.908984968 ProcessGroupNCCL.cpp:1589] [PG ID 0 PG GUID 0 Rank 1] ProcessGroupNCCL preparing to dump debug info. Include stack trace: 1
[rank0]:[E1121 09:57:33.028791605 ProcessGroupNCCL.cpp:746] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank0]:[E1121 09:57:33.028889731 ProcessGroupNCCL.cpp:760] [Rank 0] To avoid data inconsistency, we are taking the entire process down.
[rank0]:[E1121 09:57:33.032499119 ProcessGroupNCCL.cpp:2068] [PG ID 5 PG GUID 15 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=40752, OpType=ALLREDUCE, NumelIn=2, NumelOut=2, Timeout(ms)=600000) ran for 600050 milliseconds before timing out.
Exception raised from checkTimeout at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:688 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits
ERROR 11-21 09:57:34 [core_client.py:564] Engine core proc EngineCore_DP0 died unexpectedly, shutting down client.
[rank1]:[F1121 10:04:34.910279402 ProcessGroupNCCL.cpp:1614] [PG ID 0 PG GUID 0 Rank 1] [PG ID 0 PG GUID 0 Rank 1] Terminating the process after attempting to dump debug info, due to collective timeout or exception.
ERROR 11-21 10:04:34 [core_client.py:564] Engine core proc EngineCore_DP1 died unexpectedly, shutting down client.
Process Process-2:
Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/root/miniconda3/lib/python3.12/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/root/miniconda3/lib/python3.12/site-packages/swift/llm/infer/rollout.py", line 112, in llm_worker
    result = method(*args, **kwargs)
  File "/root/miniconda3/lib/python3.12/site-packages/swift/llm/infer/infer_engine/grpo_vllm_engine.py", line 103, in infer
    res = super().infer(
  File "/root/miniconda3/lib/python3.12/site-packages/swift/llm/infer/infer_engine/vllm_engine.py", line 694, in infer
    self._add_request(inputs, generation_config, request_id, adapter_request=adapter_request)
  File "/root/miniconda3/lib/python3.12/site-packages/swift/llm/infer/infer_engine/vllm_engine.py", line 379, in _add_request
    return self.engine.add_request(request_id, llm_inputs, generation_config, **kwargs)
  File "/root/miniconda3/lib/python3.12/site-packages/vllm/v1/engine/llm_engine.py", line 230, in add_request
    prompt_str, request = self.processor.process_inputs(
  File "/root/miniconda3/lib/python3.12/site-packages/vllm/v1/engine/processor.py", line 377, in process_inputs
    processed_inputs: ProcessorInputs = self.input_preprocessor.preprocess(
  File "/root/miniconda3/lib/python3.12/site-packages/vllm/inputs/preprocess.py", line 644, in preprocess
    return self._process_decoder_only_prompt(
  File "/root/miniconda3/lib/python3.12/site-packages/vllm/inputs/preprocess.py", line 614, in _process_decoder_only_prompt
    prompt_comps = self._prompt_to_llm_inputs(
  File "/root/miniconda3/lib/python3.12/site-packages/vllm/inputs/preprocess.py", line 388, in _prompt_to_llm_inputs
    return self._process_tokens(
  File "/root/miniconda3/lib/python3.12/site-packages/vllm/inputs/preprocess.py", line 317, in _process_tokens
    inputs = self._process_multimodal(
  File "/root/miniconda3/lib/python3.12/site-packages/vllm/inputs/preprocess.py", line 242, in _process_multimodal
    mm_input = mm_processor.apply(
  File "/root/miniconda3/lib/python3.12/site-packages/vllm/multimodal/processing.py", line 2036, in apply
    ) = self._cached_apply_hf_processor(
  File "/root/miniconda3/lib/python3.12/site-packages/vllm/multimodal/processing.py", line 1846, in _cached_apply_hf_processor
    mm_kwargs, mm_prompt_updates = self._merge_mm_kwargs(
  File "/root/miniconda3/lib/python3.12/site-packages/vllm/multimodal/processing.py", line 1723, in _merge_mm_kwargs
    kwargs, updates = cache.get_and_update_item(item, item_hash)
  File "/root/miniconda3/lib/python3.12/site-packages/vllm/multimodal/cache.py", line 323, in get_and_update_item
    assert mm_item is not None, f"Expected a cached item for {mm_hash=}"
AssertionError: Expected a cached item for mm_hash='445f392011cf185ab6bc8ee44df49189843d70a0185d514d97e0c87bbb26c717'
Try adjusting the vLLM multimodal processor cache size to 0 (disabled) or a larger value via `--vllm_mm_processor_cache_gb`.
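For reference, a minimal sketch of where that flag would go, assuming the vLLM server is started with `swift rollout` (the model path is a placeholder, and the exact argument name/placement should be checked against your ms-swift version):

```shell
# Hypothetical rollout launch with the multimodal processor cache disabled
# (0 GB), so the engine never has to look up a cached mm item that may be
# missing across processes.
swift rollout \
    --model /path/to/Qwen3-Omni-30B-A3B-Instruct \
    --vllm_mm_processor_cache_gb 0
```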