Qwen3-Next-80B-A3B-Thinking hangs during multi-node SFT training with Ray (NCCL timeout on InfiniBand)
Reminder
- [x] I have read the above rules and searched the existing issues.
System Info
When training Qwen3-Next-80B-A3B-Thinking across multiple nodes using Ray, the following error occurs:
```
(RayTrainWorker pid=364, ip=172.44.17.165) [rank19]:
[E1016 06:52:29.641158254 ProcessGroupNCCL.cpp:1895]
[PG ID 0 PG GUID 0(default_pg) Rank 19] Process group watchdog thread terminated with exception:
[Rank 19] Watchdog caught collective operation timeout:
WorkNCCL(SeqNum=298389, OpType=_ALLGATHER_BASE, NumelIn=32768, NumelOut=1048576, Timeout(ms)=1800000)
ran for 1800036 milliseconds before timing out.
[repeated 2x across cluster]
```
According to GPT, this is an NCCL collective-operation timeout, and the communication appears to be going over sockets (NET/Socket) rather than InfiniBand.
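If the fallback to NET/Socket is real, one way to verify and correct it is to export the relevant NCCL environment variables to every Ray worker before the process group is created. The sketch below only illustrates the mechanism; `mlx5_0` and `eth0` are placeholder device names, and where exactly `ray.init` is called depends on how the cluster is launched.

```python
import ray

# Hypothetical NCCL settings to force the InfiniBand transport and log the
# transport that is actually chosen. Replace the device names with the ones
# reported by `ibv_devinfo` / `ip link` on your nodes.
nccl_env = {
    "NCCL_DEBUG": "INFO",          # each rank logs the selected transport at startup
    "NCCL_IB_DISABLE": "0",        # allow the InfiniBand transport
    "NCCL_IB_HCA": "mlx5_0",       # placeholder HCA name
    "NCCL_SOCKET_IFNAME": "eth0",  # placeholder interface for bootstrap traffic
}

# Propagate the variables to every Ray worker via the runtime environment.
ray.init(address="auto", runtime_env={"env_vars": nccl_env})
```

With `NCCL_DEBUG=INFO` enabled, the startup logs show whether NCCL picked InfiniBand or NET/Socket, which would settle the question above.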
However, the issue seems specific to Qwen3-Next-80B-A3B-Thinking — other Qwen3 models do not show this problem under the same setup.
The model loads successfully and training starts, but the progress bar freezes right after initialization:
```
(RayTrainWorker pid=9940, ip=172.44.17.164)
[INFO|trainer.py:2528] >> Number of trainable parameters = 79,674,391,296
0%| | 0/84 [00:00<?, ?it/s]
```
After a short while, training is interrupted with the NCCL timeout error above.
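Since the watchdog default is 1800 s, raising the collective timeout is a quick way to distinguish a genuine deadlock from a first all-gather that is merely slow over a socket fallback. A minimal sketch, assuming the process group is initialized manually; if your launcher exposes HuggingFace's `ddp_timeout` training argument, that is the equivalent knob:

```python
from datetime import timedelta
import torch.distributed as dist

# Raise the NCCL collective timeout from the default 1800 s to 2 h.
# If training then proceeds, the first all_gather was slow rather than hung;
# if it still times out, the hang is a genuine deadlock.
dist.init_process_group(backend="nccl", timeout=timedelta(seconds=7200))
```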
The issue persists across different cluster sizes:
- 2 servers (16×80 GB GPUs in total)
- 4 servers (32×80 GB GPUs in total)
Therefore, it’s unlikely to be an OOM (Out-of-Memory) issue.
The problem seems inherent to this specific model version.
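To see where this particular model actually hangs, it helps to dump the Python stack of every rank while the progress bar is frozen. A minimal sketch using the standard-library `faulthandler`; the log path and the 600 s interval are arbitrary choices, and the call would need to be added early in each training worker:

```python
import faulthandler
import os

# Periodically dump every thread's Python stack so a hung rank reveals the call
# it is blocked in (e.g. a collective inside the Qwen3-Next forward pass).
rank = os.environ.get("RANK", "0")
log_file = open(f"/tmp/stacks_rank{rank}.log", "w")  # placeholder path
faulthandler.dump_traceback_later(timeout=600, repeat=True, file=log_file)
```

Alternatively, `py-spy dump` against the hung Ray worker PIDs gives the same information without modifying any code.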
Reproduction
Put your message here.
Others
No response
I met a similar problem. My model is Qwen3-Coder-30B-A3B-Instruct, and I am doing DPO training on 8×H100 GPUs. Training gets stuck at step 0 and shows an NCCL timeout.