Qwen3-Next-80B-A3B-Thinking hangs during multi-node SFT training with Ray (NCCL timeout on InfiniBand)
Reminder
- [x] I have read the above rules and searched the existing issues.
System Info
When training Qwen3-Next-80B-A3B-Thinking across multiple nodes using Ray, the following error occurs:
```
(RayTrainWorker pid=364, ip=172.44.17.165) [rank19]:
[E1016 06:52:29.641158254 ProcessGroupNCCL.cpp:1895]
[PG ID 0 PG GUID 0(default_pg) Rank 19] Process group watchdog thread terminated with exception:
[Rank 19] Watchdog caught collective operation timeout:
WorkNCCL(SeqNum=298389, OpType=_ALLGATHER_BASE, NumelIn=32768, NumelOut=1048576, Timeout(ms)=1800000)
ran for 1800036 milliseconds before timing out.
[repeated 2x across cluster]
```
According to GPT, this is an NCCL collective-operation timeout, and the communication appears to be going over sockets (NET/Socket) rather than InfiniBand.
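If the fallback to NET/Socket is real, one way to verify and correct it is to export the relevant NCCL environment variables to every Ray worker before the process group is created. The sketch below only illustrates the mechanism; `mlx5_0` and `eth0` are placeholder device names, and where exactly `ray.init` is called depends on how the cluster is launched.

```python
import ray

# Hypothetical NCCL settings to force the InfiniBand transport and log the
# transport that is actually chosen. Replace the device names with the ones
# reported by `ibv_devinfo` / `ip link` on your nodes.
nccl_env = {
    "NCCL_DEBUG": "INFO",          # each rank logs the selected transport at startup
    "NCCL_IB_DISABLE": "0",        # allow the InfiniBand transport
    "NCCL_IB_HCA": "mlx5_0",       # placeholder HCA name
    "NCCL_SOCKET_IFNAME": "eth0",  # placeholder interface for bootstrap traffic
}

# Propagate the variables to every Ray worker via the runtime environment.
ray.init(address="auto", runtime_env={"env_vars": nccl_env})
```

With `NCCL_DEBUG=INFO` enabled, the startup logs show whether NCCL picked InfiniBand or NET/Socket, which would settle the question above.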
However, the issue seems specific to Qwen3-Next-80B-A3B-Thinking — other Qwen3 models do not show this problem under the same setup.
The model loads successfully and training starts, but the progress bar freezes right after initialization:
```
(RayTrainWorker pid=9940, ip=172.44.17.164)
[INFO|trainer.py:2528] >> Number of trainable parameters = 79,674,391,296
0%| | 0/84 [00:00<?, ?it/s]
```
After a short while, training is interrupted with the NCCL timeout error above.
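Since the watchdog default is 1800 s, raising the collective timeout is a quick way to distinguish a genuine deadlock from a first all-gather that is merely slow over a socket fallback. A minimal sketch, assuming the process group is initialized manually; if your launcher exposes HuggingFace's `ddp_timeout` training argument, that is the equivalent knob:

```python
from datetime import timedelta
import torch.distributed as dist

# Raise the NCCL collective timeout from the default 1800 s to 2 h.
# If training then proceeds, the first all_gather was slow rather than hung;
# if it still times out, the hang is a genuine deadlock.
dist.init_process_group(backend="nccl", timeout=timedelta(seconds=7200))
```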
The issue persists across different cluster sizes:
- 2 servers (16×80 GB GPUs in total)
- 4 servers (32×80 GB GPUs in total)
Therefore, it’s unlikely to be an OOM (Out-of-Memory) issue.
The problem seems inherent to this specific model version.
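To see where this particular model actually hangs, it helps to dump the Python stack of every rank while the progress bar is frozen. A minimal sketch using the standard-library `faulthandler`; the log path and the 600 s interval are arbitrary choices, and the call would need to be added early in each training worker:

```python
import faulthandler
import os

# Periodically dump every thread's Python stack so a hung rank reveals the call
# it is blocked in (e.g. a collective inside the Qwen3-Next forward pass).
rank = os.environ.get("RANK", "0")
log_file = open(f"/tmp/stacks_rank{rank}.log", "w")  # placeholder path
faulthandler.dump_traceback_later(timeout=600, repeat=True, file=log_file)
```

Alternatively, `py-spy dump` against the hung Ray worker PIDs gives the same information without modifying any code.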
Reproduction
Put your message here.
Others
No response
I met a similar problem. My model is Qwen3-Coder-30B-A3B-Instruct, and I am doing DPO training on 8×H100 GPUs. Training gets stuck at step 0 and shows an NCCL timeout.