RL Qwen3-30B-A3B: Checkpoint Save Failures on Large-Scale GPU Configurations (H100) and Small-Scale GB200 Systems

Describe the bug

Currently, checkpoint saving fails for Qwen3-30B-A3B at larger GPU counts under certain parallelism configurations. For example:

64 nodes (512 GPUs):

TP=4, CP=8, PP=1, ETP=1, EMP=8

TP=4, CP=4, PP=1, ETP=1, EMP=8

32 nodes (256 GPUs):

TP=4, CP=16, PP=1, ETP=1, EMP=4
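For reference, here is a rough sketch of the group sizes these configurations imply. It assumes the usual Megatron-Core convention that world size = TP x CP x PP x DP for dense layers and ETP x EP x PP x expert-DP for MoE layers, and it reads "EMP" as the expert model parallel size. The function name `parallel_layout` is hypothetical and is not part of NeMo RL or Megatron-LM.

```python
def parallel_layout(world_size: int, tp: int, cp: int, pp: int, etp: int, ep: int) -> dict:
    """Derive data-parallel sizes from a parallelism config.

    Assumption: dense layers split the world as TP * CP * PP * DP, and
    expert layers split it as ETP * EP * PP * expert_DP.
    """
    dense_model_parallel = tp * cp * pp
    expert_model_parallel = etp * ep * pp
    if world_size % dense_model_parallel or world_size % expert_model_parallel:
        raise ValueError("world_size is not divisible by the model-parallel sizes")
    return {
        "dp": world_size // dense_model_parallel,
        "expert_dp": world_size // expert_model_parallel,
    }


# The three failing configurations from this issue.
configs = [
    dict(world_size=512, tp=4, cp=8, pp=1, etp=1, ep=8),
    dict(world_size=512, tp=4, cp=4, pp=1, etp=1, ep=8),
    dict(world_size=256, tp=4, cp=16, pp=1, etp=1, ep=4),
]
for cfg in configs:
    print(cfg, "->", parallel_layout(**cfg))
```

Under these assumptions, all three failing configurations end up with an expert data-parallel size of 64, while the dense data-parallel size ranges from 4 to 32.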

Oct 06 '25 18:10 wedu-nvidia

What we found:

An NCCL collective timeout occurs when scaling up to 64 nodes.
Save time increases with the number of nodes, and it takes ~10 minutes to save the 30B model.

Oct 07 '25 00:10 ZhiyuLi-Nvidia

Update: The checkpoint save also fails on GB200, even with a small number of nodes, e.g., 4.

Oct 11 '25 16:10 wedu-nvidia

@ZhiyuLi-Nvidia any updates on the NCCL failure at 64 nodes? Do you think https://github.com/NVIDIA-NeMo/RL/issues/1208#issuecomment-3349433766 can help reduce the checkpointing time, and do you think the NCCL timeout is due to a functional bug?

@terrykong we may need another person to look into checkpointing failure on GB200

Oct 13 '25 21:10 guyueh1

> Update: The checkpoint save will also fail on GB200 even with small number of nodes, e.g., 4.

> we may need another person to look into checkpointing failure on GB200

I expect this commit to work around the checkpoint saving issue on GB200.

submodule commit: https://github.com/terrykong/Megatron-LM/compare/af73aa2cebf94a0bee5ea6dda2614ad989faffae...ec8eb1517bdfc25dabd6c32ed27f5275b8f5b2af

I simply changed the multi-threaded async file writing to single-threaded writing, since there seems to be an issue with multi-threaded writing on arm64 systems.

Also verified by @wedu-nvidia. cc @guyueh1 @terrykong.
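For readers unfamiliar with the trade-off, here is a minimal, illustrative sketch of the general idea. This is not the actual Megatron-LM change; `write_shard`, `save_shards`, and `num_writers` are hypothetical names. With `num_writers=1`, shard writes are fully serialized, which avoids concurrent writers at the cost of throughput.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def write_shard(path: str, payload: bytes) -> int:
    """Write one shard to disk, fsync it, and return the bytes written."""
    with open(path, "wb") as f:
        written = f.write(payload)
        f.flush()
        os.fsync(f.fileno())
    return written


def save_shards(shards: dict[str, bytes], out_dir: str, num_writers: int = 1) -> int:
    """Write all shards using `num_writers` threads (1 means fully serialized)."""
    os.makedirs(out_dir, exist_ok=True)
    with ThreadPoolExecutor(max_workers=num_writers) as pool:
        futures = [
            pool.submit(write_shard, os.path.join(out_dir, name), data)
            for name, data in shards.items()
        ]
        # .result() re-raises any writer exception instead of silently dropping it.
        return sum(f.result() for f in futures)


if __name__ == "__main__":
    demo_shards = {f"shard_{i}.bin": bytes([i]) * 1024 for i in range(4)}
    with tempfile.TemporaryDirectory() as tmp:
        total = save_shards(demo_shards, tmp, num_writers=1)
        print(f"wrote {total} bytes with a single writer thread")
```

Raising `num_writers` restores parallel writes, so the same code path can be used to compare the two modes on a given system.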

Oct 28 '25 01:10 ZhiyuLi-Nvidia

@ZhiyuLi-Nvidia is the mcore change gonna be a PR or already merged?

Nov 10 '25 16:11 guyueh1

> @ZhiyuLi-Nvidia is the mcore change gonna be a PR or already merged?

@guyueh1 For GB200, the above change is just a temporary workaround, and single-threaded writing is very slow. As a next step, we need to try our luck with a new Arm-based image.

> a nccl collective timeout when scaling up to 64 nodes

I'd expect we need some help from experts in mcore checkpoint saving. Do you know who to turn to for help? @guyueh1 @terrykong

Nov 10 '25 18:11 ZhiyuLi-Nvidia

> a nccl collective timeout when scaling up to 64 nodes

> I'd expect we need some help from experts in mcore checkpoint saving.

Now synced up with @yaoyu-33 for next steps.

Nov 10 '25 19:11 ZhiyuLi-Nvidia