Qwen3-30B-A3B: Checkpoint Save Failures on Large-Scale GPU Configurations (H100) and Small-Scale GB200 Systems
Describe the bug
Checkpoint saving currently fails for Qwen3-30B-A3B at larger GPU counts under certain parallelism configurations. For example:
64 nodes (512 GPUs):
TP=4, CP=8, PP=1, ETP=1, EMP=8
TP=4, CP=4, PP=1, ETP=1, EMP=8
32 nodes (256 GPUs):
TP=4, CP=16, PP=1, ETP=1, EMP=4
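For reference, the data-parallel size implied by each configuration above can be worked out with a short sketch. This assumes the usual Megatron-style relation for the dense layers, DP = world_size / (TP * CP * PP); the function name `data_parallel_size` is hypothetical and is not part of NeMo RL or Megatron-LM. The expert settings (ETP, EMP) are not modeled here.

```python
def data_parallel_size(world_size: int, tp: int, cp: int, pp: int) -> int:
    """Return the dense data-parallel size, assuming DP = world_size / (TP * CP * PP)."""
    model_parallel = tp * cp * pp
    if world_size % model_parallel != 0:
        raise ValueError(
            f"world_size={world_size} is not divisible by TP*CP*PP={model_parallel}"
        )
    return world_size // model_parallel


# (GPUs, TP, CP, PP) for the three failing configurations listed in this issue.
configs = [
    (512, 4, 8, 1),
    (512, 4, 4, 1),
    (256, 4, 16, 1),
]

for gpus, tp, cp, pp in configs:
    dp = data_parallel_size(gpus, tp, cp, pp)
    print(f"GPUs={gpus} TP={tp} CP={cp} PP={pp} -> DP={dp}")
```

Under that assumption the three configurations give DP=16, DP=32, and DP=4 respectively.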
What we found:
- An NCCL collective timeout when scaling up to 64 nodes.
- Save time increases with the number of nodes; saving the 30B model takes ~10 minutes.
Update: Checkpoint saving also fails on GB200, even with a small number of nodes, e.g., 4.
@ZhiyuLi-Nvidia any new updates on the NCCL failure at 64 nodes? Do you think https://github.com/NVIDIA-NeMo/RL/issues/1208#issuecomment-3349433766 can help reduce the checkpointing time, and do you think the NCCL timeout is due to a functional bug?
@terrykong we may need another person to look into the checkpointing failure on GB200.
I expect this commit to work around the checkpoint saving issue on GB200:
- submodule commit: https://github.com/terrykong/Megatron-LM/compare/af73aa2cebf94a0bee5ea6dda2614ad989faffae...ec8eb1517bdfc25dabd6c32ed27f5275b8f5b2af
I simply changed from multi-threaded async file writing to single-threaded writing, since there seems to be an issue with multi-threaded writing on arm64 systems.
Also verified by @wedu-nvidia. cc @guyueh1 @terrykong.
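To illustrate the kind of switch described above (multi-threaded versus single-threaded file writing), here is a minimal, generic sketch. It is not the Megatron-LM checkpoint writer and does not reproduce the linked commit; `write_shard`, `write_shards`, and the `num_threads` parameter are hypothetical names used only for illustration.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def write_shard(path: str, payload: bytes) -> int:
    """Write one shard to disk, flush it, and return the number of bytes written."""
    with open(path, "wb") as f:
        f.write(payload)
        f.flush()
        os.fsync(f.fileno())
    return len(payload)


def write_shards(out_dir: str, shards: dict, num_threads: int = 1) -> int:
    """Write all shards; use a thread pool only when num_threads > 1."""
    items = [(os.path.join(out_dir, name), data) for name, data in shards.items()]
    if num_threads <= 1:
        # Single-threaded path: slower, but avoids any concurrent-writer issues.
        return sum(write_shard(path, data) for path, data in items)
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        return sum(pool.map(lambda item: write_shard(*item), items))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        total = write_shards(tmp, {"a.bin": b"abc", "b.bin": b"defgh"}, num_threads=1)
        print(f"wrote {total} bytes")
```

Keeping the thread count as a parameter lets the single-threaded path serve as a fallback on platforms where concurrent writing misbehaves, at the cost of slower saves.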
@ZhiyuLi-Nvidia is the mcore change gonna be a PR or already merged?
@guyueh1 For GB200, the above change is just a temporary workaround, and single-threaded writing is very slow. As a next step, we need to try a new Arm-based image.
Regarding the NCCL collective timeout when scaling up to 64 nodes: I expect we will need help from experts in mcore checkpoint saving. Do you know who to turn to for help? @guyueh1 @terrykong
I have now synced up with @yaoyu-33 on next steps.