ZhiyuLi-Nvidia
> Update: > The checkpoint save will also fail on GB200 even with a small number of nodes, e.g., 4. > We may need another person to look into the checkpointing failure...
> @ZhiyuLi-Nvidia is the mcore change gonna be a PR or already merged? @guyueh1 For GB200, the above change is just a **temporary fix** to work around and it...
> > an NCCL collective timeout when scaling up to 64 nodes > > I'd expect we need some help from experts in mcore checkpoint saving. Now synced up with...
@nithinraok could you help take a look?
Thank you for your contribution! We fixed it by following your PR: https://github.com/NVIDIA/Megatron-LM/commit/a77a883e248e68df1912df4ef2cf05b712947fce Let us know what you think.