
Why is model_optim_rng.pt saved in a separate directory?

Open zhaoyang-star opened this issue 6 months ago • 7 comments

Megatron-LM saves model_optim_rng.pt and distrib_optim.pt together in a directory named mp_rank_xx_xxx. In dlrover, however, distrib_optim.pt is split out and saved in a directory named rank_xxxx.

This is fine as long as checkpoints are both saved and loaded with dlrover. But loading fails if the checkpoint was saved with Megatron-LM and then loaded with dlrover. So I am curious why it was designed this way. Thanks @workingloong
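
To check which layout a given checkpoint uses, here is a minimal sketch that reports the directories holding each of the two files. The function name `locate_checkpoint_files` and the `iteration_dir` argument are hypothetical and not part of dlrover or Megatron-LM. The sketch only walks the file system and assumes nothing about the exact directory names.

```python
from pathlib import Path

# File names as mentioned in this issue.
CHECKPOINT_FILES = ("model_optim_rng.pt", "distrib_optim.pt")


def locate_checkpoint_files(iteration_dir):
    """Map each checkpoint file name to the sorted names of the directories holding it.

    `iteration_dir` is a hypothetical argument: the directory of one saved
    checkpoint iteration. It is searched recursively.
    """
    found = {name: [] for name in CHECKPOINT_FILES}
    for path in Path(iteration_dir).rglob("*.pt"):
        if path.name in found:
            # Record only the immediate parent directory of each match.
            found[path.name].append(path.parent.name)
    return {name: sorted(dirs) for name, dirs in found.items()}
```

If both file names map to the same directories, the checkpoint uses the combined layout described above for Megatron-LM. If they map to different directories, it uses the split layout.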

zhaoyang-star, Aug 02 '24 07:08