dlrover
Why is model_optim_rng.pt saved in a separate directory?
Megatron-LM saves both model_optim_rng.pt and distrib_optim.pt in a directory named mp_rank_xx_xxx. In dlrover, however, distrib_optim.pt is split out and saved in a separate directory named rank_xxxx.
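To make the difference concrete, here is a minimal sketch that reports which of the two layouts a checkpoint directory uses. It only looks at where distrib_optim.pt sits. The function name `detect_layout`, the labels it returns, and the assumption that the `mp_rank_` and `rank_` directories are direct children of the directory you pass in are all mine, not taken from either codebase.

```python
from pathlib import Path


def detect_layout(ckpt_dir):
    """Guess which layout stores distrib_optim.pt under ckpt_dir.

    Hypothetical helper: it relies only on the directory-name prefixes
    described in this issue (mp_rank_ for Megatron-LM, rank_ for dlrover).
    """
    layouts = set()
    # Look one level down for the distributed optimizer shard file.
    for path in Path(ckpt_dir).glob("*/distrib_optim.pt"):
        parent = path.parent.name
        if parent.startswith("mp_rank_"):
            layouts.add("megatron")
        elif parent.startswith("rank_"):
            layouts.add("dlrover")
    if not layouts:
        return "unknown"
    if len(layouts) > 1:
        return "mixed"
    return layouts.pop()
```

Calling `detect_layout` on the directory that holds the per-rank subdirectories would return "megatron" for a checkpoint written by Megatron-LM and "dlrover" for one written by dlrover, under the assumptions above.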
This works when checkpoints are both saved and loaded with dlrover, but loading fails when a checkpoint is saved with Megatron-LM and then loaded with dlrover. So I am curious why it was designed this way. Thanks @workingloong