Jonah Noh
I am also seeing inconsistent loss values when resuming from a checkpoint in mcore-v0.12. This happens with Megatron's distributed checkpointing format, `--ckpt-format=torch_dist`. If I set `--ckpt-format=torch`, I am able to resume...