Baichuan-7B

[Question] DeepSpeed ZeRO-3 save_checkpoint() produces empty model_states files

Open mynewstart opened this issue 1 year ago • 3 comments

Required prerequisites

Questions

Hi, I used the code to continue pretraining the model, with ZeRO-3 for training. However, my checkpoint files zero_pp_rank_*_mp_rank_00_model_states.pt are effectively empty: they contain only the parameter names and shapes, not the weights. Has anyone run into this problem, and how can it be fixed?

Thanks!

Checklist

  • [X] I have provided all relevant and necessary information above.
  • [X] I have chosen a suitable title for this issue.

mynewstart · Sep 11 '23 05:09

I ran into the same problem. My workaround was to use DeepSpeed ZeRO-2 instead of ZeRO-3.

hmtbgc · Sep 16 '23 09:09
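
For readers who want to try the ZeRO-2 workaround above, here is a minimal sketch of switching the stage in a DeepSpeed config dict. The baseline config and the helper name `with_zero_stage` are hypothetical and not taken from this repository; only the `zero_optimization.stage` key is the point. Note that ZeRO-2 does not partition parameters, so each rank needs enough memory for the full model weights.

```python
import copy

# Hypothetical baseline config; only the "zero_optimization" entry matters here.
base_config = {
    "train_micro_batch_size_per_gpu": 1,
    "bf16": {"enabled": True},
    "zero_optimization": {"stage": 3},
}


def with_zero_stage(config, stage):
    """Return a copy of a DeepSpeed config dict with a different ZeRO stage."""
    new_config = copy.deepcopy(config)  # leave the caller's config untouched
    new_config.setdefault("zero_optimization", {})["stage"] = stage
    return new_config


# Same settings as before, but with ZeRO-2 instead of ZeRO-3.
zero2_config = with_zero_stage(base_config, 2)
```

The resulting dict can be written out as JSON or passed wherever the training script expects its DeepSpeed config.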

My solution was to save the checkpoints myself. Alternatively, you can use zero_to_fp32.

mynewstart · Sep 19 '23 06:09
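
A minimal sketch of the zero_to_fp32 route mentioned above, assuming a DeepSpeed version that ships `deepspeed.utils.zero_to_fp32.get_fp32_state_dict_from_zero_checkpoint`. The helper names `resolve_tag` and `consolidate_to_fp32` are hypothetical. The sketch also assumes the checkpoint directory contains the `latest` file that `save_checkpoint()` normally writes, and that the machine has enough CPU memory to hold the full fp32 weights.

```python
from pathlib import Path


def resolve_tag(checkpoint_dir, tag=None):
    """Use the explicit tag if given, else read it from the 'latest' file."""
    if tag is not None:
        return tag
    latest = Path(checkpoint_dir) / "latest"
    if not latest.is_file():
        raise FileNotFoundError(
            f"no 'latest' file in {checkpoint_dir}; pass tag explicitly"
        )
    return latest.read_text().strip()


def consolidate_to_fp32(checkpoint_dir, output_file, tag=None):
    """Merge the ZeRO shards into one fp32 state dict and save it."""
    # Imported lazily so resolve_tag() works without DeepSpeed or PyTorch.
    import torch
    from deepspeed.utils.zero_to_fp32 import (
        get_fp32_state_dict_from_zero_checkpoint,
    )

    tag = resolve_tag(checkpoint_dir, tag)
    state_dict = get_fp32_state_dict_from_zero_checkpoint(checkpoint_dir, tag=tag)
    torch.save(state_dict, output_file)
    return state_dict
```

A call might look like `consolidate_to_fp32("path/to/checkpoints", "pytorch_model.bin")`, where both paths are placeholders.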

> My solution was to save the checkpoints myself. Alternatively, you can use zero_to_fp32.

@mynewstart I found that my converted checkpoint global_step_xxx contains meaningful *optim_states.pt files but only empty *model_states.pt files. Any clues on this? Thanks.

haorannlp · Mar 12 '24 15:03
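
One possible explanation, offered as an assumption rather than a confirmed answer: DeepSpeed's zero_to_fp32 script appears to rebuild the fp32 weights under ZeRO-3 from the *optim_states.pt shards, using *model_states.pt mainly for parameter names and shapes. If so, small model_states files would be expected rather than a sign of a failed save. The stdlib-only sketch below (the helper name `summarize_zero_shards` is hypothetical) lists shard sizes in a tag directory as a quick sanity check.

```python
from pathlib import Path


def summarize_zero_shards(tag_dir):
    """Return {kind: [(file name, size in bytes), ...]} for one tag directory."""
    summary = {"model_states": [], "optim_states": []}
    for path in sorted(Path(tag_dir).glob("*.pt")):
        for kind in summary:
            # Matches e.g. zero_pp_rank_0_mp_rank_00_model_states.pt
            if path.name.endswith(f"_{kind}.pt"):
                summary[kind].append((path.name, path.stat().st_size))
    return summary
```

If the optim_states shards are large and the model_states shards are small, the checkpoint is at least consistent with this explanation, and loading the consolidated weights is the more reliable check.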