resume_from_checkpoint is not working with Deepspeed
System Info
- `transformers` version: 4.26.1
- Platform: Linux-5.4.0-139-generic-x86_64-with-glibc2.17
- Python version: 3.8.16
- Huggingface_hub version: 0.12.1
- PyTorch version (GPU?): 1.13.1 (True)
- Tensorflow version (GPU?): 2.7.0 (True)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: true
- Using distributed or parallel set-up in script?: true
Who can help?
@stas00
Information
- [ ] The official example scripts
- [X] My own modified scripts
Tasks
- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [X] My own task or dataset (give details below)
Reproduction
- Trainer with DeepSpeed using ZeRO stage 2 or 3 (I think it does not matter)
- set `save_strategy = 'epoch'`, i.e., save every epoch
- you cannot use `resume_from_checkpoint` to resume the training procedure
- why? In `transformers/deepspeed.py` (L359), `deepspeed_engine.load_checkpoint` actually needs an argument called `tag`, or there must be a "latest" file in the checkpoint directory. However, neither is provided by the Trainer: it gives you no way to pass `tag`, and it does not store a "latest" file in the checkpoint directory (see the sketch below)
- related: `deepspeed/runtime/engine.py` (L2712)
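To make that concrete, here is a hedged sketch of what the resume path roughly needs, based on DeepSpeed's documented behaviour. The function and variable names are illustrative, not copied from `transformers/deepspeed.py`:

```python
# Illustrative sketch only -- not the actual transformers integration code.
# It shows why resuming fails when neither a `tag` nor a "latest" file is present.
import os

def resume_deepspeed(deepspeed_engine, checkpoint_dir, tag=None):
    if tag is None:
        # DeepSpeed falls back to the "latest" file written by save_checkpoint
        latest_path = os.path.join(checkpoint_dir, "latest")
        if not os.path.isfile(latest_path):
            raise ValueError(
                f"no 'latest' file in {checkpoint_dir} and no tag was passed -- "
                "this is the failure described above"
            )
        with open(latest_path) as f:
            tag = f.read().strip()  # e.g. "global_step500"

    # the engine call itself; load_checkpoint returns (load_path, client_state)
    load_path, _ = deepspeed_engine.load_checkpoint(
        checkpoint_dir,
        tag=tag,
        load_optimizer_states=True,
        load_lr_scheduler_states=True,
    )
    return load_path
```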
Expected behavior
It should resume training correctly when `resume_from_checkpoint` is passed.
Hi @Raibows, you haven't given me a reproduction, so there is nothing I can do here as I have no idea what you did.
There is no need for `tag`; DeepSpeed's `save_checkpoint` creates a `latest` file and uses it to find the checkpoint to resume from.
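A quick way to check that on disk (illustrative paths; the `latest` file simply contains the name of the `global_stepXXX` sub-directory):

```python
# Sanity-check sketch: after a successful save_checkpoint on all ranks, the checkpoint
# folder should contain a "latest" file pointing at an existing global_stepXXX directory.
import os

ckpt_dir = "output_dir/checkpoint-500"  # hypothetical path
with open(os.path.join(ckpt_dir, "latest")) as f:
    tag = f.read().strip()
print(tag, os.path.isdir(os.path.join(ckpt_dir, tag)))  # e.g. "global_step500 True"
```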
I can send you to a test that validates the resume works - give it a try:
https://github.com/huggingface/transformers/blob/f7329751fe5c43365751951502c00df5a4654359/tests/deepspeed/test_deepspeed.py#L636-L691
To run this test do:
RUN_SLOW=1 pytest tests/deepspeed/test_deepspeed.py -k test_can_resume_training_normal
Or is it something specific to `save_strategy = 'epoch'`? I have only used the default strategy - can you change my test to reproduce your issue?
Hi, thanks for your reply.
But actually I don't have any "latest" file in the output_dir. Here is the screenshot:
And in every checkpoint-xxx directory, we have: (screenshot)
In the global_stepXXX directory, we have: (screenshot)

If I pass `resume_from_checkpoint = output_dir/checkpoint-xxx`, it will throw the error I mentioned.
Thanks for your test scripts. I will try it later.
I totally believe you that this is the case, but I don't have access to your computer, so if there is a bug I need to be able to reproduce it - which means that ideally you'd send a small script that shows the problem.
As I suggested perhaps you could adapt the test I sent to you to your particular situation and use it as the reproduction that demonstrates the problem.
Hi, sorry for the late response. I tested many times and found it very weird. Now the `latest` file exists.
But the `zero_pp_rank_x_mp_rank_00_optim_states.pt` files have problems when being saved.
I have posted a gist in https://gist.github.com/Raibows/73c3a6105c0226669910d5608f5efb4e
If you set the number of training samples very low, so that `save_checkpoint` runs very soon after the script starts, all the checkpoints are saved correctly.
However, if you let it run for a longer time, it only saves the single file `zero_pp_rank_0_mp_rank_00_optim_states.pt` and none of the others (`zero_pp_rank_1_mp_rank_00_optim_states.pt`, `zero_pp_rank_2_mp_rank_00_optim_states.pt`, ...) that should be saved. This causes a fatal error when you try to resume from such a checkpoint.
Comment out L59 in the gist and run it with
torchrun --nproc_per_node 4 test_save.py
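For readers without access to the gist, a much-reduced sketch of that kind of script looks roughly like this. It is not the gist itself; the tiny model name, the `ToyDataset`, and the `ds_config.json` (a ZeRO stage 2/3 config assumed to sit next to the script) are placeholder assumptions:

```python
# test_save.py -- hedged, minimal sketch of a Trainer + DeepSpeed run that saves every
# epoch, so the per-rank zero_pp_rank_*_optim_states.pt files can be inspected afterwards.
import torch
from torch.utils.data import Dataset
from transformers import AutoModelForCausalLM, Trainer, TrainingArguments

class ToyDataset(Dataset):
    """Random token ids; raise `n` to make the run last longer before the first save."""
    def __init__(self, n=512, seq_len=32, vocab=100):
        self.data = torch.randint(0, vocab, (n, seq_len))
    def __len__(self):
        return len(self.data)
    def __getitem__(self, i):
        return {"input_ids": self.data[i], "labels": self.data[i]}

model = AutoModelForCausalLM.from_pretrained("sshleifer/tiny-gpt2")  # placeholder tiny model

args = TrainingArguments(
    output_dir="ds_resume_test",     # fixed, non-time-based path (see the end of this thread)
    per_device_train_batch_size=8,
    num_train_epochs=2,
    save_strategy="epoch",           # the setting this issue is about
    deepspeed="ds_config.json",      # assumed ZeRO stage 2/3 config next to the script
)

Trainer(model=model, args=args, train_dataset=ToyDataset()).train()
# resume later with:
# Trainer(model=model, args=args, train_dataset=ToyDataset()).train(
#     resume_from_checkpoint="ds_resume_test/checkpoint-XXX")
```

Launched with the same `torchrun` command, the `checkpoint-*/global_stepXXX/` folders can then be compared between a short and a long run to see whether the optimizer shards of all ranks were written.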
I'm not sure if you had the same issue, but when I tried to resume a DeepSpeed run, it would try to load the right checkpoint but fail to find a pytorch_model.bin file. So I just ran the zero_to_fp32.py script to create the checkpoint, and resuming with DeepSpeed just worked: it loaded the optimizer states / model states from the global_stepXXX/ folder.
I'm on transformers version 4.27.1
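For reference, a hedged sketch of that workaround using the conversion helper that ships with DeepSpeed (the checkpoint path below is a placeholder; running the `zero_to_fp32.py` script inside the checkpoint folder achieves the same thing):

```python
# Sketch of the workaround described above: build a consolidated fp32 pytorch_model.bin
# from the ZeRO shards so the Trainer can find the file it expects.
from deepspeed.utils.zero_to_fp32 import convert_zero_checkpoint_to_fp32_state_dict

checkpoint_dir = "output_dir/checkpoint-500"   # contains global_stepXXX/ and the "latest" file
convert_zero_checkpoint_to_fp32_state_dict(
    checkpoint_dir,
    f"{checkpoint_dir}/pytorch_model.bin",     # the file that was missing on resume
)
```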
@Raibows, thank you for providing an easy-to-use repro - you can use `model_name = 'patrickvonplaten/t5-tiny-random'` while debugging this, as it'd be much faster and not require many resources.
I did run it for a bit and had no problems on 2 gpus.
As we are only integrating DeepSpeed and the call to `save_checkpoint` is, I think, done correctly - you will probably have better luck asking directly at https://github.com/microsoft/DeepSpeed/issues while providing your repro script.
You can validate that the integration is calling it on all ranks:
https://github.com/huggingface/transformers/blob/60d51ef5123d949fd8c59cd4d3254e711541d278/src/transformers/trainer.py#L2297-L2300
If you'd like to debug this yourself, I'd add a debug print that includes the rank (`self.args.local_rank`), so that you can see that each rank calls this DeepSpeed method. If it gets called on all ranks for each save, then you definitely have to take it up with the DeepSpeed team. If it doesn't - which I doubt, but who knows - do get back to me.
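A minimal version of that debug print could look like the following (meant to be pasted next to the `save_checkpoint` call linked above; the surrounding Trainer code may differ slightly between versions):

```python
# Inside Trainer._save_checkpoint, right before the DeepSpeed save call: tag the message
# with the local rank so you can confirm every rank reaches this point on every save.
print(
    f"[rank {self.args.local_rank}] calling deepspeed save_checkpoint -> {output_dir}",
    flush=True,  # flush so interleaved multi-process output survives a crash
)
# save call as in the linked integration code (attribute name may vary by version)
self.deepspeed.save_checkpoint(output_dir)
```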
Honestly, I have seen some reports in the past where users had some weird filesystem issues where files would not appear. Is it your personal computer that you're running this on, or some particular cloud?
@stas00 Hi, thank you so much for your help!
I finally found the reason. It's my own code's fault: I use a time-based path for the output directory, but since the script is launched with distributed launch, each process ends up with a slightly different output directory path.
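For anyone hitting the same symptom, the pattern and one possible fix look roughly like this (hedged sketch; the directory names are illustrative):

```python
# Sketch of the bug described above and one way to avoid it: with a per-process timestamp,
# ranks that reach this line a moment apart disagree on where the checkpoint lives, so the
# per-rank ZeRO shards end up scattered across several directories.
import os
import time
import torch.distributed as dist

# buggy: every rank computes its own timestamped path
# output_dir = os.path.join("runs", time.strftime("%Y%m%d-%H%M%S"))

# one fix: let rank 0 pick the name and broadcast it to the other ranks
if dist.is_available() and dist.is_initialized():
    name = [time.strftime("%Y%m%d-%H%M%S")] if dist.get_rank() == 0 else [None]
    dist.broadcast_object_list(name, src=0)
    output_dir = os.path.join("runs", name[0])
else:
    output_dir = os.path.join("runs", time.strftime("%Y%m%d-%H%M%S"))
```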
I'm going to close this issue. Thanks!
Glad you figured it out, @Raibows!
That's why we have unit tests that help us know whether the feature is working correctly; when it doesn't work for a user, it often has to do with some peculiarity of the user's code.