resume_from_checkpoint is not working with Deepspeed
System Info
- `transformers` version: 4.26.1
- Platform: Linux-5.4.0-139-generic-x86_64-with-glibc2.17
- Python version: 3.8.16
- Huggingface_hub version: 0.12.1
- PyTorch version (GPU?): 1.13.1 (True)
- Tensorflow version (GPU?): 2.7.0 (True)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: true
- Using distributed or parallel set-up in script?: true
Who can help?
@stas00
Information
- [ ] The official example scripts
- [X] My own modified scripts
Tasks
- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [X] My own task or dataset (give details below)
Reproduction
- Trainer with DeepSpeed using ZeRO stage 2 or 3 (I think it does not matter)
- set `save_strategy = 'epoch'`, i.e., save every epoch
- you cannot use `resume_from_checkpoint` to resume the training procedure
- why? In `transformers/deepspeed.py` (L359), `deepspeed_engine.load_checkpoint` actually needs an argument called `tag`, or there must be a "latest" file in the checkpoint directory. However, neither is provided by the Trainer: it gives you no way to pass `tag`, and it does not store a "latest" file in the checkpoint directory (see the sketch below)
- related: `deepspeed/runtime/engine.py` (L2712)
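To make that concrete, here is a hedged sketch of what the resume path roughly needs, based on DeepSpeed's documented behaviour. The function and variable names are illustrative, not copied from `transformers/deepspeed.py`:

```python
# Illustrative sketch only -- not the actual transformers integration code.
# It shows why resuming fails when neither a `tag` nor a "latest" file is present.
import os

def resume_deepspeed(deepspeed_engine, checkpoint_dir, tag=None):
    if tag is None:
        # DeepSpeed falls back to the "latest" file written by save_checkpoint
        latest_path = os.path.join(checkpoint_dir, "latest")
        if not os.path.isfile(latest_path):
            raise ValueError(
                f"no 'latest' file in {checkpoint_dir} and no tag was passed -- "
                "this is the failure described above"
            )
        with open(latest_path) as f:
            tag = f.read().strip()  # e.g. "global_step500"

    # the engine call itself; load_checkpoint returns (load_path, client_state)
    load_path, _ = deepspeed_engine.load_checkpoint(
        checkpoint_dir,
        tag=tag,
        load_optimizer_states=True,
        load_lr_scheduler_states=True,
    )
    return load_path
```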
Expected behavior
It should resume training correctly when `resume_from_checkpoint` is passed.
Hi @Raibows, you haven't given me a reproduction, so there is nothing I can do here as I have no idea what you did.
There is no need for `tag`; DeepSpeed's `save_checkpoint` creates a `latest` file and uses it to find the checkpoint to resume from.
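A quick way to check that on disk (illustrative paths; the `latest` file simply contains the name of the `global_stepXXX` sub-directory):

```python
# Sanity-check sketch: after a successful save_checkpoint on all ranks, the checkpoint
# folder should contain a "latest" file pointing at an existing global_stepXXX directory.
import os

ckpt_dir = "output_dir/checkpoint-500"  # hypothetical path
with open(os.path.join(ckpt_dir, "latest")) as f:
    tag = f.read().strip()
print(tag, os.path.isdir(os.path.join(ckpt_dir, tag)))  # e.g. "global_step500 True"
```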
I can send you to a test that validates the resume works - give it a try:
https://github.com/huggingface/transformers/blob/f7329751fe5c43365751951502c00df5a4654359/tests/deepspeed/test_deepspeed.py#L636-L691
To run this test do:
RUN_SLOW=1 pytest tests/deepspeed/test_deepspeed.py -k test_can_resume_training_normal
Or is it something specific to `save_strategy = 'epoch'`? I have only used the default strategy - can you change my test to reproduce your issue?
Hi, thanks for your reply.
But actually I don't have any "latest" file in the output_dir. Here is the screenshot:
And in every checkpoint-xxx directory, we have: (screenshot)
In the global_stepXXX directory, we have: (screenshot)

If I pass `resume_from_checkpoint = output_dir/checkpoint-xxx`, it will throw the error I mentioned.
Thanks for your test scripts. I will try it later.
I totally believe you that this is the case, but I don't have access to your computer, so if there is a bug I need to be able to reproduce it - which means that ideally you'd send a small script that shows the problem.
As I suggested perhaps you could adapt the test I sent to you to your particular situation and use it as the reproduction that demonstrates the problem.
Hi, sorry for the late response. I tested many times and found it very weird. Now the `latest` file exists.
But the `zero_pp_rank_x_mp_rank_00_optim_states.pt` files have problems when being saved.
I have posted a gist in https://gist.github.com/Raibows/73c3a6105c0226669910d5608f5efb4e
If you set the number of training samples very low, so that `save_checkpoint` runs very soon after the script starts, all the checkpoints are saved correctly.
However, if you let it run for a longer time, it only saves the single file `zero_pp_rank_0_mp_rank_00_optim_states.pt` and none of the others (`zero_pp_rank_1_mp_rank_00_optim_states.pt`, `zero_pp_rank_2_mp_rank_00_optim_states.pt`, ...) that should be saved. This causes a fatal error when you try to resume from such a checkpoint.
Comment out L59 in the gist and run it with
torchrun --nproc_per_node 4 test_save.py
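For readers without access to the gist, a much-reduced sketch of that kind of script looks roughly like this. It is not the gist itself; the tiny model name, the `ToyDataset`, and the `ds_config.json` (a ZeRO stage 2/3 config assumed to sit next to the script) are placeholder assumptions:

```python
# test_save.py -- hedged, minimal sketch of a Trainer + DeepSpeed run that saves every
# epoch, so the per-rank zero_pp_rank_*_optim_states.pt files can be inspected afterwards.
import torch
from torch.utils.data import Dataset
from transformers import AutoModelForCausalLM, Trainer, TrainingArguments

class ToyDataset(Dataset):
    """Random token ids; raise `n` to make the run last longer before the first save."""
    def __init__(self, n=512, seq_len=32, vocab=100):
        self.data = torch.randint(0, vocab, (n, seq_len))
    def __len__(self):
        return len(self.data)
    def __getitem__(self, i):
        return {"input_ids": self.data[i], "labels": self.data[i]}

model = AutoModelForCausalLM.from_pretrained("sshleifer/tiny-gpt2")  # placeholder tiny model

args = TrainingArguments(
    output_dir="ds_resume_test",     # fixed, non-time-based path (see the end of this thread)
    per_device_train_batch_size=8,
    num_train_epochs=2,
    save_strategy="epoch",           # the setting this issue is about
    deepspeed="ds_config.json",      # assumed ZeRO stage 2/3 config next to the script
)

Trainer(model=model, args=args, train_dataset=ToyDataset()).train()
# resume later with:
# Trainer(model=model, args=args, train_dataset=ToyDataset()).train(
#     resume_from_checkpoint="ds_resume_test/checkpoint-XXX")
```

Launched with the same `torchrun` command, the `checkpoint-*/global_stepXXX/` folders can then be compared between a short and a long run to see whether the optimizer shards of all ranks were written.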
I'm not sure if you had the same issue, but when I tried to resume a DeepSpeed run, it would try to load the right checkpoint but fail to find a pytorch_model.bin file. So I just ran the zero_to_fp32.py script to create the checkpoint, and resuming with DeepSpeed just worked: it loaded the optimizer states / model states from the global_stepXXX/ folder.
I'm on transformers version 4.27.1
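For reference, a hedged sketch of that workaround using the conversion helper that ships with DeepSpeed (the checkpoint path below is a placeholder; running the `zero_to_fp32.py` script inside the checkpoint folder achieves the same thing):

```python
# Sketch of the workaround described above: build a consolidated fp32 pytorch_model.bin
# from the ZeRO shards so the Trainer can find the file it expects.
from deepspeed.utils.zero_to_fp32 import convert_zero_checkpoint_to_fp32_state_dict

checkpoint_dir = "output_dir/checkpoint-500"   # contains global_stepXXX/ and the "latest" file
convert_zero_checkpoint_to_fp32_state_dict(
    checkpoint_dir,
    f"{checkpoint_dir}/pytorch_model.bin",     # the file that was missing on resume
)
```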
@Raibows, thank you for providing an easy-to-use repro - you can use `model_name = 'patrickvonplaten/t5-tiny-random'` while debugging this, as it'd be much faster and not require many resources.
I did run it for a bit and had no problems on 2 gpus.
As we are only integrating DeepSpeed and the call to `save_checkpoint` is, I think, done correctly - you will probably have better luck asking directly at https://github.com/microsoft/DeepSpeed/issues while providing your repro script.
You can validate that the integration is calling it on all ranks:
https://github.com/huggingface/transformers/blob/60d51ef5123d949fd8c59cd4d3254e711541d278/src/transformers/trainer.py#L2297-L2300
If you'd like to debug this yourself, I'd add a debug print that includes the rank (`self.args.local_rank`), so that you can see that each rank calls this DeepSpeed method. If it gets called on all ranks for each save, then you definitely have to take it up with the DeepSpeed team. If it doesn't - which I doubt, but who knows - do get back to me.
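A minimal version of that debug print could look like the following (meant to be pasted next to the `save_checkpoint` call linked above; the surrounding Trainer code may differ slightly between versions):

```python
# Inside Trainer._save_checkpoint, right before the DeepSpeed save call: tag the message
# with the local rank so you can confirm every rank reaches this point on every save.
print(
    f"[rank {self.args.local_rank}] calling deepspeed save_checkpoint -> {output_dir}",
    flush=True,  # flush so interleaved multi-process output survives a crash
)
# save call as in the linked integration code (attribute name may vary by version)
self.deepspeed.save_checkpoint(output_dir)
```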
Honestly, I have seen some reports in the past where users had some weird filesystem issues where files would not appear. Is it your personal computer that you're running this on, or some particular cloud?
@stas00 Hi, thank you so much for your help!
I finally found the reason. It's my own code's fault: I use a time-based path for the output directory, but since the script is launched with distributed launch, each process ends up with a slightly different output directory path.
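For anyone hitting the same symptom, the pattern and one possible fix look roughly like this (hedged sketch; the directory names are illustrative):

```python
# Sketch of the bug described above and one way to avoid it: with a per-process timestamp,
# ranks that reach this line a moment apart disagree on where the checkpoint lives, so the
# per-rank ZeRO shards end up scattered across several directories.
import os
import time
import torch.distributed as dist

# buggy: every rank computes its own timestamped path
# output_dir = os.path.join("runs", time.strftime("%Y%m%d-%H%M%S"))

# one fix: let rank 0 pick the name and broadcast it to the other ranks
if dist.is_available() and dist.is_initialized():
    name = [time.strftime("%Y%m%d-%H%M%S")] if dist.get_rank() == 0 else [None]
    dist.broadcast_object_list(name, src=0)
    output_dir = os.path.join("runs", name[0])
else:
    output_dir = os.path.join("runs", time.strftime("%Y%m%d-%H%M%S"))
```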
I'm going to close this issue. Thanks!
Glad you figured it out, @Raibows!
That's why we have unit tests that help us know whether the feature is working correctly; when it doesn't work for a user, it often has to do with some peculiarity of the user's code.