Resume training from a checkpoint

Open · yukang2017 opened this issue on Feb 16, 2023 · 14 comments

Thanks for your great work. I implemented a resume option for training. It can be used as follows.

For example,

torchrun --nnodes=1 --nproc_per_node=8 train.py --model DiT-XL/2 --data-path /path/to/imagenet/train --resume results/000/checkpoints/0100000.pt

yukang2017 avatar Feb 16 '23 03:02 yukang2017
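
For readers finding this later, here is a minimal sketch of the kind of change such a resume option involves in train.py. The flag name --resume and the checkpoint keys "model", "ema", and "opt" follow the layout train.py already uses when saving checkpoints; treat this as an illustration, not the exact code from the pull request.

import torch

# Sketch: add a --resume flag to the existing argument parser.
parser.add_argument("--resume", type=str, default=None,
                    help="Path to a checkpoint written by train.py, e.g. results/000/checkpoints/0100000.pt")

# Later, after model, ema, and opt have been built and model is wrapped in DDP:
if args.resume is not None:
    checkpoint = torch.load(args.resume, map_location="cpu")
    model.module.load_state_dict(checkpoint["model"])  # unwrap DDP to load the plain state dict
    ema.load_state_dict(checkpoint["ema"])
    opt.load_state_dict(checkpoint["opt"])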

Hi @yukang2017!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g., your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at [email protected]. Thanks!

facebook-github-bot avatar Feb 16 '23 03:02 facebook-github-bot

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!

facebook-github-bot avatar Feb 16 '23 09:02 facebook-github-bot


Thank you very much for your resume changes. I encountered an error at line 203 in train.py: NameError: name '_ddp_dict' is not defined. What is _ddp_dict? Thank you!

Gongrunlin avatar Feb 20 '23 11:02 Gongrunlin

Hi,

It should be as follows:

def _ddp_dict(_dict):
    # Add the 'module.' prefix that the DDP-wrapped model expects for each key.
    new_dict = {}
    for k in _dict:
        new_dict['module.' + k] = _dict[k]
    return new_dict

yukang2017 avatar Feb 21 '23 14:02 yukang2017
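
To spell out how that helper gets used (a sketch, assuming the checkpoint layout train.py saves): the stored "model" state dict is taken from the unwrapped model, so its keys lack the "module." prefix that the DDP wrapper expects. _ddp_dict re-adds that prefix; loading through model.module is an equivalent alternative.

checkpoint = torch.load(args.resume, map_location="cpu")
model.load_state_dict(_ddp_dict(checkpoint["model"]))  # add the "module." prefix expected by DDP
# model.module.load_state_dict(checkpoint["model"])    # equivalent: load into the unwrapped model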

Thanks! I wish you a happy life and work!

Gongrunlin avatar Feb 22 '23 02:02 Gongrunlin

This looks really awesome! @yukang2017 I was doing something similar, but when I resume from a checkpoint, the loss does not continue going down from the point where it stopped. Do you observe a similar phenomenon?

NathanYanJing avatar Mar 19 '23 04:03 NathanYanJing

@yukang2017 this is great and a much-needed feature. I tried your modifications to resume from a checkpoint. The loss at the beginning was around 0.21, and after 1M iterations it was about 0.14. But upon restarting from the checkpoint with your modifications, the loss goes back to the starting value (0.21).

I believe there could be a bug that resets the loss value? I also checked that you save the optimizer state, so I am not sure what this is about.

It would be great if you could please investigate.

achen46 avatar Mar 21 '23 00:03 achen46

@yukang2017 I observe the exact same issue as you mentioned. The loss goes back up. I wonder if this may be due to the EMA weights?

achen46 avatar Mar 21 '23 14:03 achen46

@achen46 I am also wondering whether this is due to the EMA weights, but I thought the EMA weights had been stored, right?

NathanYanJing avatar Mar 21 '23 17:03 NathanYanJing

@NathanYanJing I believe so, since the saved checkpoint is quite large (~9 GB). But it could also be that we overwrite them again, hence the loss going back to its initial value.

It would be great to know @yukang2017's opinion.

achen46 avatar Mar 22 '23 14:03 achen46
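
A couple of things worth checking when the loss jumps back to its initial value after resuming: whether the model and optimizer states are actually applied (load_state_dict with strict=False can silently skip mismatched keys), and whether anything that runs after the load overwrites restored state, for example the update_ema(ema, model.module, decay=0) initialization in train.py will reset the EMA to the raw model weights if it runs after the checkpoint is loaded. Below is a hedged sketch of a resume path that restores everything, including the step counter; variable names are illustrative, and the tested fix is in PR #36 mentioned in the next comment.

import os
import torch

# Sketch of a fuller resume path (illustrative, not the exact code in PR #36).
train_steps = 0
if args.resume is not None:
    checkpoint = torch.load(args.resume, map_location="cpu")
    model.module.load_state_dict(checkpoint["model"])  # strict=True by default: raises on key mismatch
    ema.load_state_dict(checkpoint["ema"])             # keep the averaged weights; do not re-initialize EMA afterwards
    opt.load_state_dict(checkpoint["opt"])             # AdamW moment estimates
    # Continue the step counter so logging and checkpoint numbering pick up where they stopped,
    # e.g. by parsing it from a filename like 0100000.pt:
    train_steps = int(os.path.splitext(os.path.basename(args.resume))[0])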

@achen46 I’ve encountered the exact same issue you described earlier, and as a result, I’ve created a new pull request #36. I’ve tested it on my end, and it works for me. Hopefully, it will be helpful to you as well.

Littleor avatar Mar 23 '23 12:03 Littleor

Hi @Littleor thanks a lot. I will check it out and verify.

achen46 avatar Mar 29 '23 03:03 achen46