DiT
Resume training from a checkpoint
Thanks for your great work. I implemented a resume option for training. It can be used as follows.
For example,
torchrun --nnodes=1 --nproc_per_node=8 train.py --model DiT-XL/2 --data-path /path/to/imagenet/train --resume results/000/checkpoints/0100000.pt
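For anyone wondering how the flag hooks into train.py, here is a minimal sketch of the resume logic (not the exact patch). It assumes the checkpoint was saved with DiT's usual "model" / "ema" / "opt" keys, that `model`, `ema`, `opt`, and `args` are the objects already built in train.py, and that the global step can be recovered from a filename like 0100000.pt:

```python
# Minimal resume sketch (assumes DiT's checkpoint format with "model"/"ema"/"opt" keys
# and that os/torch are already imported in train.py).
if args.resume:
    checkpoint = torch.load(args.resume, map_location="cpu")
    model.module.load_state_dict(checkpoint["model"])  # model is DDP-wrapped, so load into .module
    ema.load_state_dict(checkpoint["ema"])             # EMA copy of the weights
    opt.load_state_dict(checkpoint["opt"])             # optimizer state (e.g. AdamW moments)
    # Recover the global step from a filename like 0100000.pt (one possible convention)
    train_steps = int(os.path.splitext(os.path.basename(args.resume))[0])
```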
Hi @yukang2017!
Thank you for your pull request and welcome to our community.
Action Required
In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.
Process
In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g., your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.
Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.
If you have received this in error or have any questions, please contact us at [email protected]. Thanks!
Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!
Thank you very much for your resume code. I encountered an error here. What is `_ddp_dict`? Thank you! At line 203 in train.py: NameError: name '_ddp_dict' is not defined
Hi,
It should be as follows.

```python
def _ddp_dict(_dict):
    # Prefix every key with "module." so a plain state dict can be
    # loaded into the DDP-wrapped model.
    new_dict = {}
    for k in _dict:
        new_dict['module.' + k] = _dict[k]
    return new_dict
```
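For context, a hedged usage sketch (the `"model"` checkpoint key and `args.resume` are assumptions based on how DiT saves checkpoints; `model` is the DDP-wrapped network from train.py): with the prefixed keys you can load the saved weights into the wrapper directly instead of going through `model.module`.

```python
# Hedged usage example: add the "module." prefix so a plain (unwrapped) state
# dict can be loaded into the DDP-wrapped model.
checkpoint = torch.load(args.resume, map_location="cpu")
model.load_state_dict(_ddp_dict(checkpoint["model"]))
```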
Thanks! Wishing you all the best in life and work!
This looks really awesome! @yukang2017 I was doing something similar, but when I resume from a checkpoint, it seems the loss is not exactly going down from the point where it stopped, do you observe a similar phenomenon?
@yukang2017 this is great and a much-needed feature. I tried your modifications to resume from a checkpoint. The loss at the beginning was around 0.21 and after 1M iterations was about 0.14. But upon restarting from the checkpoint with your modifications, the loss goes back to the starting value (0.21).
I believe there could be a bug that resets the loss value? I also checked that you save the optimizer state, so I'm not sure what this is about.
It would be great if you could please investigate.
@yukang2017 I observe the exact same issue as mentioned above. The loss goes back up. I wonder if this may be due to the EMA weights?
@achen46 I am also wondering whether this is due to the EMA weights, but I thought the EMA weights had been stored, right?
@NathanYanJing I believe so, as the saved model is quite large (~9 GB). But it could also be that we overwrite them again, hence the loss going back to its starting value.
It would be great to hear @yukang2017's opinion.
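One hedged guess at the overwrite scenario: train.py initializes the EMA copy from the raw model weights right after the model is built (the `update_ema(ema, model.module, decay=0)` call, if I'm reading it right), so a resume path that restores the EMA state before that point would have it clobbered. A minimal sketch of a guard, assuming an `args.resume` flag and a checkpoint with an `"ema"` key:

```python
# Hedged sketch: only seed the EMA from the raw weights on a fresh run, so a
# restored EMA state is not overwritten when resuming.
if args.resume:
    state = torch.load(args.resume, map_location="cpu")
    ema.load_state_dict(state["ema"])           # restore the saved EMA weights
else:
    update_ema(ema, model.module, decay=0)      # fresh run: copy initial weights into EMA
```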
@achen46 I’ve encountered the exact same issue you described earlier, and as a result I’ve created a new pull request #36. I’ve tested it on my end, and it works for me. Hopefully, it will be helpful to you as well.
Hi @Littleor thanks a lot. I will check it out and verify.