improved-diffusion
Resuming training does not work for multi-node, multi-GPU training
I ran into a problem when resuming training. A similar issue has been reported in the successor repo: https://github.com/openai/guided-diffusion/issues/23.
Resuming works in single-GPU mode and on a single node with multiple GPUs, but not across multiple nodes with multiple GPUs.
Are there any suggestions?
One way to tackle this is to load the model checkpoint and optimizer state before wrapping the model in DDP, as suggested in https://github.com/pytorch/pytorch/issues/23138.
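For anyone trying this, here is a minimal sketch of the load-before-DDP ordering. It is not the code from this repo: the helper names `load_checkpoint` and `build_for_resume` and the single-file checkpoint layout with keys `"model"`, `"opt"` and `"step"` are assumptions made for illustration, so adapt them to however your checkpoints are actually saved.

```python
import os
import tempfile

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def load_checkpoint(model, optimizer, ckpt_path, device):
    """Restore model and optimizer state; return the saved step.

    The keys "model", "opt" and "step" are a hypothetical layout.
    """
    # map_location places the tensors on this rank's own device instead of
    # the device they were saved from.
    ckpt = torch.load(ckpt_path, map_location=device)
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["opt"])
    return ckpt["step"]


def build_for_resume(model, optimizer, ckpt_path, device):
    """Load state into the plain module first, then wrap it in DDP."""
    model.to(device)
    step = 0
    if ckpt_path and os.path.exists(ckpt_path):
        step = load_checkpoint(model, optimizer, ckpt_path, device)
    # Wrap only after the weights are in place. Without an initialized
    # process group (e.g. a single-process run) the model is left unwrapped.
    if dist.is_available() and dist.is_initialized():
        device_ids = [device.index] if device.type == "cuda" else None
        model = DDP(model, device_ids=device_ids)
    return model, optimizer, step
```

In a real launch, every rank would call `build_for_resume` after `dist.init_process_group(...)`, and the checkpoint file has to be readable from every node (for example on a shared filesystem).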
If there are other workarounds, please leave a comment here. Thanks.
@JiamingLiu-Jeremy, this has been solved; please refer to https://github.com/openai/guided-diffusion/issues/23