improved-diffusion

Resuming training does not work for multi-GPU training

Open JiamingLiu-Jeremy opened this issue 2 years ago • 2 comments

I ran into a problem when resuming training. A similar issue has been reported in the successor repo: https://github.com/openai/guided-diffusion/issues/23.

Resuming works in single-GPU mode and on a single node with multiple GPUs, but not across multiple nodes with multiple GPUs.

Are there any suggestions?

JiamingLiu-Jeremy, Jun 17 '22 16:06

One way to tackle this is to load the checkpoint and optimizer state first, before wrapping the model in DDP, as suggested in https://github.com/pytorch/pytorch/issues/23138.
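A minimal sketch of that load-before-DDP ordering, in case it helps. This is not the actual improved-diffusion code: `build_and_resume`, the `"model"` / `"optimizer"` checkpoint keys, and the `nn.Linear` stand-in model are all hypothetical, and it assumes the process group is already initialized and the checkpoint file is readable from every node.

```python
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def build_and_resume(checkpoint_path, device):
    """Hypothetical helper: restore state on the plain module, wrap in DDP last."""
    model = nn.Linear(4, 2).to(device)  # stand-in for the real network
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    if checkpoint_path is not None and os.path.exists(checkpoint_path):
        # map_location puts the tensors on this rank's own device rather than
        # on whichever device they were saved from.
        state = torch.load(checkpoint_path, map_location=device)
        model.load_state_dict(state["model"])          # assumed key
        optimizer.load_state_dict(state["optimizer"])  # assumed key

    # Wrap only after the weights are in place. DDP synchronizes module state
    # from rank 0 when it is constructed, so all ranks start out identical.
    device_ids = [device.index] if device.type == "cuda" else None
    ddp_model = DDP(model, device_ids=device_ids)
    return ddp_model, optimizer
```

Note that the optimizer state is not synchronized by DDP, which is why every rank loads it from the file here; if the nodes do not share a filesystem, the checkpoint would need to be copied to each node first.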

If there are other workarounds, please leave a comment here. Thanks.

JiamingLiu-Jeremy, Jun 17 '22 21:06

@JiamingLiu-Jeremy, this has been solved; please refer to https://github.com/openai/guided-diffusion/issues/23

forever208, Jun 27 '22 20:06