
Failure to resume model through MPI load_state_dict

Open fangchuan opened this issue 2 years ago • 2 comments

Hi, thanks for your awesome work; I have used it for some interesting projects. But one thing still bothers me: I cannot resume from a model checkpoint when I run parallel training on multiple GPUs. The log looks like the screenshot below when I pass the '--resume_checkpoint' argument: the program blocks inside MPI.COMM_WORLD.bcast, so resuming my training fails...

[Screenshot: training log showing the hang at MPI.COMM_WORLD.bcast]

I thought it might be caused by TensorFlow environment dependencies, but the blocking still happens after I uninstall all TensorFlow dependencies. Could you help me figure out what's wrong with MPI.COMM_WORLD.bcast and how to solve it? If solved, my work could really be accelerated. Appreciate it! @unixpickle @prafullasd

fangchuan avatar Jun 30 '23 04:06 fangchuan
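
For context, the resume path goes through the repository's `dist_util.load_state_dict`, which reads the checkpoint once on MPI rank 0 and broadcasts the raw bytes to the other ranks. The sketch below is paraphrased from memory rather than copied verbatim from the repo, but it shows the pattern and where the reported hang occurs: mpi4py's `bcast` pickles the whole Python object, and very large checkpoints can exceed what a single broadcast handles cleanly.

```python
# Sketch of the load-and-broadcast pattern in improved_diffusion/dist_util.py
# (paraphrased from memory, not a verbatim copy of the repository).
import io

import blobfile as bf
import torch as th
from mpi4py import MPI


def load_state_dict(path, **kwargs):
    """Read the checkpoint once on rank 0 and broadcast the raw bytes."""
    if MPI.COMM_WORLD.Get_rank() == 0:
        with bf.BlobFile(path, "rb") as f:
            data = f.read()
    else:
        data = None
    # Broadcasting a large bytes object is where the reported hang occurs.
    data = MPI.COMM_WORLD.bcast(data)
    return th.load(io.BytesIO(data), **kwargs)
```
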

Hi @fangchuan, did you manage to solve the issue?

luk-st avatar Jul 28 '24 16:07 luk-st

> Hi @fangchuan, did you manage to solve the issue?

Never mind, I managed to fix it with the solution suggested here: https://github.com/openai/guided-diffusion/issues/23 (a sketch of that workaround is included after this comment).

luk-st avatar Jul 28 '24 16:07 luk-st
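
For readers hitting the same hang: the fix referenced above (guided-diffusion issue #23) boils down to avoiding the large-object MPI broadcast during checkpoint loading. A minimal sketch of that workaround, assuming the checkpoint path is visible to every rank (e.g. on a local or shared filesystem), is to let each rank read the file itself:

```python
# Workaround sketch: bypass MPI.COMM_WORLD.bcast and have every rank read the
# checkpoint directly. Assumes the checkpoint file is accessible from all
# ranks; paraphrased from the linked issue, not an official patch.
import io

import blobfile as bf
import torch as th


def load_state_dict(path, **kwargs):
    """Each rank loads the checkpoint from disk; no MPI broadcast needed."""
    with bf.BlobFile(path, "rb") as f:
        data = f.read()
    return th.load(io.BytesIO(data), **kwargs)
```

The trade-off is that every rank reads the same file redundantly, which is usually acceptable compared to a training job that cannot resume at all.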