improved-diffusion
Fail to resume model through mpi load_state_dict
Hi, thanks for your awesome work, I have been using it for some interesting projects. But one thing still bothers me: I cannot resume from my model checkpoint when I do parallel training on multiple GPUs.
When I pass the '--resume_checkpoint' argument, the program gets blocked while doing MPI.COMM_WORLD.bcast, so I fail to resume my training...
I thought it might be caused by TensorFlow environment dependencies, but the blocking still happens after I uninstall all TensorFlow dependencies.
Could you help me figure out what's wrong with MPI.COMM_WORLD.bcast and how to fix it? Solving this would really speed up my work. Appreciate it! @unixpickle @prafullasd
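For context, the hang happens in the checkpoint-loading helper, which (as far as I can tell) has rank 0 read the raw checkpoint bytes and broadcast them to the other ranks. A simplified sketch of that pattern (not the exact repo code, names assumed from dist_util):

```python
import io

import blobfile as bf
import torch as th
from mpi4py import MPI


def load_state_dict(path, **kwargs):
    # Rank 0 reads the checkpoint file once; every other rank receives
    # the raw bytes via a single MPI broadcast and deserializes locally.
    if MPI.COMM_WORLD.Get_rank() == 0:
        with bf.BlobFile(path, "rb") as f:
            data = f.read()
    else:
        data = None
    data = MPI.COMM_WORLD.bcast(data)  # <-- this is where my run blocks
    return th.load(io.BytesIO(data), **kwargs)
```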
Hi @fangchuan, did you manage to solve the issue?
Never mind, I managed to fix it with the suggested solution here: https://github.com/openai/guided-diffusion/issues/23
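For anyone hitting the same thing: as I understand that thread, the problem is broadcasting the whole checkpoint in one oversized bcast, and the fix is to avoid that (either by splitting the payload into chunks, or by simply having every rank read the checkpoint from disk itself). A rough sketch of the chunked variant I ended up with, patched into improved_diffusion/dist_util.py (exact names and chunk size are my own choices, adjust for your setup):

```python
import io

import blobfile as bf
import torch as th
from mpi4py import MPI


def load_state_dict(path, **kwargs):
    # Broadcast the checkpoint bytes in chunks that stay well below MPI's
    # message-size limits, instead of one huge bcast that can hang/fail.
    chunk_size = 2 ** 30  # 1 GiB per broadcast (assumed safe limit)
    if MPI.COMM_WORLD.Get_rank() == 0:
        with bf.BlobFile(path, "rb") as f:
            data = f.read()
        num_chunks = len(data) // chunk_size
        if len(data) % chunk_size:
            num_chunks += 1
        MPI.COMM_WORLD.bcast(num_chunks)
        for i in range(0, len(data), chunk_size):
            MPI.COMM_WORLD.bcast(data[i : i + chunk_size])
    else:
        num_chunks = MPI.COMM_WORLD.bcast(None)
        data = bytes()
        for _ in range(num_chunks):
            data += MPI.COMM_WORLD.bcast(None)
    return th.load(io.BytesIO(data), **kwargs)
```

With this change, '--resume_checkpoint' no longer hangs for me on multi-GPU training.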