Palette-Image-to-Image-Diffusion-Models
Cannot resume training on multiple GPUs
I'm trying to train an inpainting model on multiple GPUs.
The initial training run worked fine: it progressed well and saved checkpoints to the experiments/.../checkpoint folder.
However, when I try to resume the same training (by modifying "resume_state" in the config), I get this error:
Traceback (most recent call last):
  File "/.../Palette-Image-to-Image-Diffusion-Models/run.py", line 58, in main_worker
    model.train()
  File "/.../Palette-Image-to-Image-Diffusion-Models/core/base_model.py", line 45, in train
    train_log = self.train_step()
  File "/.../Palette-Image-to-Image-Diffusion-Models/models/model.py", line 111, in train_step
    self.optG.step()
  File "/.../.conda/envs/.../lib/python3.10/site-packages/torch/optim/optimizer.py", line 109, in wrapper
    return func(*args, **kwargs)
  File "/.../.conda/envs/.../lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/.../.conda/envs/.../lib/python3.10/site-packages/torch/optim/adam.py", line 157, in step
    adam(params_with_grad,
  File "/.../.conda/envs/.../lib/python3.10/site-packages/torch/optim/adam.py", line 213, in adam
    func(params,
  File "/.../.conda/envs/.../lib/python3.10/site-packages/torch/optim/adam.py", line 255, in _single_tensor_adam
    assert not step_t.is_cuda, "If capturable=False, state_steps should not be CUDA tensors."
AssertionError: If capturable=False, state_steps should not be CUDA tensors.
It seems that, when resuming a multi-GPU run, some tensors (parameters or the optimizer's internal state) are not moved to the right device.
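One possible workaround (a sketch, not something from this repo) is to move the optimizer's internal 'step' counters back to the CPU right after the resume state is loaded. In PyTorch 1.12.0, Adam stores these counters as tensors, and restoring a checkpoint onto the GPU appears to cast them to CUDA as well, which is exactly what the assertion above rejects. The helper name below is made up, and the suggested call site (right after the optimizer's load_state_dict in models/model.py) is an assumption.

import torch

def move_adam_steps_to_cpu(optimizer: torch.optim.Optimizer) -> None:
    """Move Adam's per-parameter 'step' counters back to the CPU.

    PyTorch 1.12.0 asserts that these counters are CPU tensors when capturable=False,
    so this undoes the device cast that happens when a checkpoint is loaded on GPU.
    """
    for state in optimizer.state.values():
        step = state.get('step')
        if torch.is_tensor(step) and step.is_cuda:
            state['step'] = step.cpu()

# Illustrative usage: call once after the resume state has been restored,
# e.g. move_adam_steps_to_cpu(self.optG) right after the optimizer is loaded.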
It looks like this is an issue tied to PyTorch 1.12.0. I ran into the same error, and I'm downgrading to PyTorch 1.11.0, which should solve the problem.
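If downgrading is inconvenient, another workaround that has been suggested for PyTorch 1.12.0 (again an assumption, not part of this repo) is to enable Adam's per-group 'capturable' option on the resumed optimizer, which makes it accept CUDA step counters:

# Assumption: PyTorch 1.12.0, where torch.optim.Adam exposes a 'capturable' option.
# Setting it on every param group after loading the resume state avoids the assertion,
# since CUDA step counters are then expected.
for param_group in optimizer.param_groups:  # e.g. optimizer = self.optG in models/model.py
    param_group['capturable'] = True

The capturable mode is intended for CUDA-graph capture and is reported to make the optimizer step somewhat slower, so the downgrade or the CPU-move sketch above may still be preferable.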
Feel free to reopen the issue if there are any further questions.