Palette-Image-to-Image-Diffusion-Models
Cannot resume training on multiple GPUs
I'm trying to train an inpainting model on multiple GPUs.
The initial training run worked fine: it progressed well and saved checkpoints to the experiments/.../checkpoint folder.
However, when I try to resume the same training (by modifying "resume_state" in the config), I get this error:
Traceback (most recent call last):
  File "/.../Palette-Image-to-Image-Diffusion-Models/run.py", line 58, in main_worker
    model.train()
  File "/.../Palette-Image-to-Image-Diffusion-Models/core/base_model.py", line 45, in train
    train_log = self.train_step()
  File "/.../Palette-Image-to-Image-Diffusion-Models/models/model.py", line 111, in train_step
    self.optG.step()
  File "/.../.conda/envs/.../lib/python3.10/site-packages/torch/optim/optimizer.py", line 109, in wrapper
    return func(*args, **kwargs)
  File "/.../.conda/envs/.../lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/.../.conda/envs/.../lib/python3.10/site-packages/torch/optim/adam.py", line 157, in step
    adam(params_with_grad,
  File "/.../.conda/envs/.../lib/python3.10/site-packages/torch/optim/adam.py", line 213, in adam
    func(params,
  File "/.../.conda/envs/.../lib/python3.10/site-packages/torch/optim/adam.py", line 255, in _single_tensor_adam
    assert not step_t.is_cuda, "If capturable=False, state_steps should not be CUDA tensors."
AssertionError: If capturable=False, state_steps should not be CUDA tensors.
It seems that, when resuming a multi-GPU run, some tensors (parameters or the optimizer's internal state) are not moved to the right device.
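One possible workaround (a sketch, not something from this repo) is to move the optimizer's internal 'step' counters back to the CPU right after the resume state is loaded. In PyTorch 1.12.0, Adam stores these counters as tensors, and restoring a checkpoint onto the GPU appears to cast them to CUDA as well, which is exactly what the assertion above rejects. The helper name below is made up, and the suggested call site (right after the optimizer's load_state_dict in models/model.py) is an assumption.

import torch

def move_adam_steps_to_cpu(optimizer: torch.optim.Optimizer) -> None:
    """Move Adam's per-parameter 'step' counters back to the CPU.

    PyTorch 1.12.0 asserts that these counters are CPU tensors when capturable=False,
    so this undoes the device cast that happens when a checkpoint is loaded on GPU.
    """
    for state in optimizer.state.values():
        step = state.get('step')
        if torch.is_tensor(step) and step.is_cuda:
            state['step'] = step.cpu()

# Illustrative usage: call once after the resume state has been restored,
# e.g. move_adam_steps_to_cpu(self.optG) right after the optimizer is loaded.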
It looks like this is an issue tied to PyTorch 1.12.0. I ran into the same error, and I'm downgrading to PyTorch 1.11.0, which should solve the problem.
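If downgrading is inconvenient, another workaround that has been suggested for PyTorch 1.12.0 (again an assumption, not part of this repo) is to enable Adam's per-group 'capturable' option on the resumed optimizer, which makes it accept CUDA step counters:

# Assumption: PyTorch 1.12.0, where torch.optim.Adam exposes a 'capturable' option.
# Setting it on every param group after loading the resume state avoids the assertion,
# since CUDA step counters are then expected.
for param_group in optimizer.param_groups:  # e.g. optimizer = self.optG in models/model.py
    param_group['capturable'] = True

The capturable mode is intended for CUDA-graph capture and is reported to make the optimizer step somewhat slower, so the downgrade or the CPU-move sketch above may still be preferable.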
Feel free to reopen the issue if there are any further questions.