Multi-GPU is broken
2x3090 instance on Runpod, using the Runpod notebook on their Stable Diffusion image. Training on GPU 0 works fine, but I can't train on GPUs 0 and 1 together, nor start a second training on GPU 1 by itself.
Here is what happens when a training run is already going on GPU 0 and I try to start a separate run on GPU 1. It looks like GPU 0 is hardcoded somewhere.
!python "main.py" \
--base configs/stable-diffusion/v1-finetune_unfrozen.yaml \
-t \
--actual_resume "model.ckpt" \
--reg_data_root "{reg_data_root}" \
-n "{project_name}" \
--gpus 1, \
--data_root "/workspace/Dreambooth-Stable-Diffusion/MS" \
--max_training_steps {max_training_steps} \
--class_word "{class_word}" \
--token "{token}" \
--no-test
.....
Traceback (most recent call last):
  File "main.py", line 665, in <module>
    model = load_model_from_config(config, opt.actual_resume)
  File "main.py", line 42, in load_model_from_config
    model.cuda()
  File "/venv/lib/python3.8/site-packages/pytorch_lightning/core/mixins/device_dtype_mixin.py", line 138, in cuda
    return super().cuda(device=device)
  File "/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 688, in cuda
    return self._apply(lambda t: t.cuda(device))
  File "/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 578, in _apply
    module._apply(fn)
  File "/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 578, in _apply
    module._apply(fn)
  File "/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 578, in _apply
    module._apply(fn)
  [Previous line repeated 1 more time]
  File "/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 601, in _apply
    param_applied = fn(param)
  File "/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 688, in <lambda>
    return self._apply(lambda t: t.cuda(device))
RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 23.70 GiB total capacity; 0 bytes already allocated; 13.56 MiB free; 0 bytes reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "main.py", line 883, in <module>
    if trainer.global_rank == 0:
NameError: name 'trainer' is not defined
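For reference, the crash happens inside load_model_from_config, i.e. before any Lightning Trainer exists, so the --gpus selection hasn't been applied yet and the bare model.cuda() always lands on cuda:0. Assuming the 1.x-era pytorch_lightning shown in the traceback, the flag itself is only a device selector for the Trainer, roughly:

from pytorch_lightning import Trainer

# Rough illustration of how a 1.x Trainer reads the gpus argument (assumed, not code from main.py):
Trainer(gpus=[1])     # what "--gpus 1," parses to: use device index 1 only
Trainer(gpus=1)       # use one GPU, which in practice means cuda:0
Trainer(gpus=[0, 1])  # what "--gpus 0,1" parses to: use both cards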
(The NameError at the end is only a secondary failure: the OOM is raised before trainer is ever assigned, so the cleanup code at line 883 can't reference it.) Something like this would fix the device problem, no? Pass gpuinfo in when load_model_from_config is called in main.py:
def load_model_from_config(config, gpuinfo, ckpt, verbose=False):
    print(f"Loading model from {ckpt}")
    pl_sd = torch.load(ckpt, map_location="cpu")
    sd = pl_sd["state_dict"]
    config.model.params.ckpt_path = ckpt
    model = instantiate_from_config(config.model)
    m, u = model.load_state_dict(sd, strict=False)
    if len(m) > 0 and verbose:
        print("missing keys:")
        print(m)
    if len(u) > 0 and verbose:
        print("unexpected keys:")
        print(u)
    # Send the model to the GPU index that was requested (e.g. "1," -> "cuda:1")
    # instead of the unconditional model.cuda(), which always means cuda:0.
    device = torch.device("cuda:" + str(gpuinfo).rstrip(",")) if torch.cuda.is_available() else torch.device("cpu")
    model.to(device)
    model.eval()
    return model
A bare model.cuda() with no device argument always targets the current device, which is cuda:0 by default.
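A minimal sketch of the matching call-site change in main.py, assuming opt.gpus holds the raw --gpus string (e.g. "1,"); the names here are illustrative, not verbatim from the repo:

# Hypothetical call site: forward the --gpus value so the checkpoint is loaded onto the requested card.
model = load_model_from_config(config, opt.gpus, opt.actual_resume)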
Is multi GPU supposed to be supported?
Potentially related (NameError: name 'trainer' is not defined):
- https://github.com/JoePenna/Dreambooth-Stable-Diffusion/issues/28
- https://github.com/JoePenna/Dreambooth-Stable-Diffusion/issues/53
- https://github.com/JoePenna/Dreambooth-Stable-Diffusion/issues/86
- https://github.com/JoePenna/Dreambooth-Stable-Diffusion/issues/87
I went to line 896 in main.py and changed "trainer" to "Trainer", and now it's working
Originally posted by @Pegaxsus in https://github.com/JoePenna/Dreambooth-Stable-Diffusion/issues/86#issuecomment-1295861499
Something like this would fix it, no? Pass gpuinfo in when load_model_from_config is called in main.py.
The following seems to give a solution:
- https://datascience.stackexchange.com/questions/54907/model-cuda-in-pytorch
model.cuda() by default will send your model to the "current device", which can be set with torch.cuda.set_device(device). An alternative way to send the model to a specific device is model.to(torch.device('cuda:0')). This, of course, is subject to the device visibility specified in the environment variable CUDA_VISIBLE_DEVICES. You can check GPU usage with nvidia-smi. Also, nvtop is very nice for this.
The standard way in PyTorch to train a model on multiple GPUs is to use nn.DataParallel, which copies the model to the GPUs and, during training, splits the batch among them and combines the individual outputs.
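To make the first part of that concrete for this repo, here is a minimal sketch (my assumed workaround, not something main.py does today) of the two ways to pin a second run to the card at index 1; the DataParallel route is less relevant here, since the repo drives multi-GPU through Lightning's --gpus flag instead:

import os
import torch

# Option 1 (must take effect before CUDA is initialized): hide GPU 0 from this
# process entirely; the remaining 3090 then shows up as cuda:0, so the existing
# unconditional model.cuda() works unchanged.
# os.environ["CUDA_VISIBLE_DEVICES"] = "1"

# Option 2: keep both cards visible but switch the "current device", so that a
# bare model.cuda() lands on cuda:1 instead of cuda:0.
torch.cuda.set_device(1)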
Following in the hope this gets supported :)
No plans to support this, but PRs welcome if you can figure it out