Dreambooth-Stable-Diffusion

RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cpu)

Open QueensGambit opened this issue 2 years ago • 9 comments

Hello, I am encountering a device mismatch error even though the GPUs were successfully detected. I started from the Docker container nvcr.io/nvidia/pytorch:22.09-py3.

GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
#### Data #####
train, PersonalizedBase, 4700
reg, PersonalizedBase, 15000
validation, PersonalizedBase, 47
accumulate_grad_batches = 1
++++ NOT USING LR SCALING ++++
Setting learning rate to 1.00e-06
/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/configuration_validator.py:326: LightningDeprecationWarning: Base `LightningModule.on_train_batch_start` hook signature has changed in v1.5. The `dataloader_idx` argument will be removed in v1.7.
  rank_zero_deprecation(
/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/configuration_validator.py:335: LightningDeprecationWarning: The `on_keyboard_interrupt` callback hook was deprecated in v1.5 and will be removed in v1.7. Please use the `on_exception` callback hook instead.
  rank_zero_deprecation(
/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/configuration_validator.py:391: LightningDeprecationWarning: The `Callback.on_pretrain_routine_start` hook has been deprecated in v1.6 and will be removed in v1.8. Please use `Callback.on_fit_start` instead.
  rank_zero_deprecation(
/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/configuration_validator.py:342: LightningDeprecationWarning: Base `Callback.on_train_batch_end` hook signature has changed in v1.5. The `dataloader_idx` argument will be removed in v1.7.
  rank_zero_deprecation(
Global seed set to 23
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 1 processes
----------------------------------------------------------------------------------------------------

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LatentDiffusion: Also optimizing conditioner params!

  | Name              | Type               | Params
---------------------------------------------------------
0 | model             | DiffusionWrapper   | 859 M 
1 | first_stage_model | AutoencoderKL      | 83.7 M
2 | cond_stage_model  | FrozenCLIPEmbedder | 123 M 
---------------------------------------------------------
982 M     Trainable params
83.7 M    Non-trainable params
1.1 B     Total params
4,264.941 Total estimated model params size (MB)
Project config
model:
  base_learning_rate: 1.0e-06
  target: ldm.models.diffusion.ddpm.LatentDiffusion
  params:
    reg_weight: 1.0
    linear_start: 0.00085
    linear_end: 0.012
    num_timesteps_cond: 1
    log_every_t: 200
    timesteps: 1000
    first_stage_key: image
    cond_stage_key: caption
    image_size: 64
    channels: 4
    cond_stage_trainable: true
    conditioning_key: crossattn
    monitor: val/loss_simple_ema
    scale_factor: 0.18215
    use_ema: false
    embedding_reg_weight: 0.0
    unfreeze_model: true
    model_lr: 1.0e-06
    personalization_config:
      target: ldm.modules.embedding_manager.EmbeddingManager
      params:
        placeholder_strings:
        - '*'
        initializer_words:
        - sculpture
        per_image_tokens: false
        num_vectors_per_token: 1
        progressive_words: false
    unet_config:
      target: ldm.modules.diffusionmodules.openaimodel.UNetModel
      params:
        image_size: 32
        in_channels: 4
        out_channels: 4
        model_channels: 320
        attention_resolutions:
        - 4
        - 2
        - 1
        num_res_blocks: 2
        channel_mult:
        - 1
        - 2
        - 4
        - 4
        num_heads: 8
        use_spatial_transformer: true
        transformer_depth: 1
        context_dim: 768
        use_checkpoint: true
        legacy: false
    first_stage_config:
      target: ldm.models.autoencoder.AutoencoderKL
      params:
        embed_dim: 4
        monitor: val/rec_loss
        ddconfig:
          double_z: true
          z_channels: 4
          resolution: 512
          in_channels: 3
          out_ch: 3
          ch: 128
          ch_mult:
          - 1
          - 2
          - 4
          - 4
          num_res_blocks: 2
          attn_resolutions: []
          dropout: 0.0
        lossconfig:
          target: torch.nn.Identity
    cond_stage_config:
      target: ldm.modules.encoders.modules.FrozenCLIPEmbedder
    ckpt_path: model.ckpt
data:
  target: main.DataModuleFromConfig
  params:
    batch_size: 1
    num_workers: 2
    wrap: false
    train:
      target: ldm.data.personalized.PersonalizedBase
      params:
        size: 512
        set: train
        per_image_tokens: false
        repeats: 100
        placeholder_token: person
    reg:
      target: ldm.data.personalized.PersonalizedBase
      params:
        size: 512
        set: train
        reg: true
        per_image_tokens: false
        repeats: 10
        placeholder_token: person
    validation:
      target: ldm.data.personalized.PersonalizedBase
      params:
        size: 512
        set: val
        per_image_tokens: false
        repeats: 10
        placeholder_token: person
--max_training_steps: null
'2000': null
--token: null
TestObject: null

Lightning config
modelcheckpoint:
  params:
    every_n_train_steps: 500
callbacks:
  image_logger:
    target: main.ImageLogger
    params:
      batch_frequency: 500
      max_images: 8
      increase_log_steps: false
trainer:
  benchmark: true
  max_steps: 800
  accelerator: ddp
  gpus: 4,


Sanity Checking: 0it [00:00, ?it/s]/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:240: PossibleUserWarning: The dataloader, val_dataloader 0, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 256 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
  rank_zero_warn(
Sanity Checking DataLoader 0:   0%|                                                                                                                                         | 0/2 [00:00<?, ?it/s]/opt/conda/lib/python3.8/site-packages/torch/nn/parallel/scatter_gather.py:9: UserWarning: is_namedtuple is deprecated, please use the python checks instead
  warnings.warn("is_namedtuple is deprecated, please use the python checks instead")
Summoning checkpoint.

Traceback (most recent call last):
  File "main.py", line 830, in <module>
    trainer.fit(model, data)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 770, in fit
    self._call_and_handle_interrupt(
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 721, in _call_and_handle_interrupt
    return self.strategy.launcher.launch(trainer_fn, *args, trainer=self, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 93, in launch
    return function(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 811, in _fit_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1236, in _run
    results = self._run_stage()
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1323, in _run_stage
    return self._run_train()
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1345, in _run_train
    self._run_sanity_check()
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1413, in _run_sanity_check
    val_loop.run()
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 204, in run
    self.advance(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 155, in advance
    dl_outputs = self.epoch_loop.run(self._data_fetcher, dl_max_batches, kwargs)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 204, in run
    self.advance(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 128, in advance
    output = self._evaluation_step(**kwargs)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 226, in _evaluation_step
    output = self.trainer._call_strategy_hook("validation_step", *kwargs.values())
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1765, in _call_strategy_hook
    output = fn(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/strategies/ddp.py", line 355, in validation_step
    return self.model(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1185, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1015, in forward
    output = self._run_ddp_forward(*inputs, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 976, in _run_ddp_forward
    return module_to_run(*inputs[0], **kwargs[0])
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1185, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/overrides/base.py", line 93, in forward
    return self.module.validation_step(*inputs, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/workspace/Dreambooth-Stable-Diffusion/ldm/models/diffusion/ddpm.py", line 368, in validation_step
    _, loss_dict_no_ema = self.shared_step(batch)
  File "/workspace/Dreambooth-Stable-Diffusion/ldm/models/diffusion/ddpm.py", line 908, in shared_step
    loss = self(x, c)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1185, in _call_impl
    return forward_call(*input, **kwargs)
  File "/workspace/Dreambooth-Stable-Diffusion/ldm/models/diffusion/ddpm.py", line 942, in forward
    return self.p_losses(x, c, t, *args, **kwargs)
  File "/workspace/Dreambooth-Stable-Diffusion/ldm/models/diffusion/ddpm.py", line 1093, in p_losses
    logvar_t = self.logvar[t].to(self.device)
RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cpu)

QueensGambit avatar Oct 06 '22 12:10 QueensGambit

Did you solve this issue? I'm still getting it when attempting to train. (M1 Mac)

StableInquest avatar Oct 10 '22 08:10 StableInquest

I used the following notebook instead, which worked for me: https://colab.research.google.com/github/ShivamShrirao/diffusers/blob/main/examples/dreambooth/DreamBooth_Stable_Diffusion.ipynb#scrollTo=jjcSXTp-u-Eg However, maybe someone has a solution to this problem.

QueensGambit avatar Oct 10 '22 08:10 QueensGambit

Thank you. I might have found another solution as well:

pip3 install torch==1.12.0 torchvision==0.13.0 torchaudio --extra-index-url https://download.pytorch.org/whl/cu113

Switching to those versions of torch and torchvision seems to allow training on M1 so far. I'll update if it fails in some way.

StableInquest avatar Oct 10 '22 08:10 StableInquest

Now, whenever training attempts to save anything, I get this:

miniconda/envs/web-ui/lib/python3.10/site-packages/torch/_tensor.py", line 223, in _reduce_ex_internal
    return (torch._utils._rebuild_device_tensor_from_numpy, (self.cpu().numpy(),
RuntimeError: Can't call numpy() on Tensor that requires grad. Use tensor.detach().numpy() instead.

StableInquest avatar Oct 10 '22 09:10 StableInquest

So cool to be seeing updates on this issue! I'm going through the same on a g5.2xlarge with an Nvidia A10G. How did you arrive at that pip command, @StableInquest? Was it this page?

nicolaslazo avatar Oct 10 '22 13:10 nicolaslazo

Just trial and error with different combinations.

StableInquest avatar Oct 14 '22 05:10 StableInquest

I'm having the same error now, after training 10+ models on vast.ai without a problem. Did you guys solve this?

Mozoloa avatar Oct 31 '22 09:10 Mozoloa

Like Mozoloa, this started for me a couple of days ago after training 100 models successfully. Can't seem to find an answer anywhere.

boundlessliving avatar Oct 31 '22 12:10 boundlessliving

Having the same issue. I wish I could help with the development, but I'm too short on knowledge around here now.

mikocevar avatar Oct 31 '22 17:10 mikocevar

This is the solution I found!

Create a new cell in the notebook, anywhere, and run it.

!pip uninstall torch torchvision torchaudio torchtext -y
!pip install torch==1.12.1+cu116 torchvision==0.13.1+cu116 torchaudio==0.12.1 --extra-index-url https://download.pytorch.org/whl/cu116
!pip install torchtext==0.13.1

After this, it started training! I'm yet to test out the results though!
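
To confirm that the pinned versions actually took effect after the reinstall, a quick check along these lines may help. This is only a sketch: `PINS`, `base_version`, `installed_version`, and `check_pins` are hypothetical names, not part of the repository, and the lookup relies on the standard-library `importlib.metadata` (Python 3.8+).

```python
from importlib import metadata

# Versions pinned by the pip commands above (local "+cu116" suffix omitted).
PINS = {
    "torch": "1.12.1",
    "torchvision": "0.13.1",
    "torchaudio": "0.12.1",
    "torchtext": "0.13.1",
}


def base_version(version):
    """Drop a local build suffix such as '+cu116' from a version string."""
    return version.split("+", 1)[0]


def installed_version(name):
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def check_pins(pins, lookup=installed_version):
    """Map each package that deviates from its pin to (expected, found)."""
    mismatches = {}
    for name, expected in pins.items():
        found = lookup(name)
        if found is None or base_version(found) != expected:
            mismatches[name] = (expected, found)
    return mismatches


if __name__ == "__main__":
    for name, (expected, found) in check_pins(PINS).items():
        print(f"{name}: expected {expected}, found {found}")
```

If nothing is printed, the installed packages match the pins. Restarting the notebook kernel after the reinstall is probably needed before an already-imported torch reflects the change.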

mikocevar avatar Oct 31 '22 22:10 mikocevar

That's nice, lemme know if the training completes and I'll test it out!

Mozoloa avatar Oct 31 '22 22:10 Mozoloa

Should this be done before or after setting up the environment initially?

boundlessliving avatar Oct 31 '22 22:10 boundlessliving

Tried this, but got this error:

Input In [3]
    pip uninstall torch torchvision torchaudio torchtext -y
    ^
SyntaxError: invalid syntax

boundlessliving avatar Oct 31 '22 22:10 boundlessliving

@boundlessliving I'm sorry, I forgot to add the exclamation mark before the commands. Without it, the notebook does not pass the commands to the underlying OS.

I edited my post, just recheck the post!
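
For context, the leading `!` is notebook (IPython) syntax for handing a line to the system shell; without it the line is parsed as Python, which explains the `SyntaxError`. Outside a notebook, a rough equivalent is to call pip through `subprocess`. A sketch, where `pip_command` and `run_pip` are hypothetical helper names:

```python
import subprocess
import sys


def pip_command(*args):
    """Build a pip invocation tied to the current Python interpreter."""
    return [sys.executable, "-m", "pip", *args]


def run_pip(*args):
    """Run pip and raise CalledProcessError if it exits with a non-zero status."""
    return subprocess.run(pip_command(*args), check=True)


if __name__ == "__main__":
    # Harmless check; swap in the uninstall/install arguments from the post above.
    run_pip("--version")
```

Using `sys.executable -m pip` rather than a bare `pip` keeps the install in the same environment as the running interpreter.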

mikocevar avatar Oct 31 '22 22:10 mikocevar

@Mozoloa it works! 🎉🎉🎉 Now the fine-tuning can begin!

mikocevar avatar Oct 31 '22 23:10 mikocevar

Unfortunately, I get this:

Traceback (most recent call last):
  File "/workspace/Dreambooth-Stable-Diffusion/main.py", line 819, in <module>
    data.prepare_data()
  File "/workspace/Dreambooth-Stable-Diffusion/main.py", line 278, in prepare_data
    instantiate_from_config(data_cfg)
  File "/workspace/Dreambooth-Stable-Diffusion/ldm/util.py", line 87, in instantiate_from_config
    return get_obj_from_str(config["target"])(**config.get("params", dict()), **kwargs)
  File "/workspace/Dreambooth-Stable-Diffusion/ldm/util.py", line 95, in get_obj_from_str
    return getattr(importlib.import_module(module, package=None), cls)
  File "/opt/conda/lib/python3.9/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1030, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
  File "<frozen importlib._bootstrap>", line 986, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 680, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 850, in exec_module
  File "<frozen importlib._bootstrap>", line 228, in _call_with_frames_removed
  File "/workspace/Dreambooth-Stable-Diffusion/ldm/data/personalized.py", line 8, in <module>
    from captionizer import caption_from_path, generic_captions_from_path
ModuleNotFoundError: No module named 'captionizer'

Mozoloa avatar Oct 31 '22 23:10 Mozoloa

Apparently you need to do this before everything. I'm not sure what went wrong, but rebuilding the venv again without the torch uninstallation worked.
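
A `ModuleNotFoundError` like the one above can be caught before launching training by checking that the required modules are importable in the active environment. A sketch using only the standard library; `missing_modules` is a hypothetical helper, and the module list is just an example:

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # "captionizer" is the module named in the traceback; add others as needed.
    missing = missing_modules(["captionizer", "torch"])
    if missing:
        print("Missing modules:", ", ".join(missing))
```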

Mozoloa avatar Oct 31 '22 23:10 Mozoloa

I set up the environment, and downloaded the model, then ran this code, and everything worked perfectly. So glad to be back in business! Thanks so much for this!

boundlessliving avatar Nov 01 '22 00:11 boundlessliving

Try changing the indices to be on the CPU. This solved the issue for me.

logvar_t = self.logvar[t.cpu()].to(self.device)
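
The error message says the indexed tensor (`self.logvar`) is on the CPU while the indices (`t`) are on another device. A minimal sketch of two ways to reconcile them, assuming PyTorch is installed; `logvar` and `t` here are stand-ins for the model attributes, and the batch size of 4 is illustrative:

```python
import torch

# Use the GPU when one is present; the sketch also runs on a CPU-only machine.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Stand-in for self.logvar: a plain tensor that stays on the CPU.
logvar = torch.zeros(1000)

# Stand-in for the sampled timesteps, created on the training device.
t = torch.randint(0, 1000, (4,), device=device)

# Option 1 (the fix above): move the indices to the CPU, index, then move the result.
logvar_t = logvar[t.cpu()].to(device)

# Option 2: move the indexed tensor to the indices' device before indexing.
logvar_t_alt = logvar.to(device)[t]

assert torch.equal(logvar_t, logvar_t_alt)
print(logvar_t.shape, logvar_t.device)
```

Both options give the same values. Option 1 is a one-line change at the failing line, while option 2 avoids copying the indices back to the CPU on every step.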

Aboelenien avatar Nov 30 '22 18:11 Aboelenien

Try changing the indices to be on the CPU. This solved the issue for me.

logvar_t = self.logvar[t.cpu()].to(self.device)

This works for me magically

vionwinnie avatar Dec 06 '22 21:12 vionwinnie