Dreambooth-Stable-Diffusion
RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cpu)
Hello,
I am encountering a device mismatch error even though the GPUs were successfully detected.
Started from the Docker container nvcr.io/nvidia/pytorch:22.09-py3.
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
#### Data #####
train, PersonalizedBase, 4700
reg, PersonalizedBase, 15000
validation, PersonalizedBase, 47
accumulate_grad_batches = 1
++++ NOT USING LR SCALING ++++
Setting learning rate to 1.00e-06
/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/configuration_validator.py:326: LightningDeprecationWarning: Base `LightningModule.on_train_batch_start` hook signature has changed in v1.5. The `dataloader_idx` argument will be removed in v1.7.
rank_zero_deprecation(
/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/configuration_validator.py:335: LightningDeprecationWarning: The `on_keyboard_interrupt` callback hook was deprecated in v1.5 and will be removed in v1.7. Please use the `on_exception` callback hook instead.
rank_zero_deprecation(
/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/configuration_validator.py:391: LightningDeprecationWarning: The `Callback.on_pretrain_routine_start` hook has been deprecated in v1.6 and will be removed in v1.8. Please use `Callback.on_fit_start` instead.
rank_zero_deprecation(
/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/configuration_validator.py:342: LightningDeprecationWarning: Base `Callback.on_train_batch_end` hook signature has changed in v1.5. The `dataloader_idx` argument will be removed in v1.7.
rank_zero_deprecation(
Global seed set to 23
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 1 processes
----------------------------------------------------------------------------------------------------
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LatentDiffusion: Also optimizing conditioner params!
  | Name              | Type               | Params
---------------------------------------------------------
0 | model             | DiffusionWrapper   | 859 M
1 | first_stage_model | AutoencoderKL      | 83.7 M
2 | cond_stage_model  | FrozenCLIPEmbedder | 123 M
---------------------------------------------------------
982 M     Trainable params
83.7 M    Non-trainable params
1.1 B     Total params
4,264.941 Total estimated model params size (MB)
Project config
model:
  base_learning_rate: 1.0e-06
  target: ldm.models.diffusion.ddpm.LatentDiffusion
  params:
    reg_weight: 1.0
    linear_start: 0.00085
    linear_end: 0.012
    num_timesteps_cond: 1
    log_every_t: 200
    timesteps: 1000
    first_stage_key: image
    cond_stage_key: caption
    image_size: 64
    channels: 4
    cond_stage_trainable: true
    conditioning_key: crossattn
    monitor: val/loss_simple_ema
    scale_factor: 0.18215
    use_ema: false
    embedding_reg_weight: 0.0
    unfreeze_model: true
    model_lr: 1.0e-06
    personalization_config:
      target: ldm.modules.embedding_manager.EmbeddingManager
      params:
        placeholder_strings:
        - '*'
        initializer_words:
        - sculpture
        per_image_tokens: false
        num_vectors_per_token: 1
        progressive_words: false
    unet_config:
      target: ldm.modules.diffusionmodules.openaimodel.UNetModel
      params:
        image_size: 32
        in_channels: 4
        out_channels: 4
        model_channels: 320
        attention_resolutions:
        - 4
        - 2
        - 1
        num_res_blocks: 2
        channel_mult:
        - 1
        - 2
        - 4
        - 4
        num_heads: 8
        use_spatial_transformer: true
        transformer_depth: 1
        context_dim: 768
        use_checkpoint: true
        legacy: false
    first_stage_config:
      target: ldm.models.autoencoder.AutoencoderKL
      params:
        embed_dim: 4
        monitor: val/rec_loss
        ddconfig:
          double_z: true
          z_channels: 4
          resolution: 512
          in_channels: 3
          out_ch: 3
          ch: 128
          ch_mult:
          - 1
          - 2
          - 4
          - 4
          num_res_blocks: 2
          attn_resolutions: []
          dropout: 0.0
        lossconfig:
          target: torch.nn.Identity
    cond_stage_config:
      target: ldm.modules.encoders.modules.FrozenCLIPEmbedder
    ckpt_path: model.ckpt
data:
  target: main.DataModuleFromConfig
  params:
    batch_size: 1
    num_workers: 2
    wrap: false
    train:
      target: ldm.data.personalized.PersonalizedBase
      params:
        size: 512
        set: train
        per_image_tokens: false
        repeats: 100
        placeholder_token: person
    reg:
      target: ldm.data.personalized.PersonalizedBase
      params:
        size: 512
        set: train
        reg: true
        per_image_tokens: false
        repeats: 10
        placeholder_token: person
    validation:
      target: ldm.data.personalized.PersonalizedBase
      params:
        size: 512
        set: val
        per_image_tokens: false
        repeats: 10
        placeholder_token: person
--max_training_steps: null
'2000': null
--token: null
TestObject: null
Lightning config
modelcheckpoint:
  params:
    every_n_train_steps: 500
callbacks:
  image_logger:
    target: main.ImageLogger
    params:
      batch_frequency: 500
      max_images: 8
      increase_log_steps: false
trainer:
  benchmark: true
  max_steps: 800
  accelerator: ddp
  gpus: 4,
Sanity Checking: 0it [00:00, ?it/s]/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:240: PossibleUserWarning: The dataloader, val_dataloader 0, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 256 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
rank_zero_warn(
Sanity Checking DataLoader 0: 0%| | 0/2 [00:00<?, ?it/s]/opt/conda/lib/python3.8/site-packages/torch/nn/parallel/scatter_gather.py:9: UserWarning: is_namedtuple is deprecated, please use the python checks instead
warnings.warn("is_namedtuple is deprecated, please use the python checks instead")
Summoning checkpoint.
Traceback (most recent call last):
  File "main.py", line 830, in <module>
    trainer.fit(model, data)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 770, in fit
    self._call_and_handle_interrupt(
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 721, in _call_and_handle_interrupt
    return self.strategy.launcher.launch(trainer_fn, *args, trainer=self, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 93, in launch
    return function(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 811, in _fit_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1236, in _run
    results = self._run_stage()
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1323, in _run_stage
    return self._run_train()
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1345, in _run_train
    self._run_sanity_check()
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1413, in _run_sanity_check
    val_loop.run()
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 204, in run
    self.advance(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 155, in advance
    dl_outputs = self.epoch_loop.run(self._data_fetcher, dl_max_batches, kwargs)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 204, in run
    self.advance(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 128, in advance
    output = self._evaluation_step(**kwargs)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 226, in _evaluation_step
    output = self.trainer._call_strategy_hook("validation_step", *kwargs.values())
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1765, in _call_strategy_hook
    output = fn(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/strategies/ddp.py", line 355, in validation_step
    return self.model(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1185, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1015, in forward
    output = self._run_ddp_forward(*inputs, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 976, in _run_ddp_forward
    return module_to_run(*inputs[0], **kwargs[0])
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1185, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/overrides/base.py", line 93, in forward
    return self.module.validation_step(*inputs, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/workspace/Dreambooth-Stable-Diffusion/ldm/models/diffusion/ddpm.py", line 368, in validation_step
    _, loss_dict_no_ema = self.shared_step(batch)
  File "/workspace/Dreambooth-Stable-Diffusion/ldm/models/diffusion/ddpm.py", line 908, in shared_step
    loss = self(x, c)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1185, in _call_impl
    return forward_call(*input, **kwargs)
  File "/workspace/Dreambooth-Stable-Diffusion/ldm/models/diffusion/ddpm.py", line 942, in forward
    return self.p_losses(x, c, t, *args, **kwargs)
  File "/workspace/Dreambooth-Stable-Diffusion/ldm/models/diffusion/ddpm.py", line 1093, in p_losses
    logvar_t = self.logvar[t].to(self.device)
RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cpu)
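For context: the 22.09 NGC container ships a 1.13-series PyTorch build, and recent PyTorch enforces that index tensors live either on the CPU or on the same device as the tensor being indexed. In ddpm.py, self.logvar is created on the CPU while the timesteps t are sampled on the GPU, which is exactly this situation. A minimal sketch of the mismatch (assuming a CUDA device is available):

import torch

# Minimal sketch of the device mismatch (assumes a CUDA device).
# self.logvar in ddpm.py stays on the CPU, while the timesteps t are
# sampled on the GPU; recent PyTorch builds reject this combination.
logvar = torch.zeros(1000)                       # CPU tensor, like self.logvar
t = torch.randint(0, 1000, (4,), device="cuda")  # CUDA indices, like the sampled timesteps
logvar_t = logvar[t]  # raises: indices should be either on cpu or on the same device ...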
Did you solve this issue? I'm still getting it when attempting to train. (M1 Mac)
I used the following notebook instead, which worked for me: https://colab.research.google.com/github/ShivamShrirao/diffusers/blob/main/examples/dreambooth/DreamBooth_Stable_Diffusion.ipynb#scrollTo=jjcSXTp-u-Eg
However, maybe someone has a solution to this problem.
Thank you, I might have found another solution as well:
pip3 install torch==1.12.0 torchvision==0.13.0 torchaudio --extra-index-url https://download.pytorch.org/whl/cu113
Switching to those versions of torch and torchvision seems to allow training on M1 so far. I'll update if it fails in some way.
Now, when training attempts to save anything, I get this:
File "miniconda/envs/web-ui/lib/python3.10/site-packages/torch/_tensor.py", line 223, in _reduce_ex_internal
    return (torch._utils._rebuild_device_tensor_from_numpy, (self.cpu().numpy(),
RuntimeError: Can't call numpy() on Tensor that requires grad. Use tensor.detach().numpy() instead.
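That save-time error is unrelated to devices: checkpointing tries to convert a tensor that is still attached to the autograd graph. The message itself names the fix; a minimal illustration:

import torch

t = torch.ones(3, requires_grad=True)
# t.numpy()  # raises: Can't call numpy() on Tensor that requires grad.
arr = t.detach().numpy()  # detach from the autograd graph first, as the message suggests
print(arr)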
So cool to be seeing updates on this issue! I'm going through the same on a g5.2xlarge with an NVIDIA A10G. How did you arrive at that pip command, @StableInquest? Was it this page?
Just trial and error with different combinations.
I'm having the same error now, after training 10+ models on vast.ai without a problem. Did you guys solve this?
Like Mozoloa, this started for me a couple of days ago after training 100 models successfully. Can't seem to find an answer anywhere.
Having the same issue. I wish I could help with the development, but I don't have enough knowledge in this area yet.
This is the solution I found!
Create a new cell in the notebook, anywhere, and run it.
!pip uninstall torch torchvision torchaudio torchtext -y
!pip install torch==1.12.1+cu116 torchvision==0.13.1+cu116 torchaudio==0.12.1 --extra-index-url https://download.pytorch.org/whl/cu116
!pip install torchtext==0.13.1
After this, it started training! I'm yet to test out the results though!
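If it helps, here is a hypothetical sanity-check cell (not part of the original notebook) to confirm the pins took effect before training:

import torch
import torchvision

# Hypothetical sanity check: confirm the pinned versions took effect.
print(torch.__version__)          # expect 1.12.1+cu116
print(torchvision.__version__)    # expect 0.13.1+cu116
print(torch.cuda.is_available())  # expect True on a GPU instance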
That's nice, lemme know if the training completes and I'll test it out!
Should this be done before or after setting up the environment initially?
Tried this, but got this error:
Input In [3]
    pip uninstall torch torchvision torchaudio torchtext -y
    ^
SyntaxError: invalid syntax
@boundlessliving I'm sorry, I forgot to add the exclamation mark before the commands, so the underlying OS gets the commands.
I edited my post, just recheck the post!
@Mozoloa it works! 🎉🎉🎉 Now the fine-tuning can begin!
Unfortunately I get this:
Traceback (most recent call last):
  File "/workspace/Dreambooth-Stable-Diffusion/main.py", line 819, in <module>
    data.prepare_data()
  File "/workspace/Dreambooth-Stable-Diffusion/main.py", line 278, in prepare_data
    instantiate_from_config(data_cfg)
  File "/workspace/Dreambooth-Stable-Diffusion/ldm/util.py", line 87, in instantiate_from_config
    return get_obj_from_str(config["target"])(**config.get("params", dict()), **kwargs)
  File "/workspace/Dreambooth-Stable-Diffusion/ldm/util.py", line 95, in get_obj_from_str
    return getattr(importlib.import_module(module, package=None), cls)
  File "/opt/conda/lib/python3.9/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1030, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
  File "<frozen importlib._bootstrap>", line 986, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 680, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 850, in exec_module
  File "<frozen importlib._bootstrap>", line 228, in _call_with_frames_removed
  File "/workspace/Dreambooth-Stable-Diffusion/ldm/data/personalized.py", line 8, in <module>
    from captionizer import caption_from_path, generic_captions_from_path
ModuleNotFoundError: No module named 'captionizer'
Apparently you need to do this before everything else. I'm not sure what went wrong, but rebuilding the venv again without the torch uninstallation worked.
I set up the environment and downloaded the model, then ran this code, and everything worked perfectly. So glad to be back in business! Thanks so much for this!
Try changing the indices to be on the CPU. This solved the issue for me.
logvar_t = self.logvar[t.cpu()].to(self.device)
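For anyone patching by hand, the change lands in ldm/models/diffusion/ddpm.py, in p_losses (line 1093 per the traceback above); a sketch of the before/after:

# In ldm/models/diffusion/ddpm.py, p_losses (line 1093 per the traceback above):
# before -- CUDA timesteps index the CPU-resident logvar buffer and crash:
#   logvar_t = self.logvar[t].to(self.device)
# after -- index with CPU indices, then move the result to the model's device:
logvar_t = self.logvar[t.cpu()].to(self.device)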
> Try changing the indices to be on the CPU. This solved the issue for me.
> logvar_t = self.logvar[t.cpu()].to(self.device)
This works for me magically