stable-diffusion-webui
[Bug]: RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cpu)
Is there an existing issue for this?
- [X] I have searched the existing issues and checked the recent builds/commits
What happened?
When training an embedding on the NAI model, the error below is raised.
Steps to reproduce the problem
What should have happened?
No error
Commit where the problem happens
737eb28faca8be2bb996ee0930ec77d1f7ebd939
What platforms do you use to access UI ?
Windows
What browsers do you use to access the UI ?
Google Chrome
Command Line Arguments
--force-enable-xformers --listen --deepdanbooru --api --nowebui --disable-safe-unpickle
Additional information, context and logs
Running on RTX 3080Ti, CUDA works fine for inference.
Preparing dataset...
100%|████████████████████████████████████████████████████████████████████████████████| 196/196 [00:05<00:00, 37.76it/s]
Training at rate of 0.005 until step 100
0%| | 0/100000 [00:00<?, ?it/s]
Applying xformers cross attention optimization.
Error completing request
Arguments: ('nahida', '0.005:100, 1e-3:1000, 1e-5', 4, 'C:\\Users\\alien\\Downloads\\Compressed\\nahida_processed', 'textual_inversion', 512, 512, 100000, 500, 500, 'E:\\PyCharmProjects\\stable-diffusion-webui\\textual_inversion_templates\\nahida.txt', True, True, 'best quality, masterpiece, highres, original, extremely detailed, wallpaper, nahida', 'lowres, bad anatomy, bad hands, text, error, missing fingers, bad feet, extra digit, fewer digits, cropped, worst quality, low quality, normal quality, jpeg artifacts, signature, watermark, username, blurry, artist name', 28, 0, 8, -1.0, 512, 512) {}
Traceback (most recent call last):
File "E:\PyCharmProjects\stable-diffusion-webui\modules\ui.py", line 221, in f
res = list(func(*args, **kwargs))
File "E:\PyCharmProjects\stable-diffusion-webui\webui.py", line 63, in f
res = func(*args, **kwargs)
File "E:\PyCharmProjects\stable-diffusion-webui\modules\textual_inversion\ui.py", line 31, in train_embedding
embedding, filename = modules.textual_inversion.textual_inversion.train_embedding(*args)
File "E:\PyCharmProjects\stable-diffusion-webui\modules\textual_inversion\textual_inversion.py", line 276, in train_embedding
loss = shared.sd_model(x, c)[0]
File "C:\Program Files\Python39\lib\site-packages\torch\nn\modules\module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "E:\PyCharmProjects\stable-diffusion-webui\repositories\stable-diffusion\ldm\models\diffusion\ddpm.py", line 879, in forward
return self.p_losses(x, c, t, *args, **kwargs)
File "E:\PyCharmProjects\stable-diffusion-webui\repositories\stable-diffusion\ldm\models\diffusion\ddpm.py", line 1030, in p_losses
logvar_t = self.logvar[t].to(self.device)
RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cpu)
I can work around it by adding t = t.to('cpu') above the line that raises the error, but it takes up more RAM and this should still be treated as a bug.
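For context, recent PyTorch builds (1.13 at the time of writing) raise exactly this error when a CPU tensor is indexed with CUDA indices, which seems to be what happens here: self.logvar lives on the CPU while the sampled timesteps t are on the GPU. A minimal, self-contained sketch of the failure and of the workaround above (tensor names are illustrative; assumes a CUDA device is available):

import torch

logvar = torch.zeros(1000)                        # CPU tensor, standing in for self.logvar in ddpm.py
t = torch.randint(0, 1000, (4,), device='cuda')   # CUDA timestep indices, standing in for t in p_losses()
# logvar[t]                                       # RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cpu)
logvar_t = logvar[t.to('cpu')]                    # workaround: move the indices to the CPU before indexing
print(logvar_t)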
After some git pulls it somehow resolved itself.
Oh, was this fixed in a recent PR? Because my version was from 3 days ago.
The exact same issue here, while training textual inversion. My platform is an AMD 6800 XT, ROCm 5.2, PyTorch nightly, Ubuntu.
Same issue, on a fresh environment with a cloud-based machine. Edit, resolved by clearing up torch package version mismatches/reinstalling torch.
@rapidcopy Can you elaborate more on what you did to solve it? I have tried switching around torch 0.12-0.14 and have not managed to solve the issue.
edit: solved by pip uninstall torch torchvision functorch (within the venv) and re-running ./webui.sh
LOL sup I'm from IRS too
Anyway, are you sure you resolved it, or did you just unintentionally prevent it? If you do this and run webui.sh, it will install the CPU build of torch, and with all your tensors on the CPU you obviously won't hit this issue, but then you cannot use the GPU at all.
:wave:
I don't understand the code, but my GPU is currently running at 100% as I run textual inversion training, so I'm not sure.
Oh, my mistake, I mixed it up with another script that only installs the CPU build of torch. I think it is a CUDA version issue then: I am running torch built with CUDA 11.7, but if you install via webui.sh, it installs torch built for CUDA 11.3.
According to nvcc --version I am on CUDA 11.7 too (updated after attempting to use xformers), so I have no idea why anything is working on my end.
Your pytorch package is built against CUDA 11.3, but your environment is 11.7. It works because of CUDA minor version compatibility. If you look at https://github.com/AUTOMATIC1111/stable-diffusion-webui/blob/d4790fa6db39e9f5960110326c5597b080d3b8dd/launch.py#L108 it installs torch+cu113, where cu113 means CUDA 11.3.
What I am referring to is that my pytorch is +cu117 (I had it installed before setting up this webui), so it is built against CUDA 11.7, which means this whole issue is more likely a PyTorch-side one specific to the CUDA 11.7 build. It may also be an issue with torch 1.13, since it was just released yesterday and the stable diffusion repo this webui uses was last updated in August.
I have CUDA 11.8 installed, tried the PyTorch stable +cu117 build, and when I start training it gives me 'RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cpu)'. I then uninstalled PyTorch, reinstalled with the cu113 build, and it worked. So I think it might be a problem with cu117 compatibility.
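If you are unsure which build of torch your venv actually ended up with before or after reinstalling, a quick check from inside the venv is the snippet below (the values in the comments are only examples):

import torch

print(torch.__version__)           # e.g. 1.12.1+cu113 or 1.13.0+cu117; the suffix is the CUDA toolkit the wheel was built against
print(torch.version.cuda)          # CUDA version bundled with the wheel, independent of what nvcc --version reports
print(torch.cuda.is_available())   # should be True; False means a CPU-only build ended up installed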
Same error on RX6900XT + rocm 5.2. Fresh pull. Python 3.9. Reinstall (with whole new venv) didn't change a thing.
Might it be possible that part of the model/script/anything got compiled before the upgrade, and now the two parts are not compatible?
happening to me right now. RX6600. just did a git pull.
I manually did the t.cpu() and it was fixed. When you look at the PyTorch 1.13 GitHub there are quite a few projects that are affected by this change. The bug is still in PyTorch's court, but it seems they are aiming at fixing all the dependent projects (and keeping the warning).
For people who are looking for a quick solution and want more specific detail than just being told that a line to fix it was added somewhere, here is the change I made. It has the benefit that it does not force t onto the CPU, for slightly better device residency.
diff --git a/repositories/stable-diffusion-stability-ai/ldm/models/diffusion/ddpm.py b/repositories/stable-diffusion-stability-ai/ldm/models/diffusion/ddpm.py
--- a/repositories/stable-diffusion-stability-ai/ldm/models/diffusion/ddpm.py
+++ b/repositories/stable-diffusion-stability-ai/ldm/models/diffusion/ddpm.py
@@ -900,6 +900,7 @@ class LatentDiffusion(DDPM):
         loss_simple = self.get_loss(model_output, target, mean=False).mean([1, 2, 3])
         loss_dict.update({f'{prefix}/loss_simple': loss_simple.mean()})
+        self.logvar = self.logvar.to(self.device)
         logvar_t = self.logvar[t].to(self.device)
         loss = loss_simple / torch.exp(logvar_t) + logvar_t
         # loss = loss_simple / torch.exp(self.logvar) + self.logvar
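To spell out why that one line helps: self.logvar appears to be created as a plain tensor on the CPU (not a registered buffer, so model.to(device) never moves it), while the sampled timesteps t are on the GPU, and the indexing crosses devices. Moving the buffer to the indices' device (this diff) or the indices to the buffer's device (the t.to('cpu') workaround earlier in the thread) should both resolve it. A standalone sketch of the first approach, with illustrative names:

import torch

device = torch.device('cuda')
logvar = torch.zeros(1000)                       # CPU buffer, standing in for LatentDiffusion.logvar
t = torch.randint(0, 1000, (4,), device=device)  # CUDA timestep indices

logvar = logvar.to(device)                       # the added line: move the buffer onto the model's device
logvar_t = logvar[t]                             # indexing now happens entirely on the GPU
print(logvar_t)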
I actually came across this exact suggested fix and it did not work. It still tries to place self.logvar on my GPU. Also, this is itself just a 'line to fix it added somewhere'.
I just ran into this tonight on Colab. I no longer can train anything thanks to this error.
Same issue for when training Hypernetwork on Colab
Yes, I ran into it at first while trying to train an HN, then I decided to check whether it was just HN (it was my first time); TI as well as DB hit this error too.
@glop102 This line fixed it for me
@glop102 Worked for me on Colab as well.
Thank you.
edit: For Colab users, this is where glop102's code needs to be inserted -> /content/gdrive/MyDrive/sd/stablediffusion/ldm/models/diffusion/ddpm.py
Closing as stale.