
[Bug]: RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cpu)

Open aliencaocao opened this issue 2 years ago • 22 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues and checked the recent builds/commits

What happened?

The error occurs when training a textual inversion embedding on the NAI model.

Steps to reproduce the problem

[screenshot]

What should have happened?

No error

Commit where the problem happens

737eb28faca8be2bb996ee0930ec77d1f7ebd939

What platforms do you use to access the UI?

Windows

What browsers do you use to access the UI?

Google Chrome

Command Line Arguments

--force-enable-xformers --listen --deepdanbooru --api --nowebui --disable-safe-unpickle

Additional information, context and logs

Running on RTX 3080Ti, CUDA works fine for inference.

Preparing dataset...
100%|████████████████████████████████████████████████████████████████████████████████| 196/196 [00:05<00:00, 37.76it/s]
Training at rate of 0.005 until step 100
  0%|                                                                                       | 0/100000 [00:00<?, ?it/s]
Applying xformers cross attention optimization.
Error completing request
Arguments: ('nahida', '0.005:100, 1e-3:1000, 1e-5', 4, 'C:\\Users\\alien\\Downloads\\Compressed\\nahida_processed', 'textual_inversion', 512, 512, 100000, 500, 500, 'E:\\PyCharmProjects\\stable-diffusion-webui\\textual_inversion_templates\\nahida.txt', True, True, 'best quality, masterpiece, highres, original, extremely detailed, wallpaper, nahida', 'lowres, bad anatomy, bad hands, text, error, missing fingers, bad feet, extra digit, fewer digits, cropped, worst quality, low quality, normal quality, jpeg artifacts, signature, watermark, username, blurry, artist name', 28, 0, 8, -1.0, 512, 512) {}
Traceback (most recent call last):
  File "E:\PyCharmProjects\stable-diffusion-webui\modules\ui.py", line 221, in f
    res = list(func(*args, **kwargs))
  File "E:\PyCharmProjects\stable-diffusion-webui\webui.py", line 63, in f
    res = func(*args, **kwargs)
  File "E:\PyCharmProjects\stable-diffusion-webui\modules\textual_inversion\ui.py", line 31, in train_embedding
    embedding, filename = modules.textual_inversion.textual_inversion.train_embedding(*args)
  File "E:\PyCharmProjects\stable-diffusion-webui\modules\textual_inversion\textual_inversion.py", line 276, in train_embedding
    loss = shared.sd_model(x, c)[0]
  File "C:\Program Files\Python39\lib\site-packages\torch\nn\modules\module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "E:\PyCharmProjects\stable-diffusion-webui\repositories\stable-diffusion\ldm\models\diffusion\ddpm.py", line 879, in forward
    return self.p_losses(x, c, t, *args, **kwargs)
  File "E:\PyCharmProjects\stable-diffusion-webui\repositories\stable-diffusion\ldm\models\diffusion\ddpm.py", line 1030, in p_losses
    logvar_t = self.logvar[t].to(self.device)
RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cpu)

aliencaocao avatar Oct 30 '22 02:10 aliencaocao

I can work around it by adding t = t.to('cpu') above the line where the error occurs, but that uses more RAM, and this should still be treated as a bug.
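
For reference, this is roughly what the edit looks like inside p_losses() in repositories/stable-diffusion/ldm/models/diffusion/ddpm.py (a minimal sketch of my local change, assuming the surrounding lines match the traceback above; exact line numbers will differ between checkouts):

# workaround sketch, not an official fix
t = t.to('cpu')                            # move the timestep indices next to self.logvar, which lives on the CPU here
logvar_t = self.logvar[t].to(self.device)  # index on the CPU, then move the result back to the model's device
loss = loss_simple / torch.exp(logvar_t) + logvar_t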

aliencaocao avatar Oct 30 '22 03:10 aliencaocao

After a few git pulls it somehow resolved itself.

jordanjalles avatar Oct 30 '22 04:10 jordanjalles

Oh, was this fixed in a recent PR? My version is from 3 days ago.

aliencaocao avatar Oct 30 '22 04:10 aliencaocao

The exact same issue here while training textual inversion. My platform is an AMD 6800 XT, ROCm 5.2, PyTorch nightly, Ubuntu.

Arc10p avatar Oct 30 '22 08:10 Arc10p

Same issue, on a fresh environment on a cloud-based machine. Edit: resolved by clearing up torch package version mismatches and reinstalling torch.

rabidcopy avatar Oct 30 '22 21:10 rabidcopy

Same issue, on a fresh environment on a cloud-based machine. Edit: resolved by clearing up torch package version mismatches and reinstalling torch.

@rabidcopy Can you elaborate more on what you did to solve it? I have tried switching around torch versions 1.12-1.14 and have not managed to solve the issue.

edit: solved by running pip uninstall torch torchvision functorch (within the venv) and re-running ./webui.sh

152334H avatar Oct 31 '22 08:10 152334H

Same issue, on a fresh environment on a cloud-based machine. Edit: resolved by clearing up torch package version mismatches and reinstalling torch.

@rabidcopy Can you elaborate more on what you did to solve it? I have tried switching around torch versions 1.12-1.14 and have not managed to solve the issue.

edit: solved by running pip uninstall torch torchvision functorch (within the venv) and re-running ./webui.sh

LOL sup I'm from IRS too

Anyway, are you sure you resolved it or just unintentionally prevented it? If you do this and run webui.sh, it will install the CPU-only torch build, so obviously all your tensors will be on the CPU and you won't hit this issue. But then you cannot use the GPU at all.

aliencaocao avatar Oct 31 '22 09:10 aliencaocao

LOL sup I'm from IRS too

:wave:

Anyway, are you sure you resolved it or just unintentionally prevented it? If you do this and run webui.sh, it will install the CPU-only torch build, so obviously all your tensors will be on the CPU and you won't hit this issue. But then you cannot use the GPU at all.

I don't understand the code, but my GPU is currently running at 100% as I run textual inversion training, so I'm not sure.

152334H avatar Oct 31 '22 09:10 152334H

I don't understand the code, but my GPU is currently running at 100% as I run textual inversion training, so I'm not sure.

Oh, my mistake, I mixed it up with another script that only installs the CPU-only torch build. I think it is a CUDA version issue then: I am running torch built against CUDA 11.7, but if you install via webui.sh, it installs the build for CUDA 11.3.

aliencaocao avatar Oct 31 '22 09:10 aliencaocao

According to nvcc --version I am on CUDA 11.7 too (updated after attempting to use xformers), so I have no idea why anything is working on my end

152334H avatar Oct 31 '22 09:10 152334H

According to nvcc --version I am on CUDA 11.7 too (updated after attempting to use xformers), so I have no idea why anything is working on my end

Your pytorch package is built against CUDA 11.3, while your environment is 11.7; it works because of CUDA minor version compatibility. If you look at https://github.com/AUTOMATIC1111/stable-diffusion-webui/blob/d4790fa6db39e9f5960110326c5597b080d3b8dd/launch.py#L108 it installs torch+cu113, where cu113 means CUDA 11.3.

What I am referring to is that my pytorch is +cu117 (I had it installed before setting up this webui), so it is built against CUDA 11.7, which means this whole issue is more likely a pytorch-side one specific to the CUDA 11.7 build. It may also be an issue with torch 1.13, since it was just released yesterday and the stable diffusion repo this webui uses was last updated in August.
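
If anyone wants to double-check which build is actually in their webui venv, a quick diagnostic snippet (purely illustrative, run from the venv's Python, not part of the webui code) is:

import torch
print(torch.__version__)          # e.g. 1.12.1+cu113 vs 1.13.0+cu117
print(torch.version.cuda)         # CUDA version the wheel was built against
print(torch.cuda.is_available())  # False would mean a CPU-only build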

aliencaocao avatar Oct 31 '22 09:10 aliencaocao

I have CUDA 11.8 installed. I tried the pytorch stable +cu117 build, and when I start training it gives me 'RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cpu)'. Then I uninstalled pytorch, reinstalled with the cu113 build, and it worked. So I think it might be a problem with cu117 compatibility.

pessimo avatar Nov 10 '22 05:11 pessimo

Same error on an RX 6900 XT + ROCm 5.2. Fresh pull. Python 3.9. Reinstalling (with a whole new venv) didn't change a thing.

Might it be possible that part of the model/script/anything got compiled before the upgrade, and now the two parts are not compatible?

red1939 avatar Nov 13 '22 15:11 red1939

Happening to me right now. RX 6600. Just did a git pull.

SIGSTACKFAULT avatar Nov 14 '22 04:11 SIGSTACKFAULT

I manually did the t.cpu() change and it was fixed. If you look at the pytorch 1.13 GitHub, there are quite a few projects affected by this change. The bug is still in pytorch's court, but it seems they are aiming at fixing all the dependent projects (and keeping the warning).

red1939 avatar Nov 14 '22 07:11 red1939

For people who are looking for a quick solution and want more specific detail than just being told that a line to fix it was added somewhere, here is the change I made. It has the benefit that it does not force anything onto the CPU, for slightly better device residency.

diff --git a/repositories/stable-diffusion-stability-ai/ldm/models/diffusion/ddpm.py b/repositories/stable-diffusion-stability-ai/ldm/models/diffusion/ddpm.py
--- a/repositories/stable-diffusion-stability-ai/ldm/models/diffusion/ddpm.py
+++ b/repositories/stable-diffusion-stability-ai/ldm/models/diffusion/ddpm.py
@@ -900,6 +900,7 @@ class LatentDiffusion(DDPM):
         loss_simple = self.get_loss(model_output, target, mean=False).mean([1, 2, 3])
         loss_dict.update({f'{prefix}/loss_simple': loss_simple.mean()})
 
+        self.logvar = self.logvar.to(self.device)
         logvar_t = self.logvar[t].to(self.device)
         loss = loss_simple / torch.exp(logvar_t) + logvar_t
         # loss = loss_simple / torch.exp(self.logvar) + self.logvar
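
To confirm the patch actually took effect, a rough check (a hypothetical debug print, assuming shared.sd_model is the loaded LatentDiffusion instance as in the traceback above) is to compare devices after at least one training step, since the added assignment only runs inside p_losses():

from modules import shared                 # webui module that holds the loaded model
print(shared.sd_model.logvar.device)       # should report the model's device once p_losses() has run
print(shared.sd_model.device)              # device the model itself lives on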

glop102 avatar Nov 29 '22 20:11 glop102

I actually came across this exact suggested fix and it did not work; it still fails when it tries to place self.logvar on my GPU. Also, this is itself another 'line to fix added somewhere'.

aliencaocao avatar Nov 30 '22 01:11 aliencaocao

I just ran into this tonight on Colab. I can no longer train anything thanks to this error.

DarkAlchy avatar Dec 08 '22 05:12 DarkAlchy

I just ran into this tonight on Colab. I can no longer train anything thanks to this error.

Same issue here when training a Hypernetwork on Colab.

[screenshots of the error]

SorenTruelsen avatar Dec 08 '22 15:12 SorenTruelsen

I just ran into this tonight on Colab. I can no longer train anything thanks to this error.

Same issue here when training a Hypernetwork on Colab.

Yes, I ran into it at first trying to train an HN, then I decided to see if this was just HN (it was my first time), and TI as well as DB hit this error too.

DarkAlchy avatar Dec 08 '22 19:12 DarkAlchy

@glop102 This line fixed it for me

codefionn avatar Dec 08 '22 22:12 codefionn

@glop102 Worked for me on Colab as well.

Thank you.

edit: For Colab users, this is where glop102's code needs to be inserted -> /content/gdrive/MyDrive/sd/stablediffusion/ldm/models/diffusion/ddpm.py

DarkAlchy avatar Dec 10 '22 04:12 DarkAlchy

Closing as stale.

catboxanon avatar Aug 03 '23 15:08 catboxanon