
[Bug]: Image generation never starts (Linux + ROCm, possibly specific to RX 5000 series)

Open cyatarow opened this issue 1 year ago • 37 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues and checked the recent builds/commits

What happened?

I have done a fresh install of v1.3.0, but image generation never starts, even many minutes after pressing the "Generate" button.

Steps to reproduce the problem

  1. Launch the UI with webui.sh
  2. Go to http://127.0.0.1:7860 with a browser
  3. Press "Generate" for any prompt or model

What should have happened?

Image generation should have started.

Commit where the problem happens

20ae71faa8ef035c31aa3a410b707d792c8203a3

What Python version are you running on ?

Python 3.10.x

What platforms do you use to access the UI ?

Linux

What device are you running WebUI on?

AMD GPUs (RX 5000 below)

What browsers do you use to access the UI ?

Mozilla Firefox

Command Line Arguments

`--ckpt-dir` and `--vae-dir`
I keep my model files on external storage.

List of extensions

(None)

Console logs

################################################################
Install script for stable-diffusion + Web UI
Tested on Debian 11 (Bullseye)
################################################################

################################################################
Running on sd-amd user
################################################################

################################################################
Repo already cloned, using it as install directory
################################################################

################################################################
Create and activate python venv
################################################################

################################################################
Launching launch.py...
################################################################
Using TCMalloc: libtcmalloc.so.4
Python 3.10.6 (main, Mar 10 2023, 10:55:28) [GCC 11.3.0]
Version: v1.3.0
Commit hash: 20ae71faa8ef035c31aa3a410b707d792c8203a3
Installing torch and torchvision
Looking in indexes: https://download.pytorch.org/whl/rocm5.4.2
Collecting torch==2.0.1+rocm5.4.2
  Using cached https://download.pytorch.org/whl/rocm5.4.2/torch-2.0.1%2Brocm5.4.2-cp310-cp310-linux_x86_64.whl (1536.4 MB)
Collecting torchvision==0.15.2+rocm5.4.2
  Using cached https://download.pytorch.org/whl/rocm5.4.2/torchvision-0.15.2%2Brocm5.4.2-cp310-cp310-linux_x86_64.whl (62.4 MB)
Collecting filelock
  Using cached https://download.pytorch.org/whl/filelock-3.9.0-py3-none-any.whl (9.7 kB)
Collecting networkx
  Using cached https://download.pytorch.org/whl/networkx-3.0-py3-none-any.whl (2.0 MB)
Collecting sympy
  Using cached https://download.pytorch.org/whl/sympy-1.11.1-py3-none-any.whl (6.5 MB)
Collecting pytorch-triton-rocm<2.1,>=2.0.0
  Using cached https://download.pytorch.org/whl/pytorch_triton_rocm-2.0.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (78.4 MB)
Collecting jinja2
  Using cached https://download.pytorch.org/whl/Jinja2-3.1.2-py3-none-any.whl (133 kB)
Collecting typing-extensions
  Using cached https://download.pytorch.org/whl/typing_extensions-4.4.0-py3-none-any.whl (26 kB)
Collecting requests
  Using cached https://download.pytorch.org/whl/requests-2.28.1-py3-none-any.whl (62 kB)
Collecting pillow!=8.3.*,>=5.3.0
  Using cached https://download.pytorch.org/whl/Pillow-9.3.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.2 MB)
Collecting numpy
  Using cached https://download.pytorch.org/whl/numpy-1.24.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17.3 MB)
Collecting cmake
  Using cached https://download.pytorch.org/whl/cmake-3.25.0-py2.py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (23.7 MB)
Collecting lit
  Using cached https://download.pytorch.org/whl/lit-15.0.7.tar.gz (132 kB)
  Preparing metadata (setup.py) ... done
Collecting MarkupSafe>=2.0
  Using cached https://download.pytorch.org/whl/MarkupSafe-2.1.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (25 kB)
Collecting certifi>=2017.4.17
  Using cached https://download.pytorch.org/whl/certifi-2022.12.7-py3-none-any.whl (155 kB)
Collecting idna<4,>=2.5
  Using cached https://download.pytorch.org/whl/idna-3.4-py3-none-any.whl (61 kB)
Collecting urllib3<1.27,>=1.21.1
  Using cached https://download.pytorch.org/whl/urllib3-1.26.13-py2.py3-none-any.whl (140 kB)
Collecting charset-normalizer<3,>=2
  Using cached https://download.pytorch.org/whl/charset_normalizer-2.1.1-py3-none-any.whl (39 kB)
Collecting mpmath>=0.19
  Using cached https://download.pytorch.org/whl/mpmath-1.2.1-py3-none-any.whl (532 kB)
Using legacy 'setup.py install' for lit, since package 'wheel' is not installed.
Installing collected packages: mpmath, lit, cmake, urllib3, typing-extensions, sympy, pillow, numpy, networkx, MarkupSafe, idna, filelock, charset-normalizer, certifi, requests, jinja2, pytorch-triton-rocm, torch, torchvision
  Running setup.py install for lit ... done
Successfully installed MarkupSafe-2.1.2 certifi-2022.12.7 charset-normalizer-2.1.1 cmake-3.25.0 filelock-3.9.0 idna-3.4 jinja2-3.1.2 lit-15.0.7 mpmath-1.2.1 networkx-3.0 numpy-1.24.1 pillow-9.3.0 pytorch-triton-rocm-2.0.1 requests-2.28.1 sympy-1.11.1 torch-2.0.1+rocm5.4.2 torchvision-0.15.2+rocm5.4.2 typing-extensions-4.4.0 urllib3-1.26.13
Installing gfpgan
Installing clip
Installing open_clip
Cloning Stable Diffusion into /home/sd-amd/sd-ui-130/stable-diffusion-webui/repositories/stable-diffusion-stability-ai...
Cloning Taming Transformers into /home/sd-amd/sd-ui-130/stable-diffusion-webui/repositories/taming-transformers...
Cloning K-diffusion into /home/sd-amd/sd-ui-130/stable-diffusion-webui/repositories/k-diffusion...
Cloning CodeFormer into /home/sd-amd/sd-ui-130/stable-diffusion-webui/repositories/CodeFormer...
Cloning BLIP into /home/sd-amd/sd-ui-130/stable-diffusion-webui/repositories/BLIP...
Installing requirements for CodeFormer
Installing requirements
Launching Web UI with arguments: --ckpt-dir /mnt/W20/Stable_Diffusion/MODEL --vae-dir /mnt/W20/Stable_Diffusion/VAE
No module 'xformers'. Proceeding without it.
Calculating sha256 for /mnt/W20/Stable_Diffusion/MODEL/AnythingV5_v5PrtRE.safetensors: Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.
Startup time: 2.7s (import torch: 0.5s, import gradio: 0.6s, import ldm: 0.6s, other imports: 0.4s, load scripts: 0.3s, create ui: 0.2s).
7f96a1a9ca9b3a3242a9ae95d19284f0d2da8d5282b42d2d974398bf7663a252
Loading weights [7f96a1a9ca] from /mnt/W20/Stable_Diffusion/MODEL/AnythingV5_v5PrtRE.safetensors
Creating model from config: /home/sd-amd/sd-ui-130/stable-diffusion-webui/configs/v1-inference.yaml
LatentDiffusion: Running in eps-prediction mode
DiffusionWrapper has 859.52 M params.
Applying optimization: sdp-no-mem... done.
Textual inversion embeddings loaded(0): 
Model loaded in 4.0s (calculate hash: 2.7s, create model: 0.3s, apply weights to model: 0.3s, apply half(): 0.2s, load VAE: 0.1s, move model to device: 0.2s).

Additional information

My environment:

  • OS: Ubuntu 22.04.2
  • CPU: Intel Core i3-12100
  • GPU: AMD Radeon RX 5500 XT (8GB)

cyatarow avatar May 30 '23 09:05 cyatarow

I have exactly the same issue, used to work perfectly before.

Like you say, it just sits there and doesn't do anything, no errors anywhere.

I've uninstalled/reinstalled everything and tried various different combinations, no good.

Previously I would get the classic "MIOpen(HIP): Warning [SQLiteBase] Missing system database file: gfx1030_40.kdb Performance may degrade." warning, but after about a minute it would start and then run correctly. Now I don't get that warning, suggesting that might be the point where it falters.

I'm using an AMD Radeon RX 5700 XT (8GB), Ryzen 3700 CPU, Arch Linux. So similar to you but not exactly the same.

Fingers crossed somebody can suggest something! Previously on this system I've had SD working well through all the updates from September last year to a couple of weeks ago.

olinorwell avatar May 31 '23 10:05 olinorwell

Same issue, no errors, just not generating anything AMD Radeon RX 5700 XT, Ryzen 3600, Manjaro, kernel 6.3.4-2

HoCoK31 avatar May 31 '23 12:05 HoCoK31

Could it be this problem is specific to RX 5000 series?

cyatarow avatar May 31 '23 12:05 cyatarow

I fear it might be related to the fact that the 5000 series wasn't supposed to work originally, but then we got a workaround to do with 'fooling something' into believing it was a different chip, after which it worked. Perhaps that trick isn't working now, and it's just unable to function. There must be many others in the same situation out there. Hopefully they will all comment on this post.

olinorwell avatar May 31 '23 13:05 olinorwell

To confirm to anyone trying to help - at least in my case it used to immediately give the warning: "MIOpen(HIP): Warning [SQLiteBase] Missing system database file: gfx1030_40.kdb Performance may degrade."

This no longer happens. So whatever is different happens after the Generate button is hit, and before the warning would be output.

[Edit: Additionally, I ran the tests for PyTorch found here - https://pytorch.org/get-started/locally/ - suggesting that PyTorch ROCm is working as expected]

[Edit 2: Not sure if it's useful to know, but I did recently install OpenCL on my machine; I was reading that the OpenCL and HIP backends are potentially not compatible side by side when using ROCm. I don't fully understand all of this, but my gut feeling is it could be something to do with that - but then, maybe others haven't recently installed OpenCL]

olinorwell avatar May 31 '23 13:05 olinorwell

In fact, inspired by this PR, I had tried the dev branch shortly before v1.3.0 was released. But the result was the same...

The participants in the PR were only RX 6000 users, and I think the merge was pushed through without proper verification on the 5000 series.

cyatarow avatar May 31 '23 13:05 cyatarow

I agree, I fear that change is what has broken it for RX 5000 users. According to that PR, it was needed because the old versions were no longer available on the pytorch repos. I wonder if they are still available elsewhere. I fear we're going to need the 1.13 version again, avoiding the 2.0 version, which doesn't appear to work. It's at times like this that I really get mad at myself for updating anything! It was all working so well.

olinorwell avatar May 31 '23 14:05 olinorwell

But I have the exact same issue on the 6600M (gfx1031?) with a Ryzen 7 5800H. Without --medvram it doesn't proceed past "Applying optimization: sdp-no-mem... done." With it, the model loads, but nothing generates and nothing else happens in the terminal.

VekuDazo avatar May 31 '23 16:05 VekuDazo

Same here (RX 5700) with ROCm 5.5. The only solution for now is to force a downgrade to torch 1.13.1:

pip install torch==1.13.1 torchvision==0.14.1 --index-url https://download.pytorch.org/whl/rocm5.2

Has anyone tried a torch 2.0 build for ROCm 5.5? For now the newest one in nightly is still 5.4.2: https://download.pytorch.org/whl/nightly/torch/

ethragur avatar May 31 '23 18:05 ethragur

Even force-downgrading was failing for me; the instructions I had included a '+rocm' suffix next to the package versions. When I tried without it, pip appeared to download the Nvidia versions.

What would be the way to try the 5.5 version? I can try that now.

olinorwell avatar May 31 '23 18:05 olinorwell

What would be the way to try the 5.5 version? I can try that now.

You would have to build pytorch yourself against ROCm 5.5. Maybe something like #9591; the Docker image they use doesn't exist anymore, but one from the official pytorch Docker repo could still work (https://hub.docker.com/r/rocm/pytorch/tags):

rocm/pytorch:rocm5.5_ubuntu20.04_py3.8_pytorch_staging

But I'm not really sure that would make it work; even if we were able to compile it, maybe something in the new pytorch version just doesn't work with RX 5x00 graphics cards.

Even force downgrading was failing for me, I had instructions that had a '+rocm' next to the package versions? When I tried without it appeared to download the Nvidia versions.

Maybe you had '--extra-index-url' instead of '--index-url'. You could also just go into your venv directory (stable-diffusion-webui/venv/lib/python3.10/site-packages) and delete torch & torchvision. Afterwards you should just be able to use my pip install command.

Additionally, I added export TORCH_COMMAND="pip install torch==1.13.1 torchvision==0.14.1 --index-url https://download.pytorch.org/whl/rocm5.2" to my webui-user.sh, and I started the webui with ./webui.sh

ethragur avatar May 31 '23 18:05 ethragur

(venv) [oli@ARCH-RYZEN stable-diffusion-webui]$ pip install torch==1.13.1 torchvision==0.14.1 --index-url https://download.pytorch.org/whl/rocm5.2
Looking in indexes: https://download.pytorch.org/whl/rocm5.2
ERROR: Could not find a version that satisfies the requirement torch==1.13.1 (from versions: none)
ERROR: No matching distribution found for torch==1.13.1

I wonder if the fact they bumped the Python version up to 3.11 makes a difference? I see you were running 3.10.

olinorwell avatar May 31 '23 19:05 olinorwell

I wonder if the fact they bumped the Python version up to 3.11 makes a difference? I see you were running 3.10.

https://download.pytorch.org/whl/rocm5.2/torch/ - it looks like it; pytorch seems to only have builds for 3.10

ethragur avatar May 31 '23 19:05 ethragur

I'm retrying now with 3.10. Fingers crossed.

olinorwell avatar May 31 '23 19:05 olinorwell

Otherwise you could try to download the .whl file and just install it directly with pip:

pip install /path/to/file.whl

ethragur avatar May 31 '23 19:05 ethragur

Success! @ethragur is the hero, his solution has worked for me. I'm now running v1.3.0 of A1111 on my 5700XT.

My solution was this - ensure you have Python 3.10 and edit the webui.sh file to make sure it uses Python 3.10.

Run webui.sh, let it create the venv etc., and then fail to create an image.

Run: source venv/bin/activate

Then run (thanks to @ethragur) pip install torch==1.13.1 torchvision==0.14.1 --index-url https://download.pytorch.org/whl/rocm5.2

Now restart webui.sh, and this time image generation will succeed; you'll see at the bottom of A1111 that the version number says "torch: 1.13.1+rocm5.2".

Hopefully what has worked for me will work for others too, thanks again to @ethragur for the help - I was getting very down at not having SD to play with!
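That "+rocm" local tag in the footer string is the quickest way to confirm you got the ROCm wheel rather than a CUDA or CPU build (the trap with the Nvidia wheels mentioned earlier in the thread). A minimal sketch of such a check - the helper name is mine, not part of webui:

```python
# Classify a torch version string by its local build tag, e.g.
# "1.13.1+rocm5.2" -> "rocm", "2.0.1+cu118" -> "cuda", "2.0.1" -> "cpu".
def torch_backend(version: str) -> str:
    _, _, local = version.partition("+")
    if local.startswith("rocm"):
        return "rocm"
    if local.startswith("cu"):
        return "cuda"
    return "cpu"
```

In a live venv you would pass it `torch.__version__`.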

olinorwell avatar May 31 '23 19:05 olinorwell

Perfect, good to hear that it works again. Hopefully some future build of pytorch will work again with the RX 5000 series; otherwise we'll be stuck on this version forever :cry:. From what I've seen, 2.0 should give some performance improvements.

I'll try building the new version in a Docker container, and if it works I'll upload the .whl file somewhere. But I don't have high hopes. Maybe there is some way to get more debug information out of pytorch to see where it gets stuck.

ethragur avatar May 31 '23 19:05 ethragur

Have any contributors noticed this issue?

cyatarow avatar Jun 02 '23 16:06 cyatarow

v1.3.1, released yesterday, doesn't seem to have this fix... too bad.

cyatarow avatar Jun 03 '23 01:06 cyatarow

@AUTOMATIC1111 please don't ignore us...

cyatarow avatar Jun 05 '23 15:06 cyatarow

Same issue, 5700 XT, both on torch 1.13.1 and 2.0. Oddly enough, I just borrowed this card today from a friend and managed to get a single gen in before this bug occurred

EDIT: It started generating the entire prompt in a couple seconds, after waiting for 2 minutes. After that incident, my system became really sluggish. Prompts were generating again, but the speed was inconsistent

magusman52 avatar Jun 05 '23 20:06 magusman52

Same issue, 5700 XT, both on torch 1.13.1 and 2.0. Oddly enough, I just borrowed this card today from a friend and managed to get a single gen in before this bug occurred

EDIT: It started generating the entire prompt in a couple seconds, after waiting for 2 minutes. After that incident, my system became really sluggish. Prompts were generating again, but the speed was inconsistent

Is this Windows or Linux?

For me it was cut and dried: torch 2.0 doesn't work, torch 1.13.1 does. Perhaps check versions, etc.? I always have a one-minute delay before generations begin each time, but that's been like that since the beginning, and after it's done what it needs to do I don't experience problems afterwards.

olinorwell avatar Jun 05 '23 21:06 olinorwell

Same issue, 5700 XT, both on torch 1.13.1 and 2.0. Oddly enough, I just borrowed this card today from a friend and managed to get a single gen in before this bug occurred. EDIT: It started generating the entire prompt in a couple seconds, after waiting for 2 minutes. After that incident, my system became really sluggish. Prompts were generating again, but the speed was inconsistent

Is this Windows or Linux?

For me it was cut and dried: torch 2.0 doesn't work, torch 1.13.1 does. Perhaps check versions, etc.? I always have a one-minute delay before generations begin each time, but that's been like that since the beginning, and after it's done what it needs to do I don't experience problems afterwards.

I'm on Ubuntu 22.04. And yes, it occurs with both versions of torch. The prompt loads for a minute or two, the first 90% of the gen gets done in a couple seconds, it gets stuck at 97% again for a while, and then finishes the prompt. Also my system seems to get really unstable after prompting, as if it's about to crash or black-screen. Quite odd.

EDIT: Tested again, now it only occurs on torch 2.0. Works alright on 1.13.1 besides the initial lag.

magusman52 avatar Jun 05 '23 21:06 magusman52

I made a PR to force pytorch 1.13.1 for RX 5000 cards; it also checks for Python <= 3.10. Not a definitive fix, but maybe it can help other users.

https://github.com/AUTOMATIC1111/stable-diffusion-webui/pull/11048
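The gist of that gating can be sketched like this. This is illustrative only - the function and constant names are mine, not the PR's actual code; the wheel commands are the ones quoted earlier in the thread:

```python
import sys

# pip commands quoted earlier in this thread
TORCH_113_ROCM52 = ("pip install torch==1.13.1 torchvision==0.14.1 "
                    "--index-url https://download.pytorch.org/whl/rocm5.2")
TORCH_201_ROCM542 = ("pip install torch==2.0.1 torchvision==0.15.2 "
                     "--index-url https://download.pytorch.org/whl/rocm5.4.2")

def pick_torch_command(gpu_name: str, py=sys.version_info) -> str:
    """Pin torch 1.13.1 for Navi 1x (RX 5000) cards; torch 2.x otherwise."""
    navi1x = any(m in gpu_name for m in ("RX 55", "RX 56", "RX 57"))
    if navi1x:
        if tuple(py[:2]) >= (3, 11):
            # the rocm5.2 wheels only exist for cp310 and below
            raise RuntimeError("torch 1.13.1+rocm5.2 wheels require Python <= 3.10")
        return TORCH_113_ROCM52
    return TORCH_201_ROCM542
```

The card-name matching here is a crude stand-in; the real script would detect the GPU via lspci or similar.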

DGdev91 avatar Jun 06 '23 22:06 DGdev91

But still, why is only RX 5000 series soooo incompatible with torch 2.0??

cyatarow avatar Jun 06 '23 23:06 cyatarow

But still, why is only RX 5000 series soooo incompatible with torch 2.0??

That's a good question. My first guess is that we need to force HSA_OVERRIDE_GFX_VERSION to make it work, but that's also true for RX 6000, which is working just fine.

Sooo.... Who knows.

We can't even be really sure it's just RX 5000; maybe there are other series which have problems but no one has reported them yet

DGdev91 avatar Jun 07 '23 08:06 DGdev91

HSA_OVERRIDE_GFX_VERSION is already forced in the script for those cards, though - it was set correctly for me even when things weren't working. Perhaps torch 2.0 needs a further workaround or something.

I just hope code doesn't slip into the repo that's only torch 2.0 compatible, then we're in trouble.
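For reference, the spirit of that workaround can be sketched as follows. The gfx-ID-to-override mapping below is my reading of the common community advice (spoofing Navi 1x parts as gfx1030), not webui.sh's actual logic:

```python
import os

# Community workaround: spoof unsupported Navi 1x parts as gfx1030 so the
# ROCm runtime loads kernels for them. This mapping is an assumption,
# not webui.sh's actual table.
NAVI1X_OVERRIDES = {
    "gfx1010": "10.3.0",  # e.g. RX 5600/5700 series
    "gfx1012": "10.3.0",  # e.g. RX 5500 series
}

def apply_hsa_override(gfx_arch: str, env=None):
    """Set HSA_OVERRIDE_GFX_VERSION unless the user has already set it."""
    env = os.environ if env is None else env
    override = NAVI1X_OVERRIDES.get(gfx_arch)
    if override and "HSA_OVERRIDE_GFX_VERSION" not in env:
        env["HSA_OVERRIDE_GFX_VERSION"] = override
    return override
```

Note it respects an override the user already exported, which matches the "it was set correctly for me" observation above: the variable being present doesn't by itself mean torch 2.0 will work.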

olinorwell avatar Jun 07 '23 08:06 olinorwell

But still, why is only RX 5000 series soooo incompatible with torch 2.0??

That's a good question. My first guess is that we need to force HSA_OVERRIDE_GFX_VERSION to make it work, but that's also true for RX 6000, which is working just fine.

Sooo.... Who knows.

HSA_OVERRIDE_GFX_VERSION has already been enabled by default in webui.sh for a couple of releases, I think.

We can't even be really sure it's just RX 5000; maybe there are other series which have problems but no one has reported them yet

Before this card, I ran SD on an RX 580 4GB, which was a nightmare to get running. It didn't have this specific issue, but plenty of other problems that all boiled down to ROCm support.

magusman52 avatar Jun 07 '23 08:06 magusman52

HSA_OVERRIDE_GFX_VERSION is already forced though in the script for those cards - it was set correctly for me even when things weren't working. Perhaps Torch v2.0 needs a further workaround or something.

Yes, exactly. What I meant was that my first guess was the HSA_OVERRIDE_GFX_VERSION causing problems, but that can't be it, because the 6000 series also uses it without issues.

DGdev91 avatar Jun 07 '23 09:06 DGdev91

Just out of curiosity, would there be any significant increase in performance on torch 2.0? It would be interesting to see someone on torch 2.0 with a 5700 XT upload a benchmark to compare with 1.13.1.

magusman52 avatar Jun 08 '23 03:06 magusman52