stable-diffusion-webui
[Bug]: Huge VRAM consumption even with --lowvram
Is there an existing issue for this?
- [X] I have searched the existing issues and checked the recent builds/commits
What happened?
First of all, I have an RTX 2000A 12 GB and I've been reduced to using --lowvram instead of --medvram, and it still takes up all 12 GB of VRAM and even slows to such a crawl that I can get 60 seconds/iteration while making my computer hard to use. It's been happening for several weeks on the dev branch. It seems to get worse over time when I do batches, but even from the first image it takes up about 4 GB on a first 912x512 pass and 9 to 12 GB on the 1824x1024 hires fix step (txt2img). Even with more modest resolutions such as 512x640 upscaled to 1024x1280, my VRAM gets maxed out. I used to be able to inpaint at 1600x2048 with --medvram; now it gives me out-of-memory errors even at 1024x1024.
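For context on why the hires pass is so much heavier: naive self-attention materializes an N x N score matrix over the latent tokens, so doubling both image dimensions multiplies that matrix by 16. A rough back-of-envelope sketch (the 8x latent downscale is standard for SD 1.x; the 8 heads and fp16 element size are assumptions, not an exact accounting of webui's allocations):

```shell
# attention score matrix bytes ~= heads * N^2 * bytes_per_elem,
# where N = (width/8) * (height/8) latent tokens (8x VAE downscale).
# Assumptions: 8 heads, fp16 (2 bytes per element).
n_base=$(( (912 / 8) * (512 / 8) ))
n_hires=$(( (1824 / 8) * (1024 / 8) ))
base=$(( 8 * n_base * n_base * 2 ))
hires=$(( 8 * n_hires * n_hires * 2 ))
echo "first pass: $(( base / 1024 / 1024 )) MiB"
echo "hires pass: $(( hires / 1024 / 1024 )) MiB ($(( hires / base ))x larger)"
```

That lands around 0.8 GiB for the 912x512 pass and about 13 GiB for the 1824x1024 pass, the same order of magnitude as the numbers reported here; the memory-efficient attention backends avoid materializing this matrix at all.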
Steps to reproduce the problem
- Use --lowvram or --medvram
- Prompt using hires fix at 2x
- Set Batch count to 100
- Monitor your VRAM usage in Task Manager
What should have happened?
I just tested with the last dev branch commit from May, same problem. Then I tested with the last dev commit from April, and the problem is gone: instead of taking 10 GB of VRAM it only takes 3 GB for the exact same thing. So something went wrong in May.
Version or Commit where the problem happens
https://github.com/AUTOMATIC1111/stable-diffusion-webui/commit/0bf09c30c608ebe63fb154bd197e4baad978de63
What Python version are you running on ?
Python 3.10.x
What platforms do you use to access the UI ?
Windows
What device are you running WebUI on?
Nvidia GPUs (RTX 20 above)
Cross attention optimization
Automatic
What browsers do you use to access the UI ?
Google Chrome
Command Line Arguments
--lowvram --opt-split-attention --skip-torch-cuda-test --opt-sub-quad-attention
List of extensions
No
Console logs
venv "C:\msys\home\sd\sd\venv\Scripts\Python.exe"
Python 3.10.6 (tags/v3.10.6:9c7b4bd, Aug 1 2022, 21:53:49) [MSC v.1932 64 bit (AMD64)]
Version: v1.4.0-48-gfab73f2e
Commit hash: fab73f2e7d388ca99cdb3d5de7f36c0b9a1a3b1c
Installing requirements
Launching Web UI with arguments: --lowvram --opt-split-attention --skip-torch-cuda-test --opt-sub-quad-attention
No module 'xformers'. Proceeding without it.
ControlNet v1.1.125
ControlNet v1.1.125
Loading weights [03df69045a] from C:\msys\home\sd\sd\models\Stable-diffusion\stablydiffuseds_26.safetensors
preload_extensions_git_metadata for 9 extensions took 0.45s
Running on local URL: http://127.0.0.1:7860
To create a public link, set `share=True` in `launch()`.
Startup time: 12.1s (import torch: 3.4s, import gradio: 1.8s, import ldm: 0.8s, other imports: 1.6s, setup codeformer: 0.1s, load scripts: 2.8s, create ui: 1.0s, gradio launch: 0.5s).
Creating model from config: C:\msys\home\sd\sd\configs\v1-inference.yaml
LatentDiffusion: Running in eps-prediction mode
DiffusionWrapper has 859.52 M params.
Loading VAE weights specified in settings: C:\msys\home\sd\sd\models\VAE\vae-ft-mse-840000-ema-pruned.safetensors
Applying attention optimization: Doggettx... done.
Textual inversion embeddings loaded(0):
Model loaded in 6.2s (load weights from disk: 2.1s, create model: 0.7s, apply weights to model: 1.4s, apply half(): 1.3s, load VAE: 0.3s, calculate empty prompt: 0.4s).
100%|██████████████████████████████████████████████████████████████████████████████████| 32/32 [01:51<00:00, 3.47s/it]
44%|████████████████████████████████████▎ | 7/16 [01:24<01:38, 10.99s/it]
Total progress: 1%|▍ | 39/4800 [03:23<13:59:13, 10.58s/it]
Additional information
No response
I don't see any problem, that's how --lowvram works. Btw your model stablydiffuseds_26.safetensors is around 7.6 GB, plus vae-ft-mse-840000-ema-pruned.safetensors at 0.4 GB, and you use hires fix at 2x, so what do you expect? 😅
Try setting only COMMANDLINE_ARGS=--opt-split-attention --skip-torch-cuda-test
And try different models around 2 GB or 4 GB.
You think --lowvram works by taking up 12 GB of VRAM?! I urge you to re-read my post in its entirety; you clearly misread it. You will find that I compared the current dev commit with the last dev commit from May and the last dev commit from April, and only the one from April took up a reasonable amount of VRAM, about half (or less) as much as the later commits.
So you can try using git bisect to find which commit introduced this memory issue. What are your NVIDIA driver and PyTorch versions? Does the error persist with xformers? What upscaler is used in hires fix? Have you tried deleting all JSON configs?
Also, you should pay attention to the CHANGELOG or other issues like https://github.com/AUTOMATIC1111/stable-diffusion-webui/discussions/10893#discussioncomment-6058130 because there may be an answer to your problem.
preload_extensions_git_metadata for 9 extensions took 0.45s
Btw, you said you don't have extensions, but the logs show that you have ControlNet and something else, since 7 of those 9 are built-in.
Apparently this is a problem with Doggettx being enabled by default since May 18th, see #10763. I don't know why that issue was closed since the problem remains.
@Photosounder did you try using xformers instead?
Since the problem seems to be "Cross attention optimization", I'll focus on testing that. It was set to Automatic, which is Doggettx.
python.exe dedicated GPU memory during the hires fix phase at 1824x1024, and iteration speed during that phase only:
- xformers: 3,668,996 kB, 2.05 s/it
- Doggettx: 8,211,496 kB, 5.66 s/it
- sdp: 3,712,008 kB, 2.77 s/it
- sdp-no-mem: 3,656,708 kB, 2.23 s/it
- sub-quadratic: 5,086,224 kB, 7.76 s/it
- V1: OOM
- InvokeAI: 8,184,876 kB, 5.69 s/it
- None: OOM, takes 8,000,524 kB just for the 912x512 phase
So clearly my problem is that since May 18th it defaults to Doggettx, which is one of the less acceptable options (the April 30th commit I was just using was clearly using sub-quadratic). I think that for the Automatic default, the order of preference (first available option) should be changed to:
- sdp / sdp-no-mem
- xformers
- sub-quadratic
- InvokeAI
- Doggettx
- V1
- None
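That first-available preference can be sketched as a tiny function. This is hypothetical, not webui's actual code: the "available" arguments stand in for whatever runtime checks webui performs (e.g. whether xformers is importable); it only illustrates the proposed ordering:

```shell
# return the first option from the proposed preference order that
# appears among the space-separated "available" arguments
pick_attention() {
  for name in sdp sdp-no-mem xformers sub-quadratic InvokeAI Doggettx V1 None; do
    case " $* " in
      *" $name "*) echo "$name"; return 0;;
    esac
  done
  echo "None"
}

pick_attention Doggettx sub-quadratic V1   # prints: sub-quadratic
pick_attention xformers Doggettx           # prints: xformers
```

With this ordering, a setup where only Doggettx and sub-quadratic are available would get sub-quadratic, matching the April behavior reported above.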
Btw, has xformers become deterministic? I did the same prompt three times and got the exact same image every time, which is surprising. If so, then maybe it should be the first choice (but since it has to be enabled on the command line anyway, it doesn't matter).
These settings helped me fix my VRAM problems on a 3070:
--xformers
--opt-sdp-no-mem-attention
--opt-channelslast
--upcast-sampling
--no-half-vae
PYTORCH_CUDA_ALLOC_CONF=backend:cudaMallocAsync
That, and using Tiled Diffusion and Tiled VAE.
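For anyone wanting to try these on Windows, they would typically go in webui-user.bat, something like the sketch below. The flag set is taken from the comment above; note that PYTORCH_CUDA_ALLOC_CONF is an environment variable, not a command-line argument:

```
@echo off
set PYTORCH_CUDA_ALLOC_CONF=backend:cudaMallocAsync
set COMMANDLINE_ARGS=--xformers --opt-sdp-no-mem-attention --opt-channelslast --upcast-sampling --no-half-vae
call webui.bat
```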
Mine is a 3070 as well. It generates 2 images and then CUDA out of memory. At first it took on average 6 hours to do what I was doing; now it takes 20+. I've had a lot of issues, so: COMMANDLINE_ARGS: --xformers --opt-sdp-no-mem-attention --opt-channelslast --upcast-sampling --no-half-vae PYTORCH_CUDA_ALLOC_CONF=backend:cudaMallocAsync ?
I'm afraid to try it because I had to reinstall Stable Diffusion a few times when I changed something in the web UI.
Yep, it crashed.
Btw, has xformers become deterministic? I did the same prompt three times and got the exact same image every time, which is surprising. If so, then maybe it should be the first choice (but since it has to be enabled on the command line anyway, it doesn't matter).
iirc, xformers is deterministic when using --opt-sdp-no-mem-attention