
[Bug]: Huge VRAM consumption even with --lowvram

Open Photosounder opened this issue 1 year ago • 7 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues and checked the recent builds/commits

What happened?

First of all, I have an RTX 2000A 12 GB, and I've been reduced to using --lowvram instead of --medvram. Even so, it still takes up all 12 GB of VRAM and slows to such a crawl that I can hit 60 seconds/iteration while making my computer hard to use. This has been happening for several weeks on the dev branch. It seems to get worse over time when I run batches, but even the first image takes about 4 GB on the initial 912x512 pass and 9 to 12 GB on the 1824x1024 hires fix step (txt2img). Even at more modest resolutions such as 512x640 upscaled to 1024x1280, my VRAM gets maxed out. I used to be able to inpaint at 1600x2048 with --medvram; now I get out-of-memory errors even at 1024x1024.

Steps to reproduce the problem

  • Use --lowvram or --medvram
  • Prompt using hires fix at 2x
  • Set Batch count to 100
  • Monitor your VRAM usage in Task Manager (or with nvidia-smi, as sketched below)
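
If Task Manager is inconvenient, nvidia-smi (installed with the NVIDIA driver) can poll VRAM usage from a console instead; a minimal example, assuming nvidia-smi is on your PATH:

nvidia-smi --query-gpu=memory.used,memory.total --format=csv -l 1

This prints the used and total GPU memory once per second until you interrupt it.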

What should have happened?

I just tested with the last dev branch commit from May: same problem. Then I tested with the last dev commit from April and the problem is gone; instead of taking 10 GB of VRAM, it only takes 3 GB for the exact same thing. So something went wrong in May.

Version or Commit where the problem happens

https://github.com/AUTOMATIC1111/stable-diffusion-webui/commit/0bf09c30c608ebe63fb154bd197e4baad978de63

What Python version are you running on ?

Python 3.10.x

What platforms do you use to access the UI ?

Windows

What device are you running WebUI on?

Nvidia GPUs (RTX 20 above)

Cross attention optimization

Automatic

What browsers do you use to access the UI ?

Google Chrome

Command Line Arguments

--lowvram --opt-split-attention --skip-torch-cuda-test --opt-sub-quad-attention

List of extensions

No

Console logs

venv "C:\msys\home\sd\sd\venv\Scripts\Python.exe"
Python 3.10.6 (tags/v3.10.6:9c7b4bd, Aug  1 2022, 21:53:49) [MSC v.1932 64 bit (AMD64)]
Version: v1.4.0-48-gfab73f2e
Commit hash: fab73f2e7d388ca99cdb3d5de7f36c0b9a1a3b1c
Installing requirements

Launching Web UI with arguments: --lowvram --opt-split-attention --skip-torch-cuda-test --opt-sub-quad-attention
No module 'xformers'. Proceeding without it.
ControlNet v1.1.125
ControlNet v1.1.125
Loading weights [03df69045a] from C:\msys\home\sd\sd\models\Stable-diffusion\stablydiffuseds_26.safetensors
preload_extensions_git_metadata for 9 extensions took 0.45s
Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.
Startup time: 12.1s (import torch: 3.4s, import gradio: 1.8s, import ldm: 0.8s, other imports: 1.6s, setup codeformer: 0.1s, load scripts: 2.8s, create ui: 1.0s, gradio launch: 0.5s).
Creating model from config: C:\msys\home\sd\sd\configs\v1-inference.yaml
LatentDiffusion: Running in eps-prediction mode
DiffusionWrapper has 859.52 M params.
Loading VAE weights specified in settings: C:\msys\home\sd\sd\models\VAE\vae-ft-mse-840000-ema-pruned.safetensors
Applying attention optimization: Doggettx... done.
Textual inversion embeddings loaded(0):
Model loaded in 6.2s (load weights from disk: 2.1s, create model: 0.7s, apply weights to model: 1.4s, apply half(): 1.3s, load VAE: 0.3s, calculate empty prompt: 0.4s).
100%|██████████████████████████████████████████████████████████████████████████████████| 32/32 [01:51<00:00,  3.47s/it]
 44%|████████████████████████████████████▎                                              | 7/16 [01:24<01:38, 10.99s/it]
Total progress:   1%|▍                                                            | 39/4800 [03:23<13:59:13, 10.58s/it]

Additional information

No response

Photosounder avatar Jun 28 '23 16:06 Photosounder

I don't see any problem, that's how --lowvram works. Btw, your model stablydiffuseds_26.safetensors is around 7.6 GB, plus vae-ft-mse-840000-ema-pruned.safetensors at 0.4 GB, and you use hires fix at 2x, so what do you expect? 😅

Try setting only COMMANDLINE_ARGS=--opt-split-attention --skip-torch-cuda-test and try different models around 2 GB or 4 GB.

chrme avatar Jun 28 '23 19:06 chrme

I don't see any problem, that's how --lowvram works. Btw, your model stablydiffuseds_26.safetensors is around 7.6 GB, plus vae-ft-mse-840000-ema-pruned.safetensors at 0.4 GB, and you use hires fix at 2x, so what do you expect? 😅

Try setting only COMMANDLINE_ARGS=--opt-split-attention --skip-torch-cuda-test and try different models around 2 GB or 4 GB.

You think --lowvram works by taking up 12 GB of VRAM?!? I urge you to re-read my post in its entirety; you clearly misread it. You will find that I compared the current dev commit, the last dev commit from May, and the last dev commit from April, and only the one from April took up a reasonable amount of VRAM, about half (or less) as much as the later commits.

Photosounder avatar Jun 28 '23 19:06 Photosounder

You will find that I compared the current dev commit, the last dev commit from May, and the last dev commit from April, and only the one from April took up a reasonable amount of VRAM, about half (or less) as much as the later commits.

So you can try using git bisect to find which commit is causing this memory issue. What are your NVIDIA driver and PyTorch versions? Does the error persist with xformers? Which upscaler is used in hires fix? Have you tried deleting all the JSON configs?
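
For reference, a rough git bisect sketch using the commits already mentioned in this issue, where <last-april-dev-commit> is a placeholder for whichever April dev commit tested fine:

git bisect start
git bisect bad 0bf09c30c608ebe63fb154bd197e4baad978de63
git bisect good <last-april-dev-commit>

At each commit git checks out, relaunch the webui, repeat the test, and run git bisect good or git bisect bad accordingly; once git names the first bad commit, git bisect reset restores your branch.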

Also, you should check the CHANGELOG or other issues like https://github.com/AUTOMATIC1111/stable-diffusion-webui/discussions/10893#discussioncomment-6058130, because there may be an answer to your problem there.

preload_extensions_git_metadata for 9 extensions took 0.45s

Btw, you said you don't have any extensions, but the logs show that you have ControlNet and something else, since 7 of those 9 are built-in.

chrme avatar Jun 28 '23 20:06 chrme

Apparently this is a problem with Doggettx being enabled by default since May 18th; see #10763. I don't know why that issue was closed, since the problem remains.

Photosounder avatar Jun 28 '23 21:06 Photosounder

@Photosounder did you try using xformers instead?

dhwz avatar Jun 29 '23 05:06 dhwz

Since the problem seems to be the "Cross attention optimization" setting, I'll focus on testing that. It was set to Automatic, which here means Doggettx.

python.exe dedicated GPU memory during the hires fix phase at 1824x1024, and iteration speed during that phase only:

  • xformers: 3,668,996 kB, 2.05 s/it
  • Doggettx: 8,211,496 kB, 5.66 s/it
  • sdp: 3,712,008 kB, 2.77 s/it
  • sdp-no-mem: 3,656,708 kB, 2.23 s/it
  • sub-quadratic: 5,086,224 kB, 7.76 s/it
  • V1: OOM
  • InvokeAI: 8,184,876 kB, 5.69 s/it
  • None: OOM, takes 8,000,524 kB just for the 912x512 phase

So clearly my problem is that since May 18th the default has been Doggettx, which is one of the less acceptable options (the April 30th commit I had been using was clearly using sub-quadratic). I think the order of preference for the Automatic default (picking the first available option) should be changed to:

  • sdp / sdp-no-mem
  • xformers
  • sub-quadratic
  • InvokeAI
  • Doggettx
  • V1
  • None
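
Until the default changes, a workaround is to force one of the better options explicitly instead of relying on Automatic. A minimal sketch, assuming the stock webui-user.bat layout and using only flags already mentioned in this thread:

@echo off
set COMMANDLINE_ARGS=--lowvram --skip-torch-cuda-test --opt-sdp-no-mem-attention
rem or swap the last flag for --xformers
call webui.bat

The Cross attention optimization setting can also be switched away from Automatic directly in the UI.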

Btw, has xformers become deterministic? I ran the same prompt three times and got the exact same image every time, which is surprising. If so, then maybe it should be the first choice (but since it has to be enabled from the command line anyway, it doesn't matter).

Photosounder avatar Jun 29 '23 15:06 Photosounder

These settings helped me fix my RAM problems on a 3070:

--xformers 
--opt-sdp-no-mem-attention 
--opt-channelslast 
--upcast-sampling 
--no-half-vae
PYTORCH_CUDA_ALLOC_CONF=backend:cudaMallocAsync

That, and using Tiled Diffusion and Tiled VAE.
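
For anyone copying these, a sketch of how they fit into what I assume is the stock webui-user.bat layout; note that PYTORCH_CUDA_ALLOC_CONF is an environment variable, not a command-line argument, so it gets its own set line rather than going inside COMMANDLINE_ARGS:

@echo off
set PYTORCH_CUDA_ALLOC_CONF=backend:cudaMallocAsync
set COMMANDLINE_ARGS=--xformers --opt-sdp-no-mem-attention --opt-channelslast --upcast-sampling --no-half-vae
call webui.bat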

demoran23 avatar Jun 30 '23 20:06 demoran23

These settings helped me fix my RAM problems on a 3070:

--xformers 
--opt-sdp-no-mem-attention 
--opt-channelslast 
--upcast-sampling 
--no-half-vae
PYTORCH_CUDA_ALLOC_CONF=backend:cudaMallocAsync

That, and using Tiled Diffusion and Tiled VAE.

Mine is a 3070 as well. It generates 2 images and then CUDA runs out of memory. At first it was taking on average 6 hours to do what I was doing; now it is 20+. I've had a lot of issues a lot of times. So COMMANDLINE_ARGS: --xformers --opt-sdp-no-mem-attention --opt-channelslast --upcast-sampling --no-half-vae PYTORCH_CUDA_ALLOC_CONF=backend:cudaMallocAsync ?

I'm afraid to try it, because I've had to reinstall Stable Diffusion a few times after changing something in the webui.

PsychoGarlic avatar Jul 04 '23 17:07 PsychoGarlic

Yep, it crashed.

PsychoGarlic avatar Jul 04 '23 18:07 PsychoGarlic

Btw, has xformers become deterministic? I ran the same prompt three times and got the exact same image every time, which is surprising. If so, then maybe it should be the first choice (but since it has to be enabled from the command line anyway, it doesn't matter).

IIRC, xformers is deterministic when using --opt-sdp-no-mem-attention.

demoran23 avatar Jul 04 '23 18:07 demoran23