CUDA out of memory? (3080ti 12GB)

Open dacobi opened this issue 2 years ago • 8 comments

I just installed Stable Diffusion 2.0 on my Linux box and it's sort of working.

I keep getting "CUDA out of memory" errors. When using the txt2img example I had to decrease the resolution to 384x384 to avoid a crash.

With the x4 upscaler web interface I always end up with a crash like: CUDA out of memory. Tried to allocate 2.81 GiB (GPU 0; 11.77 GiB total capacity; 7.84 GiB already allocated...

I tried setting PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:64 (and other values), which allowed me to increase the txt2img resolution to 512x512, but the x4 upscaler still crashes.
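For anyone else trying this, the variable just needs to be exported in the shell that launches the script, before PyTorch initializes its allocator; a minimal sketch (the 64 MB value is the one tried above, not a recommendation):

```shell
# Configure the CUDA caching allocator before launching the script; a smaller
# max_split_size_mb reduces fragmentation at some allocator overhead.
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:64
echo "$PYTORCH_CUDA_ALLOC_CONF"
```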

Is this normal behavior or do I simply need more VRAM?

My machine is 5900X, 32GB RAM, 3080ti 12GB, Pop!_OS 22.04 LTS.

dacobi avatar Nov 27 '22 14:11 dacobi

I have the same problem: 5800X3D, 32GB RAM, 3080 Ti, and out of memory.

vision34 avatar Nov 27 '22 15:11 vision34

I have an A4000 here. I'm getting images with resolutions up to 1856x1024 with txt2img as well as img2img, with or without attention slicing. I suppose xformers help a lot here.

With the upsampling example, however, my system fails miserably.

  • The maximum resolution of the initial image without a crash of the pipeline is 248x248, resulting in a 768x768 image.
  • With 256x256: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 8.00 GiB (GPU 0; 15.74 GiB total capacity; 10.19 GiB already allocated; 4.88 GiB free; 10.22 GiB reserved in total by PyTorch)

But this is not the worst part: the upsampler example always generates a black image, regardless of the initial image resolution.

CUDA_MODULE_LOADING=LAZY is set.

I have played with PYTORCH_CUDA_ALLOC_CONF parameters (it makes no sense to go below 128 MB), and with and without enable_attention_slicing().

xformers are installed and seem to be working, but I can't be sure they are functioning properly and I have no clue how to test them. I just started to learn Python and the whole ecosystem around SD.

One thing is for sure: torch.ops.xformers.efficient_attention_forward_generic is not available, but torch.ops.xformers.efficient_attention_forward_cutlass is there. How do I check which of the modules is in use?
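A quick way to probe which custom ops got registered is attribute lookup on the op namespace; here's a small helper sketch (the helper name is mine, and it only tells you which ops are *available*, not which path a given model actually dispatches to):

```python
def op_available(namespace, name):
    """Return True if an op with this name resolves on the given op namespace.

    torch.ops.<lib> resolves registered custom ops lazily; a missing op fails
    on attribute access, which we treat as "not available".
    """
    try:
        return getattr(namespace, name, None) is not None
    except RuntimeError:
        return False

# Usage (assumes torch + xformers are installed):
#   import torch
#   op_available(torch.ops.xformers, "efficient_attention_forward_cutlass")
#   op_available(torch.ops.xformers, "efficient_attention_forward_generic")
```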

How can I find out where I messed up my toolchain installation? I suspect I broke something with xformers: I had to install them from the git repo and use torch==1.13.0+cu117 torchvision==0.14.0+cu117 from https://download.pytorch.org/whl/cu117 to make them work.

pkpro avatar Nov 27 '22 17:11 pkpro

Hi, I see two reasons inference would work with txt2img and img2img but not with upscaling.

  1. txt2img and img2img have 4 channel multipliers in their configs, meaning they downscale 3 times (x8 lower-res latent). As I (tried to) explain(ed) here #5, the Encoder/Decoder classes have attention at their smallest scale, which is usually the memory bottleneck. But in the cases of txt2img and img2img it is fine because we have 3 downscales, meaning your 1856x1024 inference resulted in a 232x128 attention. In the case of super-res there are only 3 channel multipliers, meaning x4 upscaling :eyes:, which would require your 248x248 inference to perform attention on a 248x248 sample.
  2. The upscaling model has issues in the Encoder/Decoder, where it just breaks in half precision (#44 + cf upscaling config). In the txt2img and img2img scripts, decoding (and encoding) is done within the torch.autocast scope.
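The attention bottleneck in point 1 can be checked with back-of-envelope arithmetic: a naive self-attention score matrix over a seq_len = H×W feature map needs seq_len² elements, and (ignoring batch and heads) this reproduces the exact allocation sizes reported in this thread, 8.00 GiB for a 256x256 input in half precision and 256.00 GiB for a 512x512 input in full precision:

```python
# Back-of-envelope sketch: GiB needed for one naive self-attention score
# matrix over an H x W feature map (seq_len = H * W tokens), ignoring
# batch size and attention heads.
def attn_matrix_gib(h, w, bytes_per_el):
    seq_len = h * w
    return seq_len * seq_len * bytes_per_el / 2**30

print(attn_matrix_gib(256, 256, 2))  # 8.0   (fp16, matches "Tried to allocate 8.00 GiB")
print(attn_matrix_gib(512, 512, 4))  # 256.0 (fp32, matches "Tried to allocate 256.00 GiB")
```

This quadratic blow-up is also why memory-efficient attention (xformers, attention slicing) matters so much more for the upscaler than for txt2img.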

I'm sorry I can't provide more than attempts at explaining things. On my side, with a 3090 (24GB), I was able to do inference at 256x256 (only with xformers; I could probably push it, but no use), and a quick look at nvidia-smi during inference showed it required ~18GB (the A4000 has 16GB, I believe?).

Hope that's clear, please tell me if I said something wrong.

ThibaultLSDC avatar Nov 29 '22 08:11 ThibaultLSDC

First of all, thanks for the info @ThibaultLSDC. And yes, your view on the VRAM requirements seems to be valid: 3 downscales means 2^3 = x8 downscale, and the 248x248 limit is kinda clear too.

Removing 'with torch.autocast("cuda"):' from the example (I probably put it there myself out of habit) fixed the problem with black images.
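For reference, the fix boils down to keeping the decode step outside the autocast scope; in pseudocode (the sampler/model calls here are illustrative placeholders, not this repo's exact API):

```python
# Pseudocode sketch: sample under autocast, decode in full precision.
with torch.autocast("cuda"):
    latents = sampler.sample(cond, shape)   # half-precision sampling is fine

latents = latents.float()                   # back to fp32 before decoding
image = model.decode_first_stage(latents)   # decoder breaks in fp16 -> black images
```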

Though this x4 upscaler does not seem very practical on my HW, it is a good start. It adds rainbow noise: just a little bit, but you can still see it, especially if you resize the image later on. Can this be fixed with another scheduler?

pkpro avatar Nov 30 '22 14:11 pkpro

I'm running x4 upscaling on a V100 (40GB) without xformers; it works on 256x256 inputs but fails on 512x512 inputs with this error:

RuntimeError: CUDA out of memory. Tried to allocate 256.00 GiB (GPU 0; 39.41 GiB total capacity; 5.84 GiB already allocated; 31.24 GiB free; 6.36 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

ArielReplicate avatar Dec 02 '22 11:12 ArielReplicate

Same on a 3090 24Gb,

~~txt2img~~

~~Using (768-v-ema.ckpt) at 512px (default) works great, but with H/W set to 768px it returns:~~

RuntimeError: CUDA out of memory. 
Tried to allocate 9.49 GiB (
	GPU 0; 
	24.00 GiB total capacity; 
	9.88 GiB already allocated;
	2.15 GiB free; 
	19.38 GiB reserved in total by PyTorch
) 

If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

No, actually this was just user error: the batch size is 3 by default, and using 2 works perfectly fine.

Gradio/superresolution

And upscaling (x4-upscaler-ema.ckpt) errors out after DDIM sampling as it tries to allocate 256 GB of memory?

RuntimeError: CUDA out of memory. 
Tried to allocate 256.00 GiB (
	GPU 0; 
	24.00 GiB total capacity; 
	5.84 GiB already allocated; 
	15.37 GiB free; 
	6.16 GiB reserved in total by PyTorch) 
If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

melMass avatar Dec 03 '22 20:12 melMass

Same here. Do we have any hardware requirement specs for the x4 upscaler?

nguyenmeteorops avatar Dec 06 '22 18:12 nguyenmeteorops

On my headless Windows 10 machine with a 3070 (8GB, overclocked) and 16GB RAM, I am able to generate up to 2048x1600 in txt2img (2.13 s/it). AUTOMATIC1111 44c46f0, Studio Driver 527.56. For some reason it was not possible using SD 2.0! But it's hard to get good results at high resolutions.

Steps: 25, Sampler: DDIM, CFG scale: 7.296999, Seed: 2093474288, Size: 2048x1600, Model hash: e1542d5a, Model: v2-1_768-nonema-pruned

PS C:\Users\AILab> nvidia-smi
Fri Dec  9 00:10:40 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 527.56       Driver Version: 527.56       CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ... WDDM  | 00000000:01:00.0  On |                  N/A |
| 34%   65C    P2   219W / 220W |   6386MiB /  8192MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1504    C+G   ...dows\System32\LogonUI.exe    N/A      |
|    0   N/A  N/A      1516    C+G   C:\Windows\System32\dwm.exe     N/A      |
|    0   N/A  N/A      8556      C   C:\Python310\python.exe         N/A      |
+-----------------------------------------------------------------------------+

Edit: After turning on High Performance mode in the driver, I got 1.89 s/it @ 2048x1600px.

DDIM Sampler: 100%|████████████████████████████| 25/25 [00:47<00:00, 1.89s/it]

ataa avatar Dec 08 '22 20:12 ataa