Streamlit SD-Upscale x4, CUDA out of memory. Tried to allocate 400.00 GiB
CUDA OOM is normal enough on smaller GPUs, but... 400 GiB? I don't think a GPU like that exists, so this is obviously a bug.
512x512 input. It goes through every DDIM step before the kaboom.
Using a conda environment created from the environment YAML. Running on a 4090 machine.
Full log:
```
Traceback (most recent call last):
  File "c:\users\------\miniconda3\envs\ldm\lib\site-packages\streamlit\runtime\scriptrunner\script_runner.py", line 556, in _run_script
    exec(code, module.__dict__)
  File "Z:\SD\SD_2.0\stablediffusion\scripts\streamlit\superresolution.py", line 170, in <module>
    run()
  File "Z:\SD\SD_2.0\stablediffusion\scripts\streamlit\superresolution.py", line 152, in run
    result = paint(
  File "Z:\SD\SD_2.0\stablediffusion\scripts\streamlit\superresolution.py", line 109, in paint
    x_samples_ddim = model.decode_first_stage(samples)
  File "c:\users\------\miniconda3\envs\ldm\lib\site-packages\torch\autograd\grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "z:\sd\sd_2.0\stablediffusion\ldm\models\diffusion\ddpm.py", line 826, in decode_first_stage
    return self.first_stage_model.decode(z)
  File "z:\sd\sd_2.0\stablediffusion\ldm\models\autoencoder.py", line 90, in decode
    dec = self.decoder(z)
  File "c:\users\------\miniconda3\envs\ldm\lib\site-packages\torch\nn\modules\module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "z:\sd\sd_2.0\stablediffusion\ldm\modules\diffusionmodules\model.py", line 631, in forward
    h = self.mid.attn_1(h)
  File "c:\users\------\miniconda3\envs\ldm\lib\site-packages\torch\nn\modules\module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "z:\sd\sd_2.0\stablediffusion\ldm\modules\diffusionmodules\model.py", line 191, in forward
    w_ = torch.bmm(q,k)  # b,hw,hw  w[b,i,j]=sum_c q[b,i,c]k[b,c,j]
RuntimeError: CUDA out of memory. Tried to allocate 400.00 GiB (GPU 0; 23.99 GiB total capacity; 6.47 GiB already allocated; 0 bytes free; 17.14 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
```
Same here; it's trying to allocate 2304 GiB (I checked my pockets, I don't have any to spare :P) when trying to upscale:
```
RuntimeError: CUDA out of memory. Tried to allocate 2304.00 GiB (GPU 0; 47.54 GiB total capacity; 10.90 GiB already allocated; 32.78 GiB free; 11.82 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
```
How do I get the upscale model? This address is not usable now. https://huggingface.co/stabilityai/stable-diffusion-2-depth/resolve/main/x4-upscaler-ema.ckpt
Presumably you want this repo?
- https://huggingface.co/stabilityai/stable-diffusion-x4-upscaler/tree/main
With this file potentially?
- https://huggingface.co/stabilityai/stable-diffusion-x4-upscaler/blob/main/x4-upscaler-ema.ckpt
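If it helps, the checkpoint can also be fetched programmatically; a small sketch assuming huggingface_hub is installed (repo id and filename taken from the links above):

```python
# Downloads the x4 upscaler checkpoint into the local Hugging Face cache.
# Requires: pip install huggingface_hub
from huggingface_hub import hf_hub_download

ckpt_path = hf_hub_download(
    repo_id="stabilityai/stable-diffusion-x4-upscaler",
    filename="x4-upscaler-ema.ckpt",
)
print(ckpt_path)  # pass this local path to the superresolution script
```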
Thanks!
Coming back to the original issue, it actually isn't surprising to me that you get such memory issues, especially if you don't have xformers (which I'm assuming). The traceback says the error occurs in "<>\stablediffusion\ldm\modules\diffusionmodules\model.py" at line 631, which is in the decoder. You're upsampling from 512x512, which means the decoder gets a 512x512 input and applies attention to it (at least once). The attention matrix thus computed is of size 512²x512²; multiply that by 4 bytes per float32 and you get a theoretical 256 GiB (~275 GB) tensor. Not quite the values mentioned above, but roughly the same order of magnitude. Hope this helps, and please tell me if I said something wrong :)
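For reference, here is that back-of-the-envelope as a few lines of Python (purely illustrative arithmetic, no model code; it ignores batch size and any extra copies the attention code makes):

```python
# Rough size of one full self-attention score matrix in the VAE decoder,
# assuming a 512x512 feature map and float32 scores (no batch/head factors).
tokens = 512 * 512                    # one query/key per spatial position
bytes_per_float = 4                   # float32
attn_bytes = tokens * tokens * bytes_per_float

print(f"{attn_bytes / 1024**3:.0f} GiB")  # -> 256 GiB
print(f"{attn_bytes / 1000**3:.0f} GB")   # -> 275 GB
```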
I was using xformers! Also, isn't this based on similar model info to LDSR? I have been using that a lot to upscale my SD gens, 512x512 at least, without xformers. So I have two thoughts on why this is happening:
- The code isn't really functioning right, since this only happens after all the generation steps; when decoding the image it's doing something funky.
- The model is basically useless and was only made to go from 128->512 for some odd reason.
Try adding this to your .bashrc file:
export PYTORCH_CUDA_ALLOC_CONF=garbage_collection_threshold:0.6,max_split_size_mb:128
And then source it by running:
source ~/.bashrc
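If editing shell profiles isn't an option, the same setting can, as far as I know, also be applied from inside the Python script, as long as it happens before the first CUDA allocation; a minimal sketch:

```python
import os

# PyTorch parses PYTORCH_CUDA_ALLOC_CONF lazily, when the CUDA caching
# allocator is first initialized, so set it before any tensor hits the GPU.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = (
    "garbage_collection_threshold:0.6,max_split_size_mb:128"
)

import torch  # imported only after the variable is set

x = torch.zeros(1, device="cuda")  # first allocation picks up the config
```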
Well, for some reason I couldn't find the config file for LDSR, so I downloaded the model ckpt and checked its 'state_dict', and there does not seem to be attention in the decoder.
In the Decoder class from "<>\stablediffusion\ldm\modules\diffusionmodules\model.py" there is the attn_type kwarg, which I guess is set to 'none' in LDSR? But given the traceback you got, there is definitely some attention going on.
Btw 2 things I did not say in my first message:
- I did manage to infer on higher res thanks to xformers. On a 3090 I was able to do 256->1024. 512->2048 was way too much for me.
- I'm running the gradio script, no idea about the differences with the streamlit one.
Looking into SDv2's config file, there is the comment:
# attn_type: "vanilla-xformers" this model needs efficient attention to be feasible on HR data, also the decoder seems to break in half precision (UNet is fine though)
Your xformers setup might not be working? Maybe try checking the value of XFORMERS_IS_AVAILABLE in "<>\stablediffusion\ldm\modules\diffusionmodules\model.py".
Hope this helps
Edit: checking SDv2's 'state_dict', I do find attention layers in the decoder this time.
Xformers does work; I was using it with the main sampler since it is required for 768 on a 4090, and it also lists the xformers overwrites in the console. Can't do any .bashrc changes since this is a Windows machine; not everyone runs Linux.
Mine tried to allocate 900.00 GiB after reaching 100%. No difference with or without xformers for me. Running it on Colab with the following command:
`!python scripts/gradio/superresolution.py configs/stable-diffusion/x4-upscaling.yaml '/content/drive/MyDrive/AI/models/x4-upscaler-ema.ckpt'`
PS: txt2img works fine with the same setup.
This is actually an issue with decode_first_stage. Any high-res image uses an excessive amount of VRAM. I can actually encode a tensor of 960x704 but can't decode the result since I run out of VRAM. This is also an issue with img2img: after about 2048 it can still encode, but decode needs WAY too much VRAM. I was considering testing the LDSR ckpt/decoder to see if I can get an image out that way. It just sucks that it would have to load two models to do it.
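For anyone who wants to confirm where the memory goes, a small helper like the sketch below (hypothetical, not part of the repo) makes the encode/decode asymmetry easy to measure around the failing calls:

```python
import torch

def peak_vram(fn, *args, **kwargs):
    """Call fn and report the peak CUDA memory the call needed, in GiB."""
    torch.cuda.reset_peak_memory_stats()
    with torch.no_grad():
        out = fn(*args, **kwargs)
    torch.cuda.synchronize()
    name = getattr(fn, "__name__", "call")
    print(f"{name}: peak {torch.cuda.max_memory_allocated() / 2**30:.2f} GiB")
    return out

# Example usage inside the superresolution script, around the failing line:
#   x_samples_ddim = peak_vram(model.decode_first_stage, samples)
```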
Also attempted to allocate 2034 GB when upscaling a 768x768 image. My reading of the above discussion is that this is not surprising given the model architecture.
I guess I'm punting on superresolution until I have time to dig a little deeper into it.
As a professional developer, I don't release something if it doesn't work. Unless your intended users are people running massive server farms.