
[OOM] Memory blows out when trying to upscale images larger than 128x128 using StableDiffusionUpscalePipeline

qunash opened this issue 2 years ago • 12 comments

Describe the bug

When trying to upscale images larger than 128x128, the progress bar reaches 100% and then the pipeline crashes with a CUDA out-of-memory error.

With 512x512 images it tries to allocate 256.00 GiB!

Reproduction

import requests
from PIL import Image
from io import BytesIO
from diffusers import StableDiffusionUpscalePipeline
import torch

# Load the fp16 weights of the x4 upscaler and move the pipeline to the GPU
model_id = "stabilityai/stable-diffusion-x4-upscaler"
pipeline = StableDiffusionUpscalePipeline.from_pretrained(model_id, revision="fp16", torch_dtype=torch.float16)
pipeline = pipeline.to("cuda")

# Fetch a 512x512 test image
url = "https://www.freepnglogos.com/uploads/512x512-logo/512x512-transparent-circle-instagram-media-network-social-logo-new-16.png"
response = requests.get(url)
low_res_img = Image.open(BytesIO(response.content)).convert("RGB")

prompt = ""
upscaled_image = pipeline(prompt=prompt, image=low_res_img).images[0]
display(upscaled_image)  # notebook-only; use upscaled_image.save("out.png") in a plain script

Logs

RuntimeError: CUDA out of memory. Tried to allocate 256.00 GiB (GPU 0; 14.76 GiB total capacity; 4.77 GiB already allocated; 8.28 GiB free; 5.18 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

System Info

  • diffusers version: 0.9.0
  • Platform: Linux-5.10.133+-x86_64-with-Ubuntu-18.04-bionic
  • Python version: 3.7.15
  • PyTorch version (GPU?): 1.12.1+cu113 (True)
  • Huggingface_hub version: 0.11.0
  • Transformers version: 4.24.0

qunash avatar Nov 26 '22 10:11 qunash

I can reproduce this on MPS as well. It didn’t crash but created a ton of swap.

carson-katri avatar Nov 26 '22 20:11 carson-katri

Thank you for reporting this. The reason this happens is that as your initial image gets bigger (e.g. 512x512), the latent representation in the decoder's attention block ends up being 512 (latent dim) x 512 (H) x 512 (W), which gets reshaped to batch = 1, seq = 512 x 512 (H x W), and channels = 512. That obviously will not work, since vanilla attention is quadratic in memory/compute with respect to the sequence length. So the larger your initial image dimensions, the more memory it will use.
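
To put concrete numbers on it, here is a back-of-the-envelope sketch; it assumes the attention scores are materialized as a single float32 seq x seq tensor, which is exactly what the 256.00 GiB in the traceback corresponds to:

def attn_scores_gib(side, bytes_per_el=4):
    # Memory for one full self-attention score matrix at latent side length `side`
    seq = side * side  # H x W positions flattened into the sequence dimension
    return seq * seq * bytes_per_el / 2**30

print(attn_scores_gib(128))  # 128x128 input ->   1.0 GiB, fits comfortably
print(attn_scores_gib(512))  # 512x512 input -> 256.0 GiB, the exact figure from the log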

A solution for now in the above code is to downsize the initial image to something manageable, e.g.:

low_res_img = low_res_img.resize((128, 128))

As noted below, you can also install xformers or try attention slicing:

pipeline.enable_attention_slicing()
# pipeline.enable_xformers_memory_efficient_attention()
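
For example, applied to the reproduction above (a sketch; the try/except just makes xformers optional, since enable_xformers_memory_efficient_attention raises if xformers is not installed):

low_res_img = low_res_img.resize((128, 128))  # keep the latent sequence length manageable

pipeline.enable_attention_slicing()  # compute attention in slices: lower peak memory, slightly slower
try:
    pipeline.enable_xformers_memory_efficient_attention()  # further savings if xformers is available
except Exception:
    pass  # xformers not installed; attention slicing alone still helps

upscaled_image = pipeline(prompt="", image=low_res_img).images[0]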

kashif avatar Nov 26 '22 20:11 kashif

Thank you for the explanation, I thought this may be the case. Resizing the image to 128x128 would produce a 512x512 image, correct?

carson-katri avatar Nov 26 '22 21:11 carson-katri

@carson-katri Yes correct!

kashif avatar Nov 26 '22 21:11 kashif

Any way to adapt the attention code to support larger images? The most common use case for this model would be upscaling outputs from SD.

monophthongal avatar Nov 26 '22 22:11 monophthongal

The model may support xformers and attention slicing, which could help I assume.

carson-katri avatar Nov 26 '22 22:11 carson-katri

@carson-katri correct you can try

pipeline.enable_attention_slicing()

and that should reduce memory usage in exchange for a small speed decrease, enabling larger inputs. With xformers installed, memory usage should be even lower, as you point out!

kashif avatar Nov 26 '22 22:11 kashif

I'm working on a tile-based solution that runs the upscale model on small, overlapping patches of a larger source image and then merges them back into the full-sized result. Much of the code is borrowed from the Real-ESRGAN upscaler, which supports this. I'll try to publish the code as soon as it's working.
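
The basic idea, as a minimal untested sketch (the hypothetical upscale_tiled helper below pastes tiles back directly, so hard seams are still possible; a real implementation feathers/blends the overlap regions the way Real-ESRGAN does):

from PIL import Image

def upscale_tiled(pipeline, img, prompt="", tile=128, overlap=16, scale=4):
    # Run the x4 upscaler on overlapping `tile`-sized patches and paste the
    # results into the full-sized output. Assumes the model accepts the edge
    # tile sizes; a robust version would pad edge tiles and blend overlaps.
    w, h = img.size
    out = Image.new("RGB", (w * scale, h * scale))
    step = tile - overlap
    for top in range(0, h, step):
        for left in range(0, w, step):
            box = (left, top, min(left + tile, w), min(top + tile, h))
            patch = pipeline(prompt=prompt, image=img.crop(box)).images[0]
            out.paste(patch, (left * scale, top * scale))
    return out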

un1tz3r0 avatar Nov 27 '22 15:11 un1tz3r0

I implemented probably the most simplistic form of tiling possible here: https://github.com/carson-katri/dream-textures/blob/aa0132b42dd14ddbf9491c13a7a46a01da2c880a/generator_process/actions/upscale.py

I’m sure there are much better approaches that would limit seams. Perhaps just tiling the latent decoding process? Not entirely sure. Looking forward to seeing the improvements that will be made in this pipeline!

carson-katri avatar Nov 28 '22 03:11 carson-katri

This might be related / interesting: https://github.com/huggingface/diffusers/pull/1454

patrickvonplaten avatar Nov 30 '22 12:11 patrickvonplaten

Did anybody try using xformers? E.g. see: https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion#diffusers.StableDiffusionUpscalePipeline.enable_xformers_memory_efficient_attention

patrickvonplaten avatar Nov 30 '22 13:11 patrickvonplaten

I tried it with xformers, I believe, and I think I got the same issue... I can re-run it... But the issue occurs when creating this empty tensor in the default attention block:

https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention.py#L331-L337

kashif avatar Nov 30 '22 13:11 kashif

You can upscale a 512x image with a ~20GB GPU (I didn't try with less) using the linked PR and xformers in the VAE's attention blocks (when properly picked up by the enablement, hence another PR). I have this running just fine on a private fork; it looks like all the missing pieces are arriving here (see this PR), else I can PR the required missing bits.

blefaudeux avatar Dec 01 '22 20:12 blefaudeux

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Dec 26 '22 15:12 github-actions[bot]

BTW, we should have better support for upscaling SD-1 once https://github.com/huggingface/diffusers/pull/1321 is merged :-)

patrickvonplaten avatar Jan 03 '23 12:01 patrickvonplaten

I know this is closed and these things are in the docs, but just wanted to say: if you're running into this issue, install the following:

pip install xformers
pip install triton==2.0.0.dev20221120

And add this to your pipeline:

import torch
from diffusers import DiffusionPipeline
from xformers.ops import MemoryEfficientAttentionFlashAttentionOp

pipe = DiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16)
pipe = pipe.to("cuda")
pipe.enable_xformers_memory_efficient_attention(attention_op=MemoryEfficientAttentionFlashAttentionOp)
# Workaround: Flash Attention does not accept the VAE's attention shape, so use the default memory-efficient op for the VAE
pipe.vae.enable_xformers_memory_efficient_attention(attention_op=None)
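
For the upscale pipeline this issue is about, the same pattern would look like this, reusing the imports from the snippet above (a sketch; I haven't verified this exact combination on the x4 upscaler):

from diffusers import StableDiffusionUpscalePipeline

upscaler = StableDiffusionUpscalePipeline.from_pretrained(
    "stabilityai/stable-diffusion-x4-upscaler", torch_dtype=torch.float16
).to("cuda")
upscaler.enable_xformers_memory_efficient_attention(attention_op=MemoryEfficientAttentionFlashAttentionOp)
upscaler.vae.enable_xformers_memory_efficient_attention(attention_op=None)  # same VAE workaround as above
upscaler.enable_attention_slicing()  # optional: trade a little speed for extra memory headroom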

eolszewski avatar Feb 08 '23 23:02 eolszewski