InvokeAI
[enhancement]: OOM error during VAE decode
Is there an existing issue for this?
- [X] I have searched the existing issues
OS
Linux
GPU
cuda
VRAM
24GB
What happened?
OOM error during VAE decode. VRAM usage skyrockets from ~15.5GB to ~40GB during decode for a 3072x3072 image.
Can we mitigate this somehow?
I tried calling self.enable_vae_slicing() in the constructor for StableDiffusionGeneratorPipeline, but the numbers stayed the same.
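For reference, this is how VAE slicing is normally switched on through the diffusers API in recent releases (a minimal sketch; the model id is just an example, and whether the method is surfaced on InvokeAI's pipeline class is an assumption). Slicing only splits the decode across the batch dimension, so it would not be expected to help for a single oversized image:

```python
# Hedged sketch: enabling VAE slicing on a stock diffusers pipeline.
# enable_vae_slicing() decodes batch items one at a time, so a single
# 3072x3072 image is still decoded in one pass and peak VRAM is unchanged.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # example model id
    torch_dtype=torch.float16,
).to("cuda")

pipe.enable_vae_slicing()  # or, equivalently: pipe.vae.enable_slicing()
```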
>> Image Generation Parameters:
{'prompt': 'pizza', 'iterations': 3, 'steps': 3, 'cfg_scale': 7.5, 'threshold': 0, 'perlin': 0, 'height': 3072, 'width': 3072, 'sampler_name': 'k_lms', 'seed': 3471489041, 'progress_images': False, 'progress_latents': True, 'save_intermediates': 5, 'generation_mode': 'txt2img', 'init_mask': '...', 'hires_fix': False, 'seamless': False, 'variation_amount': 0}
>> ESRGAN Parameters: False
>> Facetool Parameters: False
100%|█████████████████████████████████████████████████████████| 3/3 [00:16<00:00, 5.58s/it]
Generating: 0%| | 0/3 [00:19<?, ?it/s]
Traceback (most recent call last):
File "/home/bat/Documents/Code/InvokeAI/ldm/generate.py", line 517, in prompt2image
results = generator.generate(
File "/home/bat/Documents/Code/InvokeAI/ldm/invoke/generator/base.py", line 112, in generate
image = make_image(x_T)
File "/home/bat/Documents/Code/InvokeAI/ldm/invoke/generator/txt2img.py", line 40, in make_image
pipeline_output = pipeline.image_from_embeddings(
File "/home/bat/Documents/Code/InvokeAI/ldm/invoke/generator/diffusers_pipeline.py", line 365, in image_from_embeddings
image = self.decode_latents(result_latents)
File "/home/bat/invokeai/.venv/lib/python3.10/site-packages/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py", line 370, in decode_latents
image = self.vae.decode(latents).sample
File "/home/bat/invokeai/.venv/lib/python3.10/site-packages/diffusers/models/autoencoder_kl.py", line 144, in decode
decoded = self._decode(z).sample
File "/home/bat/invokeai/.venv/lib/python3.10/site-packages/diffusers/models/autoencoder_kl.py", line 116, in _decode
dec = self.decoder(z)
File "/home/bat/invokeai/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/bat/invokeai/.venv/lib/python3.10/site-packages/diffusers/models/vae.py", line 188, in forward
sample = up_block(sample)
File "/home/bat/invokeai/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/bat/invokeai/.venv/lib/python3.10/site-packages/diffusers/models/unet_2d_blocks.py", line 1718, in forward
hidden_states = upsampler(hidden_states)
File "/home/bat/invokeai/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/bat/invokeai/.venv/lib/python3.10/site-packages/diffusers/models/resnet.py", line 139, in forward
hidden_states = self.conv(hidden_states)
File "/home/bat/invokeai/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/bat/invokeai/.venv/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 463, in forward
return self._conv_forward(input, self.weight, self.bias)
File "/home/bat/invokeai/.venv/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 459, in _conv_forward
return F.conv2d(input, weight, bias, self.stride,
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.50 GiB (GPU 0; 23.65 GiB total capacity; 14.41 GiB already allocated; 5.59 GiB free; 15.71 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
>> Could not generate image.
>> Usage stats:
>> 0 image(s) generated in 19.64s
>> Max VRAM used for this generation: 15.47G. Current VRAM utilization: 10.64G
>> Max VRAM used since script start: 15.47G
Screenshots
No response
Additional context
No response
Contact Details
No response
Some forensic analysis follows.
This issue happens in our call to decode_latents(), but it is actually a problem in the AttentionBlock.forward() method in diffusers' attention.py:
attention_scores = torch.baddbmm(
    torch.empty(
        query_proj.shape[0],
        query_proj.shape[1],
        key_proj.shape[1],
        dtype=query_proj.dtype,
        device=query_proj.device,
    ),
    query_proj,
    key_proj.transpose(-1, -2),
    beta=0,
    alpha=scale,
)
attention_probs = torch.softmax(attention_scores.float(), dim=-1).type(attention_scores.dtype)
In the test case of a 2048x1152 image, the underlying latent tensor is [1, 4, 144, 256]. In this call, torch creates an empty tensor of shape [1, 36864, 36864] (allocating memory for it; 36864 is 144 * 256), and then baddbmm makes another allocation of the same [1, 36864, 36864] shape. The call to softmax is what actually causes the OOM error.
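To make those sizes concrete, here is the back-of-the-envelope arithmetic for that test case (a quick sketch; the fp16/fp32 breakdown is an assumption based on the .float() upcast in the snippet above):

```python
# Rough memory cost of the attention-score tensor for a 2048x1152 image,
# whose latents are [1, 4, 144, 256] (height/8 x width/8).
tokens = 144 * 256              # 36864 spatial positions
elements = tokens * tokens      # one [1, 36864, 36864] score matrix

fp16_gib = elements * 2 / 2**30  # baddbmm output in fp16   -> ~2.5 GiB
fp32_gib = elements * 4 / 2**30  # .float() copy for softmax -> ~5.1 GiB

print(f"scores (fp16):        {fp16_gib:.2f} GiB")
print(f"softmax input (fp32): {fp32_gib:.2f} GiB")
# Several full-size tensors can be alive at once (scratch tensor, baddbmm
# output, fp32 copy, softmax output), which is presumably why the softmax
# call is where the allocator finally gives up.
```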
This is something that somebody on the InvokeAI team needs to take up with diffusers and/or torch.
baddbmm (Batched Add, Batched Matrix-Matrix product) is used to get the product of query_proj and key_proj while also multiplying it by scale in a single step.
The empty tensor is superfluous, as this calculation doesn't actually use the add part (beta=0), so if that were the bottleneck we could replace it with bmm(query_proj, key_proj...) * scale, replacing the single baddbmm operation with one bmm and one multiply. (Why does baddbmm accept an alpha scalar but bmm does not? No clue!)
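A minimal sketch of that equivalence with toy shapes (the shapes are arbitrary; with beta=0, the contents of the scratch tensor never affect the result):

```python
import torch

# Toy shapes standing in for query_proj / key_proj: [batch, tokens, channels].
q = torch.randn(1, 8, 4)
k = torch.randn(1, 8, 4)
scale = 0.5

# What the diffusers code does: with beta=0 the empty tensor is only a
# shape/dtype/device template; its uninitialized values are ignored.
scratch = torch.empty(q.shape[0], q.shape[1], k.shape[1], dtype=q.dtype, device=q.device)
scores_baddbmm = torch.baddbmm(scratch, q, k.transpose(-1, -2), beta=0, alpha=scale)

# The proposed two-step replacement: one bmm plus one scalar multiply.
scores_bmm = torch.bmm(q, k.transpose(-1, -2)) * scale

print(torch.allclose(scores_baddbmm, scores_bmm))  # True
```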
However! You've let us know that the allocation of that empty tensor is not where it fails, and nothing retains a reference to it after the baddbmm is done, so that might all have been a red herring.
I wouldn't label this a bug. Operations on large amounts of data use large amounts of memory.
I think the thing you want is
- https://github.com/huggingface/diffusers/pull/1441
(I was hoping the "vae slicing" option was this, but it turns out those are two different things and this one hasn't been merged yet.)
The problems with tiled VAE as I see them are (a sketch of the tiled-decode API follows this list):
- Inconsistency between tiles (if we didn't care about that, we could just use embiggen)
- Requirement of xformers (and, along with it, its current nondeterministic behavior)
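For reference, this is roughly what using the tiled VAE from that PR would look like once it lands in a diffusers release; a minimal sketch under that assumption (the method names enable_vae_tiling / enable_tiling and the model id are illustrative and depend on the installed version):

```python
# Hedged sketch: tiled VAE decode, assuming a diffusers release that includes
# the tiled-VAE feature. Tiling decodes the latents in overlapping patches and
# blends the seams, which bounds peak VRAM at the cost of possible
# tile-to-tile inconsistency, as noted above.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # example model id
    torch_dtype=torch.float16,
).to("cuda")

pipe.enable_vae_tiling()  # or: pipe.vae.enable_tiling()
image = pipe("pizza", height=3072, width=3072).images[0]
```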
Allocating that empty tensor takes up the same amount of memory as baddbmm, and maybe that's what's pushing us over the edge. I'll investigate more.
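One way to investigate that is to measure peak allocations for each step in isolation with torch.cuda's memory statistics; a rough sketch using the shapes from the 2048x1152 test case (the 512-channel width is an assumption):

```python
import torch

def peak_gib(fn):
    """Run fn() and report the peak CUDA memory allocated while it runs
    (the figure includes tensors that were already alive)."""
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()
    fn()
    torch.cuda.synchronize()
    return torch.cuda.max_memory_allocated() / 2**30

tokens = 144 * 256  # 36864, as in the 2048x1152 test case
q = torch.randn(1, tokens, 512, dtype=torch.float16, device="cuda")
k = torch.randn(1, tokens, 512, dtype=torch.float16, device="cuda")

# Step 1: the baddbmm (empty scratch tensor plus its output, both [1, 36864, 36864] fp16).
print("baddbmm peak:", peak_gib(lambda: torch.baddbmm(
    torch.empty(1, tokens, tokens, dtype=q.dtype, device=q.device),
    q, k.transpose(-1, -2), beta=0, alpha=1.0)), "GiB")

# Step 2: the softmax with its fp32 upcast, on a persistent fp16 score matrix.
scores = torch.bmm(q, k.transpose(-1, -2))
print("softmax peak:", peak_gib(
    lambda: torch.softmax(scores.float(), dim=-1).to(scores.dtype)), "GiB")
```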
I tested and still ran out of VRAM on the softmax even after separating this out into bmm and a multiply. Everything gets converted to fp32.
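The generic way around that upcast blow-up is to slice the attention computation over the query dimension so that only a fraction of the score matrix is ever materialized. A rough sketch of the idea (this is the standard attention-slicing trick in general, not what this diffusers code path does; the chunk size and the fp32 upcast inside the loop are assumptions):

```python
import torch

def sliced_attention(q, k, v, scale, slice_size=4096):
    """Compute softmax(q @ k^T * scale) @ v in chunks over the query dimension.

    Only a [batch, slice_size, tokens] score matrix is materialized at a time,
    which bounds peak memory at the cost of a few extra kernel launches.
    q, k, v: [batch, tokens, channels] tensors.
    """
    out = torch.empty_like(q)
    for start in range(0, q.shape[1], slice_size):
        end = start + slice_size
        scores = torch.bmm(q[:, start:end], k.transpose(-1, -2)) * scale
        probs = torch.softmax(scores.float(), dim=-1).to(scores.dtype)
        out[:, start:end] = torch.bmm(probs, v)
    return out
```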
There has been no activity in this issue for 14 days. If this issue is still being experienced, please reply with an updated confirmation that the issue is still being experienced with the latest release.
This is still something to pursue.
There has been no activity in this issue for 14 days. If this issue is still being experienced, please reply with an updated confirmation that the issue is still being experienced with the latest release.
Solution in #2920