
[enhancement]: OOM error during VAE decode

Open psychedelicious opened this issue 2 years ago • 4 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues

OS

Linux

GPU

cuda

VRAM

24GB

What happened?

OOM error during VAE decode. VRAM usage skyrockets from ~15.5GB to ~40GB during decode for a 3072x3072 image.

Can we mitigate this somehow?

I tried calling self.enable_vae_slicing() in the constructor for StableDiffusionGeneratorPipeline but the numbers stayed the same.

>> Image Generation Parameters:

{'prompt': 'pizza', 'iterations': 3, 'steps': 3, 'cfg_scale': 7.5, 'threshold': 0, 'perlin': 0, 'height': 3072, 'width': 3072, 'sampler_name': 'k_lms', 'seed': 3471489041, 'progress_images': False, 'progress_latents': True, 'save_intermediates': 5, 'generation_mode': 'txt2img', 'init_mask': '...', 'hires_fix': False, 'seamless': False, 'variation_amount': 0}

>> ESRGAN Parameters: False
>> Facetool Parameters: False
100%|█████████████████████████████████████████████████████████| 3/3 [00:16<00:00,  5.58s/it]
Generating:   0%|                                                     | 0/3 [00:19<?, ?it/s]
Traceback (most recent call last):
  File "/home/bat/Documents/Code/InvokeAI/ldm/generate.py", line 517, in prompt2image
    results = generator.generate(
  File "/home/bat/Documents/Code/InvokeAI/ldm/invoke/generator/base.py", line 112, in generate
    image = make_image(x_T)
  File "/home/bat/Documents/Code/InvokeAI/ldm/invoke/generator/txt2img.py", line 40, in make_image
    pipeline_output = pipeline.image_from_embeddings(
  File "/home/bat/Documents/Code/InvokeAI/ldm/invoke/generator/diffusers_pipeline.py", line 365, in image_from_embeddings
    image = self.decode_latents(result_latents)
  File "/home/bat/invokeai/.venv/lib/python3.10/site-packages/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py", line 370, in decode_latents
    image = self.vae.decode(latents).sample
  File "/home/bat/invokeai/.venv/lib/python3.10/site-packages/diffusers/models/autoencoder_kl.py", line 144, in decode
    decoded = self._decode(z).sample
  File "/home/bat/invokeai/.venv/lib/python3.10/site-packages/diffusers/models/autoencoder_kl.py", line 116, in _decode
    dec = self.decoder(z)
  File "/home/bat/invokeai/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/bat/invokeai/.venv/lib/python3.10/site-packages/diffusers/models/vae.py", line 188, in forward
    sample = up_block(sample)
  File "/home/bat/invokeai/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/bat/invokeai/.venv/lib/python3.10/site-packages/diffusers/models/unet_2d_blocks.py", line 1718, in forward
    hidden_states = upsampler(hidden_states)
  File "/home/bat/invokeai/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/bat/invokeai/.venv/lib/python3.10/site-packages/diffusers/models/resnet.py", line 139, in forward
    hidden_states = self.conv(hidden_states)
  File "/home/bat/invokeai/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/bat/invokeai/.venv/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 463, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/home/bat/invokeai/.venv/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 459, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.50 GiB (GPU 0; 23.65 GiB total capacity; 14.41 GiB already allocated; 5.59 GiB free; 15.71 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

>> Could not generate image.

>> Usage stats:
>>   0 image(s) generated in 19.64s
>>   Max VRAM used for this generation: 15.47G. Current VRAM utilization: 10.64G
>>   Max VRAM used since script start:  15.47G

Screenshots

No response

Additional context

No response

Contact Details

No response

psychedelicious avatar Feb 15 '23 12:02 psychedelicious

Some forensic analysis follows.

This issue surfaces in our call to decode_latents(), but the root cause is in the AttentionBlock forward() method in diffusers' attention.py:

            attention_scores = torch.baddbmm(
                torch.empty(
                    query_proj.shape[0],
                    query_proj.shape[1],
                    key_proj.shape[1],
                    dtype=query_proj.dtype,
                    device=query_proj.device,
                ),
                query_proj,
                key_proj.transpose(-1, -2),
                beta=0,
                alpha=scale,
            )
            attention_probs = torch.softmax(attention_scores.float(), dim=-1).type(attention_scores.dtype)

In the test case of a 2048x1152 image, the underlying latent tensor is [1, 4, 144, 256]. In this call, torch first allocates an empty tensor of shape [1, 36864, 36864] (36864 is 144 * 256), then baddbmm allocates another [1, 36864, 36864] tensor for its output. The subsequent call to softmax is what actually triggers the OOM error.
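
As a rough sanity check on those numbers (my own back-of-the-envelope arithmetic, assuming the standard 8x VAE downsampling factor), the attention buffer grows with the square of the number of flattened latent positions:

    # Estimate the size of one [1, tokens, tokens] attention buffer.
    def attention_buffer_gib(height: int, width: int, bytes_per_elem: int) -> float:
        tokens = (height // 8) * (width // 8)  # flattened latent positions
        return tokens * tokens * bytes_per_elem / 2**30

    # 2048x1152 test case above: 144 * 256 = 36864 tokens
    attention_buffer_gib(1152, 2048, 2)  # ~2.5 GiB per fp16 buffer
    attention_buffer_gib(1152, 2048, 4)  # ~5.1 GiB for the fp32 softmax copy

    # 3072x3072 case from the original report: 384 * 384 = 147456 tokens
    attention_buffer_gib(3072, 3072, 2)  # ~40.5 GiB, matching the traceback above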

This is something that somebody on the InvokeAI team needs to take up with diffusers and/or torch.

JPPhoto avatar Feb 15 '23 15:02 JPPhoto

baddbmm (Batched Add, Batched Matrix-Matrix product) is used to get the product of query_proj and key_proj while also multiplying it by scale in a single step.

The empty tensor is superfluous, as this calculation doesn't actually use the add part (beta=0). If that allocation were the bottleneck, we could replace it with bmm(query_proj, key_proj...) * scale, swapping the single baddbmm operation for one bmm and one multiply, as sketched below. (Why does baddbmm accept an alpha scalar but bmm does not? No clue!)
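
A minimal sketch of what that replacement could look like (shapes as in the snippet above; this is not a tested patch):

    import torch

    def attention_scores_bmm(query_proj: torch.Tensor,
                             key_proj: torch.Tensor,
                             scale: float) -> torch.Tensor:
        # With beta=0 the baddbmm "add" input is ignored, so a plain batched
        # matmul followed by a scalar multiply produces the same scores
        # without the extra torch.empty allocation.
        return torch.bmm(query_proj, key_proj.transpose(-1, -2)) * scale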

However! You've let us know that the allocation of that empty tensor is not where it fails, and nothing retains a reference to it after the baddbmm is done, so that might all have been a red herring.

I wouldn't label this a bug. Operations on large amounts of data use large amounts of memory.

I think the thing you want is

  • https://github.com/huggingface/diffusers/pull/1441

(I was hoping the "vae slicing" option was this, but it turns out those are two different things and this one hasn't been merged yet.)
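
For anyone following along, a hypothetical usage sketch of the difference (assuming a diffusers release where both helpers exist; the tiling PR above hadn't been merged when this was written, and the model id is just a placeholder):

    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")

    # Sliced VAE: decodes a batch one image at a time, so it helps when
    # iterations > 1 but does nothing for a single huge image like 3072x3072.
    pipe.enable_vae_slicing()

    # Tiled VAE (the linked PR): splits each image spatially into tiles during
    # decode, which is what bounds memory for a large single image.
    pipe.vae.enable_tiling()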

keturn avatar Feb 16 '23 21:02 keturn

The problems with tiled VAE as I see them are:

  • Inconsistency between tiles (if we didn't care about that, we could just use embiggen)
  • Requirement of xformers (and along with it comes its current nondeterministic behavior)

Allocating that empty tensor takes up the same amount of memory as the baddbmm output, and maybe that's what's pushing us over the edge. I'll investigate more.

JPPhoto avatar Feb 17 '23 13:02 JPPhoto

I tested and still ran out of VRAM on the softmax even after separating this out into bmm and a multiply. Everything gets converted to fp32 for the softmax step.
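
One direction worth noting (a sketch only, not what was tested above): computing the scores and the fp32 softmax in slices along the query dimension keeps the transient fp32 copy bounded, at the cost of still materializing the full attention map in the original dtype:

    import torch

    def sliced_attention_probs(query_proj, key_proj, scale, slice_size=4096):
        # Illustrative only. The full [batch, tokens, tokens] result is still
        # allocated in the original dtype, but the fp32 upcast for softmax and
        # the matmul scratch only ever exist for one slice at a time.
        batch, tokens, _ = query_proj.shape
        probs = torch.empty(batch, tokens, tokens,
                            dtype=query_proj.dtype, device=query_proj.device)
        key_t = key_proj.transpose(-1, -2)
        for start in range(0, tokens, slice_size):
            end = min(start + slice_size, tokens)
            scores = torch.bmm(query_proj[:, start:end], key_t) * scale
            probs[:, start:end] = scores.float().softmax(dim=-1).to(scores.dtype)
        return probs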

JPPhoto avatar Feb 17 '23 15:02 JPPhoto

There has been no activity in this issue for 14 days. If this issue is still being experienced, please reply with an updated confirmation that the issue is still being experienced with the latest release.

github-actions[bot] avatar Mar 05 '23 06:03 github-actions[bot]

This is still something to pursue.

psychedelicious avatar Mar 05 '23 20:03 psychedelicious

There has been no activity in this issue for 14 days. If this issue is still being experienced, please reply with an updated confirmation that the issue is still being experienced with the latest release.

github-actions[bot] avatar Mar 21 '23 06:03 github-actions[bot]

Solution in #2920

psychedelicious avatar Mar 21 '23 08:03 psychedelicious