
Inference is slower with SD2.0 and memory_efficient_attention

Open antoche opened this issue 2 years ago • 3 comments

Describe the bug

The release notes claim that using enable_xformers_memory_efficient_attention makes inference faster.

In my tests, this is true when using SD 1.5, but false when using SD 2.0.

Some timings (in seconds) from my machine (using a Quadro RTX 8000), using Euler scheduler and 30 steps with the StableDiffusionPipeline:

| model | resolution | xformers disabled (s) | xformers enabled (s) |
|-------|------------|-----------------------|----------------------|
| SD1.5 | 512x512 | 2.3 | 1.9 |
| SD2.0 | 512x512 | 1.9 | 2.6 |
| SD2.0 | 576x576 | 3.0 | 3.8 |
| SD2.0 | 640x640 | 3.5 | 4.9 |
| SD2.0 | 704x704 | 4.8 | 6.7 |
| SD2.0 | 768x768 | 5.5 | 8.3 |

So it looks like using xformers' memory-efficient attention provides a speed benefit with SD 1.5, but the effect is negative with SD 2.0, regardless of the resolution.

I would recommend adding performance tests to catch these kinds of issues.

Reproduction

Run StableDiffusionPipeline as described above.
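Something along these lines reproduces the setup (a sketch, not the exact script; the prompt, model id, and timing code are illustrative):

```python
import time
import torch
from diffusers import StableDiffusionPipeline, EulerDiscreteScheduler

# Illustrative model id; "runwayml/stable-diffusion-v1-5" was used for the SD 1.5 row.
model = "stabilityai/stable-diffusion-2"

pipe = StableDiffusionPipeline.from_pretrained(model, torch_dtype=torch.float16).to("cuda")
pipe.scheduler = EulerDiscreteScheduler.from_config(pipe.scheduler.config)

for enable in (False, True):
    if enable:
        pipe.enable_xformers_memory_efficient_attention()
    else:
        pipe.disable_xformers_memory_efficient_attention()
    start = time.perf_counter()
    # resolution was varied per row in the table above via height/width
    pipe(prompt="a photo of an astronaut riding a horse", num_inference_steps=30, height=512, width=512)
    print(f"xformers {'enabled' if enable else 'disabled'}: {time.perf_counter() - start:.1f}s")
```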

Logs

No response

System Info

  • diffusers-0.9.0
  • transformers-4.24.0
  • xformers-0.0.13
  • Quadro RTX 8000 48GB RAM
  • Linux version 4.14.240

antoche avatar Dec 06 '22 22:12 antoche

Adding that at least with SD 2.1, 768 resolution, and a 3090, enable_xformers_memory_efficient_attention does improve performance, providing a ~2.5x speed increase vs. xformers disabled.

mlmcgoogan avatar Dec 12 '22 05:12 mlmcgoogan

Maybe cc @pcuenca here - we really need (semi-)official wheels for xformers. xformers should surely give a speed boost across the board.

patrickvonplaten avatar Dec 15 '22 20:12 patrickvonplaten

@mlmcgoogan @patrickvonplaten that's been my experience too.

@antoche I've seen similar improvements on 3090 as those reported by @mlmcgoogan, and we have also tested on V100, A100 and others. We don't have the resources to test on every card, but my understanding is that RTX 8000 should also benefit from improvements. I see that your xformers version is 0.0.13, while I've been testing with 0.0.15 using this process: https://github.com/huggingface/diffusers/pull/1724/commits/af10c35b0e4dda96dae572635c140cb88b24e839

Could that be the reason why?

pcuenca avatar Dec 16 '22 12:12 pcuenca

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Jan 09 '23 15:01 github-actions[bot]

Sorry I've been away for a while, picking this up again. I haven't got my hands on xformers-0.0.15 yet, but my latest tests with diffusers-0.14.0 show:

  • "runwayml/stable-diffusion-v1-5":
    • with: 16.61it/s
    • without: 14.13it/s
  • "stabilityai/stable-diffusion-2":
    • with: 3.84it/s
    • without: 5.78it/s
  • "stabilityai/stable-diffusion-2-1":
    • with: 3.86it/s
    • without: 2.96it/s

Measured using this simple snippet:

import torch
from diffusers import StableDiffusionPipeline

# model is one of the three checkpoints listed above
pipe = StableDiffusionPipeline.from_pretrained(model, revision="fp16", torch_dtype=torch.float16).to("cuda")
pipe.enable_xformers_memory_efficient_attention()
pipe(prompt="foo", num_inference_steps=20)
pipe.disable_xformers_memory_efficient_attention()
pipe(prompt="foo", num_inference_steps=20)

It's strange that the effect is negative on 2.0 but positive on 2.1; I thought those had the exact same architecture and parameter count.

I'll try with xformers-0.0.15 asap.

antoche avatar Mar 15 '23 03:03 antoche

I just tested with xformers-0.0.16 and diffusers-0.14.0. Speed is now roughly the same with and without memory-efficient attention on all 3 models.

I am still really puzzled by the fact that SD 2.1 is about 2x slower than SD 2.0, even though both models appear to have the exact same size.

antoche avatar Mar 23 '23 01:03 antoche

Hey @antoche, the reason SD 2.1 is slower is that it has to upcast the attention computation during inference, as otherwise the model will generate NaNs; see: https://github.com/huggingface/diffusers/blob/2ef9bdd76f69dfe7a6c125a3d76222140c685557/src/diffusers/models/attention_processor.py#L234

2.0 doesn't suffer from exploding values, so it doesn't need this hack.

However, this is not necessary when using xformers or when using Torch 2.0; there the speed should be the same.
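Roughly, the upcast amounts to something like this (a simplified sketch of the idea, not the actual diffusers implementation):

```python
import torch

def attention_probs(query, key, scale, upcast_attention=False):
    # Simplified sketch of the fp32 upcast that SD 2.1 needs (not the actual diffusers code).
    dtype = query.dtype  # typically torch.float16 during inference
    if upcast_attention:
        # Compute Q @ K^T in float32 so the attention logits don't overflow fp16 and turn into NaNs.
        query = query.float()
        key = key.float()
    scores = scale * torch.matmul(query, key.transpose(-1, -2))
    probs = scores.softmax(dim=-1)
    # Cast back so the rest of the UNet keeps running in half precision.
    return probs.to(dtype)
```

The extra fp32 matmul and casts are what make the vanilla attention path slower for 2.1; the xformers and Torch 2.0 attention kernels don't need this explicit upcast, which is why the speed is the same there.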

patrickvonplaten avatar Mar 23 '23 13:03 patrickvonplaten

Ok, it sounds like there is still something to investigate, then. Should this ticket be re-opened?

Could someone else confirm the timings with the same versions?

Are there performance tests in the test suite comparing inference speed with and without memory_efficient_attention?

antoche avatar Mar 23 '23 22:03 antoche

Hmm, no, I think this is expected. If you upcast attention to fp32, the attention computation will significantly slow things down. In my experience, when using Torch 2.0, SD 2.0 and SD 2.1 are equally fast.

patrickvonplaten avatar Mar 27 '23 18:03 patrickvonplaten

> Hey @antoche, the reason SD 2.1 is slower is that it has to upcast the attention computation during inference, as otherwise the model will generate NaNs; see: https://github.com/huggingface/diffusers/blob/2ef9bdd76f69dfe7a6c125a3d76222140c685557/src/diffusers/models/attention_processor.py#L234
>
> 2.0 doesn't suffer from exploding values, so it doesn't need this hack.
>
> However, this is not necessary when using xformers or when using Torch 2.0; there the speed should be the same.

Hi @patrickvonplaten, I also ran into the issue that using xformers is slower, with xformers==0.0.26 on the SDXL base model. I don't know how to turn off the hack to bring the benefit back. Would you mind giving some advice?

bigmover avatar Jul 16 '24 07:07 bigmover

Hi, you're posting on a thread that started in 2022 with a really old version of diffusers. Even if you take into account the last comment, it's more than a year old, and Patrick is no longer a maintainer of this repository.

Please post a new issue with your problem and your environment; this discussion doesn't apply to the current versions of diffusers, xformers, or PyTorch.

A quick solution is to just use a torch version of 2.0 or greater and not use xformers. Even if you can't, you should start migrating your code to be able to use it.
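For example, something like this (a sketch; the model id and parameters are just illustrative for the SDXL base model mentioned above). With torch >= 2.0 and a recent diffusers, the pipeline uses PyTorch's scaled dot-product attention by default, so there is nothing to enable:

```python
import torch
from diffusers import StableDiffusionXLPipeline

# No enable_xformers_memory_efficient_attention() call needed: with torch >= 2.0,
# diffusers defaults to PyTorch's scaled dot-product attention.
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

image = pipe(prompt="a photo of an astronaut riding a horse", num_inference_steps=30).images[0]
```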

asomoza avatar Jul 16 '24 09:07 asomoza

> A quick solution is to just use a torch version of 2.0 or greater and not use xformers. Even if you can't, you should start migrating your code to be able to use it.

@asomoza Thank you for your reply. I posted a new issue: https://github.com/huggingface/diffusers/issues/8873. Would you mind giving some advice?

bigmover avatar Jul 16 '24 11:07 bigmover