
Up to 2x speedup on GPUs using memory efficient attention

Open MatthieuToulemont opened this issue 1 year ago • 67 comments

Why?

While stable diffusion has democratized access to text-to-image generative models, it can still take a relatively long time to generate an image on consumer GPUs. The GPU memory requirements also hinder the use of diffusion models on small GPUs.

How?

Recent work on optimizing the bandwidth in the attention block has produced huge speedups and gains in GPU memory usage, the most recent being Flash Attention (from @tridao, code, paper).

In this PR we use the MemoryEfficientAttention implementation from xformers (cc. @fmassa, @danthe3rd, @blefaudeux) to both speed up cross-attention and decrease its GPU memory requirements.
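
For illustration, here is a simplified sketch of what this dispatch looks like (not the actual diffusers code; the helper below and its signature are assumptions):

import torch
import xformers.ops

def attention(q, k, v, use_xformers: bool):
    # q, k, v: [batch * heads, seq_len, head_dim], heads folded into the batch
    if use_xformers:
        # Tiled kernel: never materializes the full [seq_len, seq_len] attention matrix
        return xformers.ops.memory_efficient_attention(q, k, v, attn_bias=None)
    # Baseline: builds the full attention matrix in GPU memory
    scale = q.shape[-1] ** -0.5
    attn = (q @ k.transpose(-1, -2)) * scale
    return attn.softmax(dim=-1) @ v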

The memory efficient attention can be activated by setting the environment variable USE_MEMORY_EFFICIENT_ATTENTION=1 and installing the xformers library:

pip install git+https://github.com/facebookresearch/xformers@51dd119#egg=xformers

This installation is a known pain point; there are two ways to improve it:

  • xformers ships wheels
  • this dependency could be made optional in this repository

Thank you @tridao, @fmassa, @danthe3rd for the work on Flash Attention and its integration in xformers. Would it be possible to add a more optimised kernel for head-dim=40, which is the head dimension used in stable diffusion? @blefaudeux and I would be happy to contribute :)

Note:

  • Masked cross-attention is not supported yet with this implementation but could be added in future work.
  • This PR does not solve the xformers dependency issue just yet; any help is appreciated.

Speedups on various GPUs with a 512x512 shape, running FP16:

GPU              | Base Attention FP16 | Memory Efficient Attention FP16
NVIDIA Tesla T4  | 3.5 it/s            | 5.5 it/s
NVIDIA 3060 RTX  | 4.6 it/s            | 7.8 it/s
NVIDIA A10G      | 8.88 it/s           | 15.6 it/s
NVIDIA RTX A6000 | 11.7 it/s           | 21.09 it/s
A100-SXM-80GB    | 18.7 it/s           | 27.5 it/s

How to test:

I use the following setup:

sudo docker run -it --gpus=all --ipc=host -v /home:/home nvcr.io/nvidia/pytorch:22.08-py3 bash

# Then 
pip install git+https://github.com/facebookresearch/xformers@51dd119#egg=xformers
pip install transformers ftfy scipy

# Followed by
cd PATH_TO_DIFFUSER_FOLDER
git checkout memory_efficient_attention
pip install -e . 

Then create a Python file (mine is named test.py) with the following code:

import torch
from diffusers import StableDiffusionPipeline


pipe = StableDiffusionPipeline.from_pretrained(
   "CompVis/stable-diffusion-v1-4", revision="fp16", torch_dtype=torch.float16, use_auth_token=True
).to("cuda")

with torch.inference_mode(), torch.autocast("cuda"):
   image = pipe("a small cat")

Then run the following in the aforementioned Docker container:

# Test without Memory Efficient Attention: 
python test.py

# Test with Memory Efficient Attention: 
USE_MEMORY_EFFICIENT_ATTENTION=1 python test.py
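
The it/s figures above are read off the pipeline's progress bar; to time a run explicitly, here is a rough sketch (the warm-up call and explicit step count are assumptions, reusing the same pipeline setup as test.py):

import time

import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", revision="fp16", torch_dtype=torch.float16, use_auth_token=True
).to("cuda")

num_steps = 50
with torch.inference_mode(), torch.autocast("cuda"):
    pipe("a small cat", num_inference_steps=num_steps)  # warm-up run
    torch.cuda.synchronize()
    start = time.time()
    pipe("a small cat", num_inference_steps=num_steps)
    torch.cuda.synchronize()
    print(f"{num_steps / (time.time() - start):.2f} it/s")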

MatthieuToulemont avatar Sep 16 '22 14:09 MatthieuToulemont

Hi @MatthieuTPHR ,

Nice PR!

On xFormers side, we are working on improving the packaging so that it can be more easily installed by users, while shipping the pre-compiled binaries as well.

We are also continuing to optimize the kernel for some configurations, we will keep the K=40 in mind for the future.

fmassa avatar Sep 16 '22 15:09 fmassa

And about more optimized kernels for K=40: @danthe3rd has been looking very closely at further optimizations and has some ideas for optimizing our current kernels for smaller K. I'll let him chime in, but contributions are more than welcome!

fmassa avatar Sep 16 '22 15:09 fmassa

I will put this PR in draft until the dependency issues are solved.

MatthieuToulemont avatar Sep 16 '22 15:09 MatthieuToulemont

Hey @MatthieuTPHR,

Thanks a lot for opening the PR - it looks very cool! Trying it out now :-)

Generally we're quite careful about not adding new dependencies to diffusers, but I think we might be able to make it a soft dependency if the speed-up is big enough!

patrickvonplaten avatar Sep 16 '22 15:09 patrickvonplaten

Hi @MatthieuTPHR - this looks like a great improvement!

Would it be possible to add a more optimised kernel for head-dim=40, which is the head dimension used in stable diffusion? @blefaudeux and I would be happy to contribute :)

We've been improving the forward (including fairly recently, in https://github.com/facebookresearch/xformers/pull/388 for instance). Do you mind sharing the other parameters you use (datatype, sequence length, number of heads) - so we can add them to our benchmarks?

danthe3rd avatar Sep 16 '22 15:09 danthe3rd

I've tried running the code in this PR, but I'm getting the following error:

AttributeError: module 'triton.language' has no attribute 'constexpr'

when installing xformers with pip install git+https://github.com/facebookresearch/xformers@51dd119#egg=xformers

Do I need a specific version of triton? cc @MatthieuTPHR

patrickvonplaten avatar Sep 16 '22 15:09 patrickvonplaten

Hi @MatthieuTPHR - this looks like a great improvement!

Would it be possible to add a more optimised kernel for head-dim=40, which is the head dimension used in stable diffusion? @blefaudeux and I would be happy to contribute :)

We've been improving the forward (including fairly recently, in facebookresearch/xformers#388 for instance). Do you mind sharing the other parameters you use (datatype, sequence length, number of heads) - so we can add them to our benchmarks?

Typically with stable diffusion, 512x512 yields a 64x64 latent space -> 4096 tokens (but higher res would be even better). After that, the embedding is 320 over 8 heads, which yields a head dim of 40. For inference the batch is typically 2 (conditioned and unconditioned diffusion), but it can depend on the method. So [16, 4096, 40] is a good baseline, and [16, 16384, 40] is a bonus (1024 rendering).

edit: folded the number of heads into the batch to better give a sense of the tensor sizes in practice
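
A minimal timing sketch for these shapes (assuming xformers was built with its CUDA kernels; the naive baseline below is only there for comparison):

import torch
import xformers.ops

def naive_attention(q, k, v):
    # Materializes the full [seq_len, seq_len] attention matrix
    attn = (q @ k.transpose(-1, -2)) * q.shape[-1] ** -0.5
    return attn.softmax(dim=-1) @ v

def bench(fn, *args, iters=20):
    # Average milliseconds per call, measured with CUDA events
    start, end = torch.cuda.Event(enable_timing=True), torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        fn(*args)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

# [batch * heads, seq_len, head_dim] = [16, 4096, 40], fp16
q, k, v = (torch.randn(16, 4096, 40, device="cuda", dtype=torch.float16) for _ in range(3))

print(f"naive:    {bench(naive_attention, q, k, v):.2f} ms")
print(f"xformers: {bench(xformers.ops.memory_efficient_attention, q, k, v):.2f} ms")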

blefaudeux avatar Sep 16 '22 15:09 blefaudeux

Hi @MatthieuTPHR - this looks like a great improvement!

Would it be possible to add a more optimised kernel for head-dim=40, which is the head dimension used in stable diffusion? @blefaudeux and I would be happy to contribute :)

We've been improving the forward (including fairly recently, in facebookresearch/xformers#388 for instance). Do you mind sharing the other parameters you use (datatype, sequence length, number of heads) - so we can add them to our benchmarks?

typically with stable diffusion 512x512 yields a 64x64 latent space -> 4096 tokens (but higher res would be even better). After that the embedding is 320 over 8 heads, which yields a head dim of 40. For inference batch is typically 2 (conditioned and unconditioned diffusion), but can depend on the methods. So [2, 4096, 40] is a good baseline, [2, 16384, 40] is a bonus (1024 rendering)

I use triton==2.0.0.dev20220701.

For 1024x1024 on the A6000 I get 4 iterations per second.

MatthieuToulemont avatar Sep 16 '22 15:09 MatthieuToulemont

I've tried running the code in this PR, but I'm getting the following error:

AttributeError: module 'triton.language' has no attribute 'constexpr'

when installing xformers with pip install git+https://github.com/facebookresearch/xformers@51dd119#egg=xformers

Do I need a specific version of triton? cc @MatthieuTPHR

Specifically, triton==2.0.0.dev20220701 will work; it needs to be updated in xformers but that's WIP (the newer versions break a couple of kernels).

edit: but having no triton installed at all should also work, actually

blefaudeux avatar Sep 16 '22 15:09 blefaudeux

I'm testing on an NVIDIA TITAN RTX on this branch with the following package dependencies:

- CUDA Version: 11.6
- torch: 1.12.1+cu102
- xformers: 0.0.13.dev (installed with `pip install git+https://github.com/facebookresearch/xformers@51dd119#egg=xformers`)
- triton: 2.0.0 (installed with `pip install triton==2.0.0.dev20220701`)

Note:

When I import xformers I'm getting:

libtorch_cuda_cu.so: cannot open shared object file: No such file or directory
WARNING:root:WARNING: libtorch_cuda_cu.so: cannot open shared object file: No such file or directory
Need to compile C++ extensions to get sparse attention suport. Please run python setup.py build develop
0.0.13.dev

With this setup I'm running:

from diffusers import StableDiffusionPipeline
import numpy as np
import torch

model_id = "CompVis/stable-diffusion-v1-4"
pipe = StableDiffusionPipeline.from_pretrained(
    model_id,
    use_auth_token=True,
)
pipe.to("cuda")

prompt = "A fantasy landscape, trending on artstation"
generator = torch.Generator(device="cuda").manual_seed(0)

with torch.autocast("cuda"):
    output = pipe(prompt=prompt, guidance_scale=7.5, generator=generator, output_type="np")

print(np.sum(np.abs(output.images[:3, :3, :3, :3])))
mem_bytes = torch.cuda.max_memory_allocated()
print(mem_bytes)
  1. Without xformers, simply on the current main branch. Without xformers I'm getting 9.15 iterations/sec.
  2. I'm running on this branch with:

export USE_MEMORY_EFFICIENT_ATTENTION=1

And I'm getting the exact same speed.

Any ideas what could be the problem here?

patrickvonplaten avatar Sep 16 '22 16:09 patrickvonplaten

Hi @patrickvonplaten

It looks like the CUDA extensions were not compiled when installing xformers.

can you try doing

import torch, xformers.ops

print(torch.ops.xformers.efficient_attention_forward_generic)

and see if it prints something like <OpOverloadPacket(op='xformers.efficient_attention_forward_generic')>?

If that doesn't print what I mentioned, there are a few possible reasons why this isn't being compiled. A few questions:

  • do you have CUDA set up on the machine where you installed xformers?
  • do you have nvcc on the machine you used to install xformers?
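
Putting those checks together, a small sanity-check sketch (the version and CUDA prints are additions for convenience):

import torch
import xformers
import xformers.ops

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("xformers:", xformers.__version__)

try:
    # Registered only if the CUDA extension was actually built during install
    print(torch.ops.xformers.efficient_attention_forward_generic)
except AttributeError:
    print("xformers was installed without its CUDA kernels; reinstall on a machine with CUDA and nvcc available")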

fmassa avatar Sep 16 '22 16:09 fmassa

Hi @patrickvonplaten ,

Here is my full setup:

sudo docker run -it --gpus=all --ipc=host -v /home:/home nvcr.io/nvidia/pytorch:22.08-py3 bash

# Then 
pip install git+https://github.com/facebookresearch/xformers@51dd119#egg=xformers
pip install transformers ftfy scipy

# Followed by
cd PATH_TO_DIFFUSER_FOLDER
git checkout memory_efficient_attention
pip install -e . 

Once this is done I see the following speedup: 10 iterations per second without and 21 with, both at 512x512 in fp16.

MatthieuToulemont avatar Sep 16 '22 16:09 MatthieuToulemont

Thank you @MatthieuTPHR, super excited to see ideas on fast & memory-efficient attention having an impact!

tridao avatar Sep 16 '22 17:09 tridao

Hey @fmassa,

When running:

import torch, xformers.ops

print(torch.ops.xformers.efficient_attention_forward_generic)

I'm getting:

'_OpNamespace' object has no attribute 'efficient_attention_forward_generic'

There already seems to be a problem I guess?

It would be really nice if we could somehow show the community that it's easy to install and use :-)

patrickvonplaten avatar Sep 16 '22 17:09 patrickvonplaten

@patrickvonplaten this means that indeed xformers was compiled without the CUDA extensions.

It would be really nice if we could somehow show the community that it's easy to install and use :-)

Yes, I totally agree, and we are working on that :-)

If you are not compiling xformers on a machine with CUDA (i.e., if you pip install on a machine without GPUs and then ssh into a machine with GPUs), you could try using FORCE_CUDA=1 pip install git+https://github.com/facebookresearch/xformers@51dd119#egg=xformers to try and compile xformers. Compiling its CUDA extensions should take a while, so it might be useful to pass --verbose to pip to see what is being done.

fmassa avatar Sep 16 '22 17:09 fmassa

Hi @MatthieuTPHR - this looks like a great improvement!

Would it be possible to add a more optimised kernel for head-dim=40, which is the head dimension used in stable diffusion? @blefaudeux and I would be happy to contribute :)

We've been improving the forward (including fairly recently, in facebookresearch/xformers#388 for instance). Do you mind sharing the other parameters you use (datatype, sequence length, number of heads) - so we can add them to our benchmarks?

We are using the default parameters from the CompVis repo; I believe they are as follows:

  • sequence_length for a 512x512 input: q_len = 64x64 = 4096
  • number of heads = 8
  • datatype: by default the HF repo allows for FP32, FP16, BF16

The sequence length could also be higher if we use a 1024x1024 or 2048x2048 input. The downscale factor between the input and the latent space is 8.
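
As a quick check, the latent sequence length follows directly from the resolution and the downscale factor (a small helper for illustration, not part of the pipeline):

def latent_seq_len(height: int, width: int, downscale: int = 8) -> int:
    # Number of tokens seen by self-attention at full latent resolution
    return (height // downscale) * (width // downscale)

print(latent_seq_len(512, 512))    # 4096
print(latent_seq_len(1024, 1024))  # 16384
print(latent_seq_len(2048, 2048))  # 65536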

MatthieuToulemont avatar Sep 16 '22 20:09 MatthieuToulemont

Does it require a GPU with tensor cores (RTX 20 series and above)? I'm getting: WARNING:root:Blocksparse is not available: the current GPU does not expose Tensor cores

then:

Could not run 'xformers::efficient_attention_forward_generic' with arguments from the 'CUDA' backend. This could be because the operator doesn't exist for this backend, or was omitted during the selective/custom build process (if using custom build)

TheLastBen avatar Sep 17 '22 16:09 TheLastBen

@TheLastBen what is your GPU model? xformers supports architectures above sm60 (P100+) - and possibly above sm50 (untested). The most important speedups are achieved on GPUs with tensor cores (sm70+ aka V100 and later), but it's not a requirement

danthe3rd avatar Sep 17 '22 17:09 danthe3rd

@danthe3rd I have a GTX 1070 Ti; the message is coming from Triton, so I don't think it's the main cause of the crash.

TheLastBen avatar Sep 17 '22 18:09 TheLastBen

Does it require a GPU with tensor cores (RTX 20 series and above)? I'm getting: WARNING:root:Blocksparse is not available: the current GPU does not expose Tensor cores

then:

Could not run 'xformers::efficient_attention_forward_generic' with arguments from the 'CUDA' backend. This could be because the operator doesn't exist for this backend, or was omitted during the selective/custom build process (if using custom build)

these are two different topics:

  • the first message is a warning that blocksparse attention will not be available; that's fine for you because you're not targeting this use case anyway
  • the second message is your issue: during the xformers setup the CUDA kernels could not be built, so they are not found at runtime. You can refer to this message for more info around that; I would recommend a conda env or a Docker image where all the CUDA and torch components align (hard to get right outside of that)

blefaudeux avatar Sep 17 '22 20:09 blefaudeux

@blefaudeux I'm using Linux in WSL2; the problem might be related to the versions of torch and torchvision.

TheLastBen avatar Sep 17 '22 20:09 TheLastBen

@blefaudeux I'm using Linux in WSL2; the problem might be related to the versions of torch and torchvision.

Really sorry about that.. are you able to use conda there? conda install pytorch torchvision torchaudio cudatoolkit=11.6 -c pytorch -c conda-forge should make sure that almost everything matches, unless there are driver mismatches (I'm not sure how that works with WSL). Otherwise maybe Docker: if you start with an NVIDIA Docker image as suggested by @MatthieuTPHR, this should run even on Windows, I guess.

blefaudeux avatar Sep 17 '22 20:09 blefaudeux

I would advise using the following image: nvcr.io/nvidia/pytorch:22.08-py3 as it comes with torch and torchvision installed.

MatthieuToulemont avatar Sep 17 '22 20:09 MatthieuToulemont

Tested on a GTX 1070 Ti: without memory efficient cross attention at 512x512: 1.78 it/s

With memory efficient cross attention at 512x512: 2.34 it/s

+31% increase in speed

With optimized SD + Doggettx at 512x512: 2.0 it/s (+12% increase in speed)

Great work!

TheLastBen avatar Sep 18 '22 18:09 TheLastBen

Tested on a GTX 1070 Ti: without memory efficient cross attention at 512x512: 1.78 it/s

With memory efficient cross attention at 512x512: 2.34 it/s

+31% increase in speed

With optimized SD + Doggettx at 512x512: 2.0 it/s (+12% increase in speed)

With only the Doggettx modification, the speed isn't affected: 2.34 it/s

Great work!

I asked on the web repo as well, but with an "old" GPU the memory use is possibly the biggest benefit for you; I don't know if you checked that, but the improvement should be pretty significant (so it becomes possible to do higher res). This PR could also be improved a bit because there are buffer reorderings which can be a little costly, but that will probably not move the needle much.
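
To quantify the memory side, one can compare the peak allocation with and without the flag (a sketch along the lines of the scripts earlier in the thread; the 768x768 resolution is just an example):

import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", revision="fp16", torch_dtype=torch.float16, use_auth_token=True
).to("cuda")

torch.cuda.reset_peak_memory_stats()
with torch.inference_mode(), torch.autocast("cuda"):
    pipe("a fantasy landscape", height=768, width=768)
# Run once with USE_MEMORY_EFFICIENT_ATTENTION=1 and once without to compare
print(f"peak GPU memory: {torch.cuda.max_memory_allocated() / 2**30:.2f} GiB")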

blefaudeux avatar Sep 18 '22 19:09 blefaudeux

Great work! @MatthieuTPHR I was able to get a +60% speedup on an A40 on the UNet. But this seems to break torch.jit.trace, I'm getting this error:

RuntimeError: unsupported output type: int, from operator: xformers::efficient_attention_forward_generic

Is there a quick workaround?

Dango233 avatar Sep 19 '22 13:09 Dango233

Great work! @MatthieuTPHR I was able to get a +60% speedup on an A40 on the UNet. But this seems to break torch.jit.trace, I'm getting this error:

RuntimeError: unsupported output type: int, from operator: xformers::efficient_attention_forward_generic

Is there a quick workaround?

It seems like there is currently no torch.jit support; I am investigating that at the moment.

MatthieuToulemont avatar Sep 19 '22 13:09 MatthieuToulemont

@MatthieuTPHR adding torchscript support is possible, although we would need to write the torch.autograd.Function directly in C++ (we've done that in the past, but we didn't have the need for this in xformers so far).

fmassa avatar Sep 19 '22 14:09 fmassa

Understood, if we want gradients computed. For forward-pass-only JIT, would fixing the output type of the op work? The int output of the op is where JIT breaks.



Dango233 avatar Sep 19 '22 16:09 Dango233

I believe the ints might be the random seed in case of dropout (which we don't use anyway). That's something we should be able to fix. Let's move the discussion to an issue in xformers tho: https://github.com/facebookresearch/xformers/issues

danthe3rd avatar Sep 19 '22 16:09 danthe3rd