diffusers RuntimeError: cuDNN Frontend error: [cudnn_frontend] Error: No execution plans support the graph.

Describe the bug

Hello. I tried the Img2Img pipeline and encountered the error shown in the attached screenshot. Could you please check it for me? Thank you.

[Screenshot 2024-10-17 at 11 39 30]

Reproduction

import torch
from diffusers import AutoPipelineForImage2Image
from diffusers.utils import make_image_grid, load_image

pipeline = AutoPipelineForImage2Image.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5/", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
)
pipeline.enable_model_cpu_offload()


url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/img2img-init.png"
init_image = load_image(url)

prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"


image = pipeline(prompt, image=init_image).images[0]
make_image_grid([init_image, image], rows=1, cols=2)

Logs

No response

System Info

diffusers 0.30.3 Python 3.9.20

Who can help?

No response

Oct 17 '24 18:10 alansmithee-cpu

What version of pytorch are you using? It seems like this error comes from the latest changes in pytorch. this, this and this.

Oct 17 '24 19:10 a-r-r-o-w

> What version of pytorch are you using? It seems like this error comes from the latest changes in pytorch. this, this and this.

I'm using torch 2.5.0+cu124

Oct 17 '24 19:10 alansmithee-cpu

Could you try the 2.4.0 stable release and see if the problem persists?

Oct 17 '24 19:10 a-r-r-o-w

> Could you try the 2.4.0 stable release and see if the problem persists?

Now I encountered this error: [Screenshot 2024-10-17 at 12 16 22]

Oct 17 '24 19:10 alansmithee-cpu

If you're running in a notebook, make sure to restart it and please do a clean reinstall of v0.30.3. Auraflow was released in v0.30.0, so this should not lead to any errors. Just to be sure that there are no longer any environment errors, could you paste the output of diffusers-cli env?

Oct 17 '24 19:10 a-r-r-o-w

> If you're running in a notebook, make sure to restart it and please do a clean reinstall of v0.30.3. Auraflow was released in v0.30.0, so this should not lead to any errors. Just to be sure that there are no longer any environment errors, could you paste the output of diffusers-cli env?

Yes, I've reinstalled v0.30.0 (image 1), but have the error in image 2

Oct 17 '24 19:10 alansmithee-cpu

Hello, the problem is now solved, thank you for your time and consideration.

Here are the versions that worked for me: Diffusers v0.30.3, Torch 2.4.0+cu121.

Oct 17 '24 19:10 alansmithee-cpu

I am facing the same error on torch 2.5.0+cu124. The error is preceded by the following warning:

cuDNN SDPA backward got grad_output.strides() != output.strides()

I'm on an H100. I'm guessing this has to do with the new cuDNN SDPA backend introduced in PyTorch 2.5.

Oct 18 '24 10:10 readleyj

Yes, this seems like a problem with torch 2.5.0, and I've been able to reproduce this now as well. We'll need to take a look into how best to fix this (either on our end or we could talk with the pytorch folks) cc @sayakpaul @DN6 @yiyixuxu. Re-opening the issue for now

Oct 18 '24 10:10 a-r-r-o-w

As a workaround, you can disable the cuDNN backend via https://pytorch.org/docs/stable/backends.html#torch.backends.cuda.enable_cudnn_sdp

Would you mind opening an issue on PyTorch with a smallish repro? I can then forward it to the NVIDIA folks.

Oct 18 '24 13:10 drisspg

torch==2.5.0 breaks sdpa functionality used by transformers which is used by diffusers for clip during prompt encoding

transformers/models/clip/modeling_clip.py:491 in forward

490 │   │   # CLIP text model uses both `causal_attention_mask` and `attention_mask` sequentially.
491 │   │   attn_output = torch.nn.functional.scaled_dot_product_attention( ... )

yes, torch.backends.cuda.enable_cudnn_sdp(False) is a workaround, but comes at a massive performance cost.

imo, this should be reported to transformers team as they can implement a workaround much faster than torch releasing a service pack which takes a while. (from what I gather, issue has been caught first back in May in cudnn-frontend package and it's still not assigned)

Oct 18 '24 14:10 vladmandic

> but comes at a massive performance cost.

The performance should be the same as in 2.4.1, since 2.5.0 is the first release with the cuDNN backend enabled.

Can you link the frontend issue?

Oct 18 '24 17:10 drisspg

Seems to be NVIDIA/cudnn-frontend#75 and https://github.com/NVIDIA/cudnn-frontend/issues/78

Oct 18 '24 17:10 readleyj

Having this issue as well but only on linux. no problems with cuda on windows.

Oct 18 '24 18:10 JackismyShephard

> > but comes at a massive performance cost.
>
> The performance should be the same as in 2.4.1 since this is the first release with cuDNN backend enabled.
>
> Can you link the frontend issue

performance degradation is from 6 it/s to 2.5 it/s using sdxl, with everything the same except that one param.

links to issues are already posted below.

Oct 18 '24 19:10 vladmandic

The cuDNN issues linked are generic across any unsupported config and may not correspond to this particular issue. Would it be possible to link a shorter repro as I'm currently trying to clone stable-diffusion-v1-5/stable-diffusion-v1-5/ which seems to be > 10GiB?

Oct 18 '24 22:10 eqy

here's the shortest reproduction, like i said its when transformers uses sdp to process clip:

import torch
from transformers import CLIPTextModel, AutoTokenizer

device = torch.device('cuda')
tokenizer = AutoTokenizer.from_pretrained("openai/clip-vit-base-patch32")
encoder = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32", cache_dir='/mnt/models/huggingface').to(device=device, dtype=torch.float16)

inputs = tokenizer(["a photo of a cat", "a photo of a dog"], padding=True, return_tensors="pt")
inputs = {k: v.to(device) for k, v in inputs.items()}
print(inputs)
outputs = encoder(**inputs)
print(outputs)

  File "/home/vlado/dev/clip/venv/lib/python3.12/site-packages/transformers/models/clip/modeling_clip.py", line 491, in forward
    attn_output = torch.nn.functional.scaled_dot_product_attention(
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: cuDNN Frontend error: [cudnn_frontend] Error: No execution plans support the graph.

btw, i just noticed that there is no issue when using torch.float32. but nobody uses torch.float32 anymore. and yes, this is the same issue as noted here with diffusion models - its when encoding prompt. you can try to use any other clip model as long as underlying processor is the same.

Oct 19 '24 00:10 vladmandic

Thanks, and it not happening with float32 is expected as PyTorch will not dispatch to cuDNN for float32

Oct 19 '24 00:10 eqy

@vladmandic I am not seeing the same error locally with cuDNN 9.3. Which GPU are you on? I will try 9.1.7 in the meantime

Oct 19 '24 00:10 eqy

> @vladmandic I am not seeing the same error locally with cuDNN 9.3. Which GPU are you on? I will try 9.1.7 in the meantime

print(f'torch={torch.__version__} cuda={torch.version.cuda} cuDNN={torch.backends.cudnn.version()} device={torch.cuda.get_device_name(0)} cap={torch.cuda.get_device_capability(0)}')

torch=2.5.0+cu124 cuda=12.4 cuDNN=90100 device=NVIDIA GeForce RTX 4090 cap=(8, 9)

note that cuda and cudnn are ones that come with torch. if torch 2.5 requires newer cudnn, it should handle its installation. this is simple pip install torch transformers in a clean venv and without any extra flags.

Oct 19 '24 00:10 vladmandic

Yes, 9.1.0.70 is the cuDNN version that comes with the torch binary, and I didn't see the failure on L40, L4, or RTX 6000 Ada, which are also sm89 (it is able to generate and run a kernel).

I'm thinking that maybe the issue is the CUDA version, will also try that later.

Oct 19 '24 01:10 eqy

Even a clean environment didn't help me. I had to install torch=2.4.0 to get rid of the issue.

Oct 19 '24 08:10 soumendukrg

Hmm. How much speedup does one get when using CLIP in SDPA? I remember when we incorporated SDPA in CLIP the speedup wasn't that significant.

We could verify this by instantiating the CLIP with:

text_encoder = CLIPTextModel.from_pretrained(..., attn_implementation="eager", ...)
pipeline = DiffusionPipeline.from_pretrained(..., text_encoder=text_encoder)

Cc: @ArthurZucker

Oct 19 '24 09:10 sayakpaul

> Hmm. How much speedup does one get when using CLIP in SDPA? I remember when we incorporated SDPA in CLIP the speedup wasn't that significant.

i tried using torch==2.4.1 with default sdp and with eager and for 1,000 iterations i'm getting 4.72s vs 7.59s, so pretty significant impact at 60% slower. good thing is that encoding only happens once so overall performance hit would hardly be seen. but in how many places would this need to be touched?
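For anyone who wants to try a comparison of this kind without downloading CLIP, here is a rough sketch that times an explicit "eager" attention against `torch.nn.functional.scaled_dot_product_attention` on random tensors. The shapes, iteration count, and the helper names `eager_attention` and `time_fn` are illustrative assumptions, not what transformers does internally, and the sketch omits the causal and padding masks that the CLIP text model applies. Timings will vary by device and are not expected to match the numbers above:

```python
import math
import time

import torch
import torch.nn.functional as F


def eager_attention(q, k, v):
    # Explicit softmax(Q K^T / sqrt(d)) V, materializing the full score matrix.
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    return torch.softmax(scores, dim=-1) @ v


def time_fn(fn, *args, iters=100):
    # Wall-clock seconds for `iters` calls; synchronize so GPU work is counted.
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        fn(*args)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return time.perf_counter() - start


torch.manual_seed(0)
device = "cuda" if torch.cuda.is_available() else "cpu"
# Illustrative shapes only: (batch, heads, tokens, head_dim).
q, k, v = (torch.randn(2, 8, 77, 64, device=device) for _ in range(3))

# Both paths should compute the same result up to floating-point tolerance.
same = torch.allclose(
    eager_attention(q, k, v),
    F.scaled_dot_product_attention(q, k, v),
    atol=1e-4,
)
print("outputs match:", same)
print("eager seconds:", time_fn(eager_attention, q, k, v))
print("sdpa seconds: ", time_fn(F.scaled_dot_product_attention, q, k, v))
```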

Oct 19 '24 15:10 vladmandic

> but in how many places would this need to be touched?

You mean changing the CLIP (and potentially other models from transformers we rely on in diffusers) to use "eager" as attn_implementation?

I guess we have a couple of ways but I think we could pass this info to load_method here: https://github.com/huggingface/diffusers/blob/5d3e7bdaaadfdcf5781e0665b952d1520e84c310/src/diffusers/pipelines/pipeline_loading_utils.py#L700

Something like (pseudo-code):

if is_transformers_model:
    if is_transformers_version(...):
        if is_torch_version(">=", "2.5"):
            loading_kwargs.update({"attn_implementation": "eager"})
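To make the gating idea concrete, here is a self-contained sketch of the torch-version check only. `parse_release` and `attn_override` are hypothetical helpers written for illustration; they are not part of diffusers, which has its own version utilities, and the transformers-version condition from the pseudo-code is left out because its bound is not specified:

```python
import re


def parse_release(version_string: str) -> tuple:
    # "2.5.0+cu124" -> (2, 5, 0); the local build tag after "+" is ignored.
    public = version_string.split("+", 1)[0]
    return tuple(int(part) for part in re.findall(r"\d+", public)[:3])


def attn_override(torch_version: str) -> dict:
    # Hypothetical helper: extra loading kwargs for transformers models
    # when running on torch 2.5 or newer.
    if parse_release(torch_version) >= (2, 5):
        return {"attn_implementation": "eager"}
    return {}


loading_kwargs = {}
loading_kwargs.update(attn_override("2.5.0+cu124"))
print(loading_kwargs)
print(attn_override("2.4.1+cu121"))
```

In real code the version string would come from `torch.__version__` rather than a literal.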

@DN6 WDYT? Or maybe @ArthurZucker from transformers has a better idea.

Oct 19 '24 16:10 sayakpaul

@vladmandic does your output look similar to this? I was able to run on the 2.5.0 binary on RTX 6000 (Ada)

import torch
from transformers import CLIPTextModel, AutoTokenizer

print(f"cuda: {torch.version.cuda} cudnn: {torch.backends.cudnn.version()} compute capability: {torch.cuda.get_device_capability()}")

device = torch.device('cuda')
tokenizer = AutoTokenizer.from_pretrained("openai/clip-vit-base-patch32")
encoder = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32", cache_dir='/mnt/models/huggingface').to(device=device, dtype=torch.float16)

inputs = tokenizer(["a photo of a cat", "a photo of a dog"], padding=True, return_tensors="pt")
inputs = {k: v.to(device) for k, v in inputs.items()}
print(inputs)
outputs = encoder(**inputs)
print(outputs)

cuda: 12.4 cudnn: 90100 compute capability: (8, 9)
{'input_ids': tensor([[49406,   320,  1125,   539,   320,  2368, 49407],
        [49406,   320,  1125,   539,   320,  1929, 49407]], device='cuda:0'), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1]], device='cuda:0')}
BaseModelOutputWithPooling(last_hidden_state=tensor([[[ 0.3391,  0.1165,  0.1020,  ...,  0.2469,  0.5903,  0.1014],
         [ 1.9775, -0.5840,  0.3699,  ...,  1.1670,  0.8047, -0.9795],
         [ 1.0586, -0.9580,  1.0039,  ..., -0.5151, -0.1436, -1.9443],
         ...,
         [ 0.3076, -1.4961, -0.4001,  ..., -0.0224,  0.9111, -0.3879],
         [ 1.0117, -0.6704,  1.7734,  ..., -0.1541, -0.0244, -1.5059],
         [-0.5151,  0.1665,  0.8887,  ..., -0.0677, -0.4563, -1.7959]],

        [[ 0.3391,  0.1165,  0.1020,  ...,  0.2469,  0.5903,  0.1014],
         [ 1.9775, -0.5840,  0.3699,  ...,  1.1670,  0.8047, -0.9795],
         [ 1.0586, -0.9580,  1.0039,  ..., -0.5151, -0.1436, -1.9443],
         ...,
         [ 0.3076, -1.4961, -0.4001,  ..., -0.0224,  0.9111, -0.3879],
         [-0.1440, -0.5166,  1.7109,  ..., -0.0795,  0.3611, -1.2441],
         [ 0.0415,  0.0185,  1.2754,  ..., -0.4209, -0.4387, -1.3018]]],
       device='cuda:0', dtype=torch.float16, grad_fn=<NativeLayerNormBackward0>), pooler_output=tensor([[-0.5151,  0.1665,  0.8887,  ..., -0.0677, -0.4563, -1.7959],
        [ 0.0415,  0.0185,  1.2754,  ..., -0.4209, -0.4387, -1.3018]],
       device='cuda:0', dtype=torch.float16, grad_fn=<IndexBackward0>), hidden_states=None, attentions=None)

Oct 21 '24 18:10 eqy

i have a 3090, i can't run some of my lora trainers because of this issue. I am honestly wondering whether the pytorch people really give a damn about us users who are affected instead of just ignoring this?

Oct 21 '24 18:10 CodeAlexx

@CodeAlexx do you have a repro?

Oct 21 '24 19:10 eqy

> @vladmandic does your output look similar to this?

that is the correct output when using either torch==2.4.1 or when disabling cudnn for sdpa on torch==2.5.0. with cudnn-for-sdpa enabled, i'm getting the error as previously noted. i can run any test/debug you want, just lmk.

Oct 21 '24 19:10 vladmandic

@vladmandic thanks for volunteering! Could you send the result of CUDNN_FRONTEND_LOG_FILE=frontendlog.txt CUDNN_FRONTEND_LOG_INFO=1 CUDNN_LOGLEVEL_DBG=3 CUDNN_LOGDEST_DBG=backendlog.txt python3 yourscript.py?

Oct 21 '24 19:10 eqy