Error when enabling xformers and calling loss.backward()
Describe the bug
I get the error below when I enable xformers on the UNet and try to run the backward pass:
RuntimeError: p.gQ_strideM() == grad_q.stride(1) INTERNAL ASSERT FAILED
The full traceback is in the Logs section below.
Reproduction
import torch
import torch.nn.functional as F
from diffusers import AutoencoderKL, DDPMScheduler, UNet2DConditionModel
from transformers import CLIPTextModel, CLIPTokenizer
pretrained_model_name_or_path = r'F:\diffusers-weight'
# Load models and create wrapper for stable diffusion
tokenizer = CLIPTokenizer.from_pretrained(pretrained_model_name_or_path, subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained(pretrained_model_name_or_path, subfolder="text_encoder")
vae = AutoencoderKL.from_pretrained(pretrained_model_name_or_path, subfolder="vae")
unet = UNet2DConditionModel.from_pretrained(pretrained_model_name_or_path, subfolder="unet")
noise_scheduler = DDPMScheduler.from_config(pretrained_model_name_or_path, subfolder="scheduler")
# Freeze vae and text_encoder
vae.requires_grad_(False)
text_encoder.requires_grad_(False)
weight_dtype = torch.bfloat16
# Move text_encode and vae to gpu.
# For mixed precision training we cast the text_encoder and vae weights to half-precision
# as these models are only used for inference, keeping weights in full precision is not required.
text_encoder.to('cuda', dtype=weight_dtype)
vae.to('cuda', dtype=weight_dtype)
unet.to('cuda', dtype=weight_dtype)
unet.set_use_memory_efficient_attention_xformers(True)
# Convert images to latent space
images = torch.randn(1,3,512,512).to('cuda', dtype=weight_dtype)
latents = vae.encode(images).latent_dist.sample()
latents = latents * 0.18215
# Sample noise that we'll add to the latents
noise = torch.randn_like(latents)
bsz = latents.shape[0]
# Sample a random timestep for each image
timesteps = torch.randint(0, noise_scheduler.num_train_timesteps, (bsz,), device=latents.device)
timesteps = timesteps.long()
# Add noise to the latents according to the noise magnitude at each timestep
# (this is the forward diffusion process)
noisy_latents = noise_scheduler.add_noise(latents, noise, timesteps)
# Get the text embedding for conditioning
inputs = tokenizer('Terwt dsfs gsdgs sg', max_length=tokenizer.model_max_length, padding="do_not_pad", truncation=True)
input_ids = [inputs["input_ids"]]
padded_tokens = tokenizer.pad({"input_ids": input_ids}, padding=True, return_tensors="pt")
input_ids = padded_tokens.input_ids.to('cuda', dtype=torch.int)
encoder_hidden_states = text_encoder(input_ids)[0]
# Predict the noise residual and compute loss
noise_pred = unet(noisy_latents, timesteps, encoder_hidden_states).sample
loss = F.mse_loss(noise_pred.float(), noise.float(), reduction="mean")
loss.backward()
Logs
Traceback (most recent call last):
File "f:/diffusers-test/vae_expr.py", line 66, in <module>
loss.backward()
File "C:\Users\uuu\.virtualenvs\stable-diffusion\lib\site-packages\torch\_tensor.py", line 396, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "C:\Users\uuu\.virtualenvs\stable-diffusion\lib\site-packages\torch\autograd\__init__.py", line 175, in backward
allow_unreachable=True, accumulate_grad=True) # Calls into the C++ engine to run the backward pass
File "C:\Users\uuu\.virtualenvs\stable-diffusion\lib\site-packages\torch\autograd\function.py", line 253, in apply
return user_fn(self, *args)
File "f:\xformers\xformers\ops\memory_efficient_attention.py", line 414, in backward
causal=ctx.causal,
File "C:\Users\uuu\.virtualenvs\stable-diffusion\lib\site-packages\torch\_ops.py", line 143, in __call__
return self._op(*args, **kwargs or {})
RuntimeError: p.gQ_strideM() == grad_q.stride(1) INTERNAL ASSERT FAILED at "F:\\xformers\\xformers\\components\\attention\\csrc\\cuda\\mem_eff_attention\\attention_backward_generic.cu":181, please report a bug to PyTorch.
System Info
- diffusers version: 0.7.2
- Platform: Windows-10-10.0.19041-SP0
- Python version: 3.7.7
- PyTorch version (GPU?): 1.12.0+cu113 (True)
- Huggingface_hub version: 0.10.1
- Transformers version: 4.24.0
- Using GPU in script?: Yes
- Using distributed or parallel set-up in script?: No
- xformers version: commit efdca026381a13319be082f079c60275cc871301 (https://github.com/facebookresearch/xformers/commit/efdca026381a13319be082f079c60275cc871301)
Uff, this looks like a bug with the xformers installation. We're currently working on making xformers easier to install - see: https://github.com/facebookresearch/xformers/pull/523
cc @patil-suraj @NouamaneTazi
I built xformers with NVCC and Visual C++ 2019, but I also tested on Linux with xformers built with NVCC and GCC and hit the same error.
Could you maybe open an issue on the xformers repo instead? https://github.com/facebookresearch/xformers/pull/523
Let me ask them. A simpler test case that reproduces the same bug would help, though - do you have any suggestions?
This indeed seems like a build issue. We've been using xformers extensively for training and haven't seen this error; I can't reproduce it on my end.
I was able to reproduce the issue when using bfloat16; it seems bf16 is not supported for the backward pass. It should work fine with fp32 and fp16.
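If it helps, here is a minimal way to probe which dtypes support the memory-efficient attention backward on a given build, isolated from diffusers (just a sketch; the shapes mirror the dispatch error reported above):
import torch
import xformers.ops

# Sketch: probe the memory-efficient attention backward for each dtype.
# Shapes taken from the dispatch error above (batch=8, q_len=kv_len=1024, head dim k=80).
for dtype in (torch.float16, torch.bfloat16, torch.float32):
    q = torch.randn(8, 1024, 80, device="cuda", dtype=dtype, requires_grad=True)
    k = torch.randn_like(q)
    v = torch.randn_like(q)
    try:
        out = xformers.ops.memory_efficient_attention(q, k, v, attn_bias=None)
        out.sum().backward()
        print(dtype, "backward OK")
    except Exception as exc:  # e.g. NotImplementedError or the INTERNAL ASSERT above
        print(dtype, "failed:", exc)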
Thanks for reproducing it, but no luck for me. I tried different data types: I get the same error with float16 and a new error with float32:
noise_pred = unet(noisy_latents, timesteps, encoder_hidden_states).sample
File "C:\Users\uuu\.virtualenvs\stable-diffusion\lib\site-packages\torch\nn\modules\module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "C:\Users\uuu\.virtualenvs\stable-diffusion\lib\site-packages\diffusers\models\unet_2d_condition.py", line 310, in forward
encoder_hidden_states=encoder_hidden_states,
File "C:\Users\uuu\.virtualenvs\stable-diffusion\lib\site-packages\torch\nn\modules\module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "C:\Users\uuu\.virtualenvs\stable-diffusion\lib\site-packages\diffusers\models\unet_2d_blocks.py", line 598, in forward
hidden_states = attn(hidden_states, encoder_hidden_states=encoder_hidden_states).sample
File "C:\Users\uuu\.virtualenvs\stable-diffusion\lib\site-packages\torch\nn\modules\module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "C:\Users\uuu\.virtualenvs\stable-diffusion\lib\site-packages\diffusers\models\attention.py", line 202, in forward
hidden_states = block(hidden_states, context=encoder_hidden_states, timestep=timestep)
File "C:\Users\uuu\.virtualenvs\stable-diffusion\lib\site-packages\torch\nn\modules\module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "C:\Users\uuu\.virtualenvs\stable-diffusion\lib\site-packages\diffusers\models\attention.py", line 404, in forward
hidden_states = self.attn1(norm_hidden_states) + hidden_states
File "C:\Users\uuu\.virtualenvs\stable-diffusion\lib\site-packages\torch\nn\modules\module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "C:\Users\uuu\.virtualenvs\stable-diffusion\lib\site-packages\diffusers\models\attention.py", line 494, in forward
hidden_states = self._memory_efficient_attention_xformers(query, key, value)
File "C:\Users\uuu\.virtualenvs\stable-diffusion\lib\site-packages\diffusers\models\attention.py", line 558, in _memory_efficient_attention_xformers
hidden_states = xformers.ops.memory_efficient_attention(query, key, value, attn_bias=None)
File "f:\xformers\xformers\ops\memory_efficient_attention.py", line 922, in memory_efficient_attention
query=query, key=key, value=value, attn_bias=attn_bias, p=p
File "f:\xformers\xformers\ops\memory_efficient_attention.py", line 792, in op
raise NotImplementedError(f"No operator found for this attention: {self}")
NotImplementedError: No operator found for this attention: AttentionOpDispatch(dtype=torch.float32, device=device(type='cuda', index=0), k=80, has_dropout=False, attn_bias_type=<class 'NoneType'>, kv_len=1024, q_len=1024, kv=80, batch_size=8, num_heads=1, requires_grad=True
For easier reproduction, I made a more concise test case:
import torch
import torch.nn.functional as F
from diffusers import UNet2DConditionModel
weight_dtype = torch.float32
cfg = {
"_class_name": "UNet2DConditionModel",
"_diffusers_version": "0.8.0.dev0",
"act_fn": "silu",
"attention_head_dim": 8,
"block_out_channels": [
320,
640,
1280,
1280
],
"center_input_sample": False,
"cross_attention_dim": 768,
"down_block_types": [
"CrossAttnDownBlock2D",
"CrossAttnDownBlock2D",
"CrossAttnDownBlock2D",
"DownBlock2D"
],
"downsample_padding": 1,
"flip_sin_to_cos": True,
"freq_shift": 0,
"in_channels": 4,
"layers_per_block": 2,
"mid_block_scale_factor": 1,
"norm_eps": 1e-05,
"norm_num_groups": 32,
"out_channels": 4,
"sample_size": 32,
"up_block_types": [
"UpBlock2D",
"CrossAttnUpBlock2D",
"CrossAttnUpBlock2D",
"CrossAttnUpBlock2D"
]
}
unet = UNet2DConditionModel(**cfg)
unet.to('cuda', dtype=weight_dtype)
unet.set_use_memory_efficient_attention_xformers(True)
noise = torch.randn(1, 4, 64, 64).to('cuda', dtype=weight_dtype)
noisy_latents = torch.randn(1, 4, 64, 64).to('cuda', dtype=weight_dtype)
timesteps = torch.tensor(543, device='cuda', dtype=torch.int64)
encoder_hidden_states = torch.randn(1, 10, 768).to('cuda', dtype=weight_dtype)
noise_pred = unet(noisy_latents, timesteps, encoder_hidden_states).sample
loss = F.mse_loss(noise_pred.float(), noise.float(), reduction="mean")
loss.backward()
Yep, I'm getting the same error
NotImplementedError: No operator found for this attention: AttentionOpDispatch(dtype=torch.float32, device=device(type='cuda', index=0), k=80, has_dropout=False, attn_bias_type=<class 'NoneType'>, kv_len=1024, q_len=1024, kv=80, batch_size=8, num_heads=1, requires_grad=True)
when trying to run the train_dreambooth script in FP32 mode.
Output of python -m xformers.info:
xFormers 0.0.15.dev+103e863.d20221123
memory_efficient_attention.flshatt: available - requires GPU with compute capability 7.5+
memory_efficient_attention.cutlass: available
memory_efficient_attention.small_k: available
swiglu.fused.p.cpp: available
is_triton_available: True
is_functorch_available: False
pytorch.version: 1.12.1
pytorch.cuda: available
gpu.compute_capability: 8.6
gpu.name: NVIDIA A10G
@patil-suraj Can you share which commit you built xformers from?
@patil-suraj is this still relevant?
After updating to the latest diffusers (0.9.0) and the latest xformers, float16 and bfloat16 work fine, but float32 still does not.
@eeyrw,
Could you try the prebuilt pip wheels for xformers from https://github.com/TheLastBen/fast-stable-diffusion/tree/main/precompiled ?
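For reference, installing such a prebuilt wheel would look roughly like this (the filename below is hypothetical; pick the wheel matching your Python version, CUDA version, and GPU architecture):
# hypothetical wheel name - adjust to your environment
pip install xformers-0.0.14.dev0-cp37-cp37m-win_amd64.whl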
@patrickvonplaten There is no suitable wheel for my graphics card. However, I also tried another platform (Linux) and another GPU (A5000) and everything worked there... I'm giving up fighting with Windows and this xformers build, since I never actually use float32 for inference or training anyway. Let's forget it.
I'm building xformers from a specific commit as follows:
pip install git+https://github.com/facebookresearch/xformers@7e4c02c#egg=xformers
This works well for me on Linux with both fp32 and fp16. I'm not really sure about the Windows issue.
I asked some community members about their Windows setup for xformers and will link their resources here:
- https://github.com/C43H66N12O12S2/stable-diffusion-webui/releases
- https://github.com/AUTOMATIC1111/stable-diffusion-webui/wiki/Xformers
- for T4 and A10G https://github.com/camenduru/stable-diffusion-webui-colab/releases/tag/0.0.15
- https://anaconda.org/xformers/xformers
- https://gist.github.com/geocine/e51fcc8511c91e4e3b257a0ebee938d0
Thanks a lot @camenduru and @nitrosocke!
Also cc @pcuenca - we should probably add these install instructions to the Stable Diffusion README.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
According to the latest progress from Facebook, the bug has been fixed: https://github.com/facebookresearch/xformers/issues/535#issuecomment-1375555450
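For anyone landing here later: recent xformers releases also ship prebuilt wheels on PyPI, so upgrading may be enough to pick up the fix (a sketch; check the xformers README for the build matching your PyTorch/CUDA setup):
pip install -U xformers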