
RuntimeError: CUDA error in running xformers with attention_mask arg

Open · RickyYXY opened this issue on Mar 18, 2024 · 7 comments

🐛 Bug

I call the function `_memory_efficient_attention_xformers` to compute attention with an `attention_mask` argument, where the mask is a plain `torch.Tensor`. While running, it raises `RuntimeError: CUDA error: invalid configuration argument`.

To Reproduce

My input tensors' shapes are:

query       : shape=(262144, 1, 1, 40) (torch.float16)
key         : shape=(262144, 121, 1, 40) (torch.float16)
value       : shape=(262144, 121, 1, 40) (torch.float16)
attn_bias   : <class 'torch.Tensor'>

The third dim (1) was added by xformers; my attn_bias is actually (262144, 1, 121). At first it showed:

HINT: To use an `attn_bias` with a sequence length that is not a multiple of 8, you need to ensure memory is aligned by slicing a bigger tensor. Example: use `attn_bias = torch.zeros([1, 1, 5, 8])[:,:,:,:5]` instead of `torch.zeros([1, 1, 5, 5])`
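In other words, the hint asks you to allocate the bias with a padded last dimension and then slice it, roughly like this (a sketch using the shapes from this report; the variable names are made up):

```python
import torch

# Shapes from this report: batch 262144, query length 1, kv length 121.
# Allocate the bias with its last dim padded to a multiple of 8 (here 128),
# then slice back down; the slice keeps the aligned storage underneath.
attn_bias_full = torch.zeros(262144, 1, 128, device="cuda", dtype=torch.float16)
attn_bias = attn_bias_full[:, :, :121]  # (262144, 1, 121), memory-aligned
```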

After I followed this and changed my attn_bias from [262144, 1, 121] to a [262144, 1, 128] tensor sliced back to [:, :, :121] (as sketched above), the CUDA error appeared. Here is the complete error message:

File "/mnt/data/yuxingyuan/MotionControl/CameraCtrl/animatediff/models/motion_module.py", line 597, in forward
    hidden_states = self._memory_efficient_attention_xformers(
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/data/miniconda3/envs/camctrl/lib/python3.11/site-packages/diffusers/models/attention.py", line 728, in _memory_efficient_attention_xformers
    hidden_states = xformers.ops.memory_efficient_attention(query, key, value, attn_bias=attention_mask)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/data/miniconda3/envs/camctrl/lib/python3.11/site-packages/xformers/ops/fmha/__init__.py", line 223, in memory_efficient_attention
    return _memory_efficient_attention(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/data/miniconda3/envs/camctrl/lib/python3.11/site-packages/xformers/ops/fmha/__init__.py", line 326, in _memory_efficient_attention
    return _fMHA.apply(
           ^^^^^^^^^^^^
  File "/mnt/data/miniconda3/envs/camctrl/lib/python3.11/site-packages/torch/autograd/function.py", line 553, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/data/miniconda3/envs/camctrl/lib/python3.11/site-packages/xformers/ops/fmha/__init__.py", line 42, in forward
    out, op_ctx = _memory_efficient_attention_forward_requires_grad(
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/data/miniconda3/envs/camctrl/lib/python3.11/site-packages/xformers/ops/fmha/__init__.py", line 354, in _memory_efficient_attention_forward_requires_grad
    out = op.apply(inp, needs_gradient=True)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/data/miniconda3/envs/camctrl/lib/python3.11/site-packages/xformers/ops/fmha/cutlass.py", line 202, in apply
    return cls.apply_bmhk(inp, needs_gradient=needs_gradient)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/data/miniconda3/envs/camctrl/lib/python3.11/site-packages/xformers/ops/fmha/cutlass.py", line 266, in apply_bmhk
    out, lse, rng_seed, rng_offset = cls.OPERATOR(
                                     ^^^^^^^^^^^^^
  File "/mnt/data/miniconda3/envs/camctrl/lib/python3.11/site-packages/torch/_ops.py", line 755, in __call__
    return self._op(*args, **(kwargs or {}))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: invalid configuration argument
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Environment

My xFormers version is 0.0.24. Other version info is shown below:

PyTorch version: 2.2.0                                                                                                                                                                                                       
Is debug build: False                                                                                                                                                                                                        
CUDA used to build PyTorch: 12.1                                                                                                                                                                                             
ROCM used to build PyTorch: N/A                                                                                                                                                                                              
                                                                                                                                                                                                                             
OS: Ubuntu 20.04.6 LTS (x86_64)                                                                                                                                                                                              
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0                                                                                                                                                                           
Clang version: Could not collect                                                                                                                                                                                             
CMake version: version 3.28.20231226-gb2ea53f                                                                                                                                                                                
Libc version: glibc-2.31                                                                                                                                                                                                     
                                                                                                                                                                                                                             
Python version: 3.11.7 (main, Dec 15 2023, 18:12:31) [GCC 11.2.0] (64-bit runtime)                                                                                                                                           
Python platform: Linux-5.4.0-144-generic-x86_64-with-glibc2.31                                                                                                                                                               
Is CUDA available: True                                                                                                                                                                                                      
CUDA runtime version: Could not collect                                                                                                                                                                                      
CUDA_MODULE_LOADING set to: LAZY                                                                                                                                                                                             
GPU models and configuration:                                                                                                                                                                                                
GPU 0: NVIDIA A800-SXM4-80GB
Nvidia driver version: 525.147.05                                                                                                                                                                                            
cuDNN version: Could not collect                                                                                                                                                                                             
HIP runtime version: N/A                                                                                                                                                                                                     
MIOpen runtime version: N/A                                                                                                                                                                                                  
Is XNNPACK available: True 

RickyYXY commented on Mar 18, 2024

Hi, I believe your batch size is too big. Can you try something smaller than 65536?

danthe3rd commented on Mar 20, 2024

> Hi, I believe your batch size is too big. Can you try something smaller than 65536?

I'll try your advice later. Thanks!

RickyYXY commented on Mar 21, 2024

But it's weird that my code runs fine without xformers at the same batch size, so I don't think it's a batch-size problem?

RickyYXY commented on Mar 21, 2024

No, because the optimized kernels are built for specific sizes, and the maximum size anything is built for on sm_80 (Ampere) is 32768, because of the number of possible in-flight operations IIRC.

NeedsMoar commented on Mar 24, 2024

Sorry, the max bounds (from the source) are 65536, 128, 32, but it looks like you should be able to fit within those by reshaping the tensor. The kernels seem to want power-of-two values, so 121 and 40 won't really fly. When running without xformers it may be less strict and just pad things out to the right sizes silently.

NeedsMoar commented on Mar 24, 2024

> Sorry, the max bounds (from the source) are 65536, 128, 32, but it looks like you should be able to fit within those by reshaping the tensor. The kernels seem to want power-of-two values, so 121 and 40 won't really fly. When running without xformers it may be less strict and just pad things out to the right sizes silently.

I get it! Thanks for answering my question. I'll try to adjust my code.

RickyYXY commented on Mar 25, 2024
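For what it's worth, padding the key/value sequence length up to 128 and masking the padding out through the bias might look like the sketch below. This is an assumption-laden sketch, not code from the issue: the helper name is invented, it only pads the sequence dimension (not the head dim of 40), and it does not address the batch-size limit discussed next.

```python
import torch
import torch.nn.functional as F
import xformers.ops as xops

def attention_with_padded_kv(q, k, v, attn_bias, pad_to=128):
    """Pad the key/value sequence length and mask the padding via the bias.

    Assumes q/k/v are (B, M, H, K) tensors as in the report, and that the
    last dim of attn_bias is the key sequence length. Padded keys get the
    dtype's most negative value, so softmax gives them ~zero weight.
    """
    pad = pad_to - k.shape[1]
    if pad <= 0:
        return xops.memory_efficient_attention(q, k, v, attn_bias=attn_bias)
    # F.pad pads from the last dim backwards: (K, H, M) pairs for a 4D tensor.
    k = F.pad(k, (0, 0, 0, 0, 0, pad))
    v = F.pad(v, (0, 0, 0, 0, 0, pad))
    attn_bias = F.pad(attn_bias, (0, pad), value=torch.finfo(q.dtype).min)
    return xops.memory_efficient_attention(q, k, v, attn_bias=attn_bias)
```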

The reason is that CUDA kernels are parallelized across at most 3 dimensions, and we chose to parallelize across the batch size in the y or z dimension (I don't recall exactly). This could probably be fixed, but it's low priority for us at the moment.

> Blocks can be organized into one-, two- or three-dimensional grids of up to 2^31-1, 65,535 and 65,535 blocks in the x, y and z dimensions respectively

Source: https://en.wikipedia.org/wiki/Thread_block_(CUDA_programming)#Dimensions

danthe3rd commented on Mar 25, 2024
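Until then, a workaround consistent with this explanation is to split the batch into chunks below the grid limit before calling the kernel. A minimal sketch (the helper is hypothetical, not part of xformers):

```python
import torch
import xformers.ops as xops

def chunked_memory_efficient_attention(q, k, v, attn_bias=None, max_batch=32768):
    """Call the kernel in batch chunks that stay below the CUDA grid limit.

    Sketch of a workaround, assuming the batch is the leading dimension of
    every input (and of the dense attn_bias, if one is given).
    """
    outs = []
    for i in range(0, q.shape[0], max_batch):
        bias = None if attn_bias is None else attn_bias[i : i + max_batch]
        outs.append(
            xops.memory_efficient_attention(
                q[i : i + max_batch],
                k[i : i + max_batch],
                v[i : i + max_batch],
                attn_bias=bias,
            )
        )
    return torch.cat(outs, dim=0)
```

With the shapes in this report, `max_batch=32768` would turn the single 262144-batch launch into 8 smaller launches, each within the 65,535-block grid bound quoted above.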