xformers
RuntimeError: CUDA error in running xformers with attention_mask arg
🐛 Bug
I call the function `_memory_efficient_attention_xformers` to compute attention with an `attention_mask` argument, where my attention_mask is a torch.Tensor. While running, it raises `RuntimeError: CUDA error: invalid configuration argument`.
To Reproduce
My input tensor sizes are:
query : shape=(262144, 1, 1, 40) (torch.float16)
key : shape=(262144, 121, 1, 40) (torch.float16)
value : shape=(262144, 121, 1, 40) (torch.float16)
attn_bias : <class 'torch.Tensor'>
The third dim of size 1 is added by xformers; my attn_bias is actually (262144, 1, 121). It first showed:
HINT: To use an `attn_bias` with a sequence length that is not a multiple of 8, you need to ensure memory is aligned by slicing a bigger tensor. Example: use `attn_bias = torch.zeros([1, 1, 5, 8])[:,:,:,:5]` instead of `torch.zeros([1, 1, 5, 5])`
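In code, the hint amounts to allocating the bias with a padded last dim and slicing it back. Below is a minimal sketch of my failing call, reconstructed from the shapes above; the random q/k/v are stand-ins for my real tensors:

```python
import torch
import xformers.ops as xops

B, Mq, Mk, K = 262144, 1, 121, 40  # batch(*heads), query len, key len, head dim

# Random stand-ins for my real query/key/value (3D BMK layout; xformers
# adds the singleton head dim itself).
q = torch.randn(B, Mq, K, dtype=torch.float16, device="cuda")
k = torch.randn(B, Mk, K, dtype=torch.float16, device="cuda")
v = torch.randn(B, Mk, K, dtype=torch.float16, device="cuda")

# Per the hint: allocate the bias padded to a multiple of 8 in the last
# dim, then slice back to the real key length so the memory stays aligned.
attn_bias = torch.zeros(B, Mq, 128, dtype=torch.float16, device="cuda")[:, :, :Mk]

out = xops.memory_efficient_attention(q, k, v, attn_bias=attn_bias)  # CUDA error here
```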
After I followed this and changed my attn_bias to the aligned slice [262144, 1, 128][:, :, :121] (as in the sketch above), the CUDA error appeared. Here is the complete error message:
File "/mnt/data/yuxingyuan/MotionControl/CameraCtrl/animatediff/models/motion_module.py", line 597, in forward
hidden_states = self._memory_efficient_attention_xformers(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/data/miniconda3/envs/camctrl/lib/python3.11/site-packages/diffusers/models/attention.py", line 728, in _memory_efficient_attention_xformers
hidden_states = xformers.ops.memory_efficient_attention(query, key, value, attn_bias=attention_mask)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/data/miniconda3/envs/camctrl/lib/python3.11/site-packages/xformers/ops/fmha/__init__.py", line 223, in memory_efficient_attention
return _memory_efficient_attention(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/data/miniconda3/envs/camctrl/lib/python3.11/site-packages/xformers/ops/fmha/__init__.py", line 326, in _memory_efficient_attention
return _fMHA.apply(
^^^^^^^^^^^^
File "/mnt/data/miniconda3/envs/camctrl/lib/python3.11/site-packages/torch/autograd/function.py", line 553, in apply
return super().apply(*args, **kwargs) # type: ignore[misc]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/data/miniconda3/envs/camctrl/lib/python3.11/site-packages/xformers/ops/fmha/__init__.py", line 42, in forward
out, op_ctx = _memory_efficient_attention_forward_requires_grad(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/data/miniconda3/envs/camctrl/lib/python3.11/site-packages/xformers/ops/fmha/__init__.py", line 354, in _memory_efficient_attention_forward_requires_grad
out = op.apply(inp, needs_gradient=True)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/data/miniconda3/envs/camctrl/lib/python3.11/site-packages/xformers/ops/fmha/cutlass.py", line 202, in apply
return cls.apply_bmhk(inp, needs_gradient=needs_gradient)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/data/miniconda3/envs/camctrl/lib/python3.11/site-packages/xformers/ops/fmha/cutlass.py", line 266, in apply_bmhk
out, lse, rng_seed, rng_offset = cls.OPERATOR(
^^^^^^^^^^^^^
File "/mnt/data/miniconda3/envs/camctrl/lib/python3.11/site-packages/torch/_ops.py", line 755, in __call__
return self._op(*args, **(kwargs or {}))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: invalid configuration argument
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Environment
My xFormers version is 0.0.24. Other version info is shown below:
PyTorch version: 2.2.0
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A
OS: Ubuntu 20.04.6 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
Clang version: Could not collect
CMake version: version 3.28.20231226-gb2ea53f
Libc version: glibc-2.31
Python version: 3.11.7 (main, Dec 15 2023, 18:12:31) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.4.0-144-generic-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA A800-SXM4-80GB
Nvidia driver version: 525.147.05
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
Hi,
I believe your batch size is too big. Can you try something smaller than 65536?
I'll try your advice later. Thanks.
But it's weird that my code runs normally at the same batch size without xformers, so I think it shouldn't be a batch-size problem?
No, because the optimized kernels are built for specific sizes, and the maximum size anything is built for on sm_80 (Ampere) is 32768, because of the number of possible in-flight operations, IIRC.
Sorry, the max bounds in the source are 65536, 128, 32, but it looks like you should be able to fit within those by reshaping the tensor (see the sketch below). The kernels seem to want power-of-two values, so 121 and 40 won't really fly. When running without xformers it may be less strict and just pad things out to the right sizes silently.
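One way to read "reshaping the tensor" here, as a sketch: attention is computed independently per (batch, head), so you can fold a factor of the oversized batch into the head axis to get under the batch bound. This continues from the tensors in the repro sketch above; the split factor of 8 and all names are illustrative, not an xformers API:

```python
import torch
import xformers.ops as xops

B, F, Mq, Mk, K = 32768, 8, 1, 121, 40  # 262144 = 32768 * 8

# q/k/v arrive as (B*F, M, K); fold the extra factor into a heads axis (BMHK).
# Attention is independent per (batch, head), so results are unchanged.
q4 = q.view(B, F, Mq, K).transpose(1, 2).contiguous()  # -> (B, Mq, F, K)
k4 = k.view(B, F, Mk, K).transpose(1, 2).contiguous()
v4 = v.view(B, F, Mk, K).transpose(1, 2).contiguous()

# Rebuild the bias as (B, heads, Mq, Mk) with the same padded-then-sliced
# trick; reshaping the old sliced view would copy it into unpadded storage.
bias4 = torch.zeros(B, F, Mq, 128, dtype=q.dtype, device=q.device)[..., :Mk]
bias4.copy_(attn_bias.reshape(B, F, Mq, Mk))

out = xops.memory_efficient_attention(q4, k4, v4, attn_bias=bias4)
out = out.transpose(1, 2).reshape(B * F, Mq, K)  # back to the original layout
```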
I get it! Thanks for answering my question. I'll try adjusting my code.
The reason is that CUDA kernels are parallelized across at most 3 dimensions, and we chose to parallelize across the batch size in the y or z dimension (I don't recall exactly). This could probably be fixed, but it's low priority for us at the moment; a chunked-call workaround is sketched after the quote below.
Blocks can be organized into one-, two- or three-dimensional grids of up to 2^31 - 1, 65,535 and 65,535 blocks in the x, y and z dimensions respectively.
Source: https://en.wikipedia.org/wiki/Thread_block_(CUDA_programming)#Dimensions
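Given those limits, a workaround consistent with the explanation above is to split the call into batch chunks under the grid bound. The helper name and chunk size below are illustrative, not an xformers API:

```python
import torch
import xformers.ops as xops

def attention_in_batch_chunks(q, k, v, attn_bias=None, max_batch=32768):
    # Run memory_efficient_attention on slices of the batch so each kernel
    # launch stays under the 65,535-block grid limit in the y/z dimensions.
    outs = []
    for i in range(0, q.shape[0], max_batch):
        sl = slice(i, i + max_batch)
        bias = attn_bias[sl] if attn_bias is not None else None
        outs.append(xops.memory_efficient_attention(q[sl], k[sl], v[sl], attn_bias=bias))
    return torch.cat(outs, dim=0)
```

Note that slicing the bias along the batch dimension leaves its strides unchanged, so the padded-then-sliced alignment trick from the hint stays intact for each chunk.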