How to debug `tex.fused_attn_bwd` getting `cuDNN Error: [cudnn_frontend] Error: No execution plans support the graph`
Describe the bug
The fused attention backward pass raises a RuntimeError with no informative message. Setting CUDNN_LOGERR_DBG=1 and CUDNN_LOGDEST_DBG=stderr does not help.
Error Msg
Traceback (most recent call last):
File "./test.py", line 25, in <module>
output_fused.backward(out_grad)
File "/xxx/.venv/lib/python3.12/site-packages/torch/_tensor.py", line 626, in backward
torch.autograd.backward(
File "/xxx/.venv/lib/python3.12/site-packages/torch/autograd/__init__.py", line 347, in backward
_engine_run_backward(
File "/xxx/.venv/lib/python3.12/site-packages/torch/autograd/graph.py", line 823, in _engine_run_backward
return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/xxx/.venv/lib/python3.12/site-packages/torch/autograd/function.py", line 307, in apply
return user_fn(self, *args)
^^^^^^^^^^^^^^^^^^^^
File "/xxx/.venv/lib/python3.12/site-packages/transformer_engine/pytorch/attention.py", line 6340, in backward
dq, dk, dv, *rest = fused_attn_bwd(
^^^^^^^^^^^^^^^
File "/xxx/.venv/lib/python3.12/site-packages/transformer_engine/pytorch/cpp_extensions/fused_attn.py", line 451, in fused_attn_bwd
output_tensors = tex.fused_attn_bwd(
^^^^^^^^^^^^^^^^^^^
RuntimeError: /xxxx/TransformerEngine/transformer_engine/common/fused_attn/fused_attn_f16_arbitrary_seqlen.cu:771 in function operator(): cuDNN Error: [cudnn_frontend] Error: No execution plans support the graph.. For more information, enable cuDNN error logging by setting CUDNN_LOGERR_DBG=1 and CUDNN_LOGDEST_DBG=stderr in the environment.
Steps/Code to reproduce bug
`test.py`
import os
import torch
from transformer_engine.pytorch.attention import DotProductAttention, _attention_backends
seqlen, batch_size, heads, kv_channels = 1024, 2, 16, 192
q, k = [torch.randn(seqlen, batch_size, heads, kv_channels, dtype=torch.float16, device="cuda", requires_grad=True) for _ in range(2)]
v = torch.randn(seqlen, batch_size, heads, 128, dtype=torch.float16, device="cuda", requires_grad=True)
cu_seqlens_q = cu_seqlens_kv = torch.tensor([0, 1024, 2048], device="cuda", dtype=torch.int32)
attention_kernel = DotProductAttention(heads, (192, 128))
os.environ["NVTE_FUSED_ATTN"] = "1"
os.environ["NVTE_FLASH_ATTN"] = "0"
_attention_backends["backend_selection_requires_update"] = True
output_fused = attention_kernel(q, k, v, qkv_format='sbhd', attn_mask_type='causal', cu_seqlens_q=cu_seqlens_q, cu_seqlens_kv=cu_seqlens_kv)
print(output_fused.shape)
out_grad = 0.001 * torch.randint(0, 200, (1024, 2, 2048), device="cuda")
output_fused.backward(out_grad)
CUDNN_LOGERR_DBG=1 CUDNN_LOGDEST_DBG=stderr NVTE_DEBUG=1 NVTE_DEBUG_LEVEL=1 python test.py
Expected behavior
The backward pass is expected to complete without error.
Environment overview (please complete the following information)
H100, cuDNN 9.1.0, CUDA 12.3, Python 3.12.6, PyTorch 2.6.0+cu124
TE installed by compiling from source with uv, at https://github.com/NVIDIA/TransformerEngine/commit/8eb17125d36d1886f4c3fb14ca4184f0239b7c06
compilation script
CMAKE_BUILD_WITH_INSTALL_RPATH=ON \
CMAKE_INSTALL_RPATH_USE_LINK_PATH=ON \
CMAKE_SKIP_BUILD_RPATH=FALSE \
CMAKE_BUILD_WITH_INSTALL_RPATH=TRUE \
CMAKE_INSTALL_RPATH="/xxx/.venv/lib/python3.12/site-packages/nvidia/cudnn/lib/" \
CUDNN_PATH=/xxx/.venv/lib/python3.12/site-packages/nvidia/cudnn/ \
CUDACXX=/usr/local/cuda-12.3/bin/nvcc \
CMAKE_CUDA_COMPILER=/usr/local/cuda-12.3/bin/nvcc \
CUDA_HOME=/usr/local/cuda-12.3 \
NVTE_FRAMEWORK=pytorch \
MAX_JOBS=96 \
CC=gcc \
CXX=g++ \
CMAKE_GENERATOR="Unix Makefiles" \
SKBUILD_CMAKE_ARGS="-DCMAKE_BUILD_WITH_INSTALL_RPATH=ON -DCMAKE_INSTALL_RPATH_USE_LINK_PATH=ON" \
uv pip install -v "." --no-build-isolation --no-cache-dir
I encountered the same problem. It seems that fused attention does not support cases where the q/k and v head dimensions differ. My workaround is to pad v to 192 (see the sketch below) so that fused or flash attention can be used. I hope TE can support attention with different v head dimensions in the future.
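A minimal sketch of that padding workaround, reusing the shapes from the repro above. The commented-out lines reuse names from the repro script (`attention_kernel`, `q`, `k`), and the final reshape assumes the attention output is flattened to `heads * head_dim_v` in its last dimension, as in the repro's printed shape; treat it as an illustration, not TE-endorsed code.

```python
# Sketch of the padding workaround: pad v's head dim (128) up to q/k's (192)
# with zeros so that fused/flash attention accepts it, then slice the extra
# channels back off the output. Zero channels in v contribute nothing to the
# attention result, and autograd routes their gradients into the padding,
# so the gradient of the original v is unchanged.
import torch
import torch.nn.functional as F

seqlen, batch, heads, d_qk, d_v = 1024, 2, 16, 192, 128
v = torch.randn(seqlen, batch, heads, d_v, dtype=torch.float16,
                device="cuda", requires_grad=True)

v_padded = F.pad(v, (0, d_qk - d_v))           # last dim: 128 -> 192
# out = attention_kernel(q, k, v_padded, ...)  # output last dim: heads * 192
# out = out.view(seqlen, batch, heads, d_qk)[..., :d_v].reshape(seqlen, batch, heads * d_v)
```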
Adding @cyanguwa.
Fused attention, i.e. cuDNN attention, does support different kv head dimensions. The error here is due to the lack of support for bprop with head dimension > 128. Please upgrade to cuDNN 9.5+ in order to use this feature.
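For reference, a quick way to see which cuDNN version PyTorch links against is shown below; note that Transformer Engine may load a different libcudnn (e.g. the pip-installed nvidia-cudnn package from the build script above), and the `cudnn_version` field in the NVTE_DEBUG config dump that follows is the one TE actually sees.

```python
# Print the cuDNN version PyTorch links against (e.g. 90100 for 9.1.0,
# 90500 for 9.5.0). TE reports its own detected version in the
# 'cudnn_version' field of the NVTE_DEBUG=1 NVTE_DEBUG_LEVEL=2 dump.
import torch
print(torch.backends.cudnn.version())
```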
NVTE_DEBUG=1 NVTE_DEBUG_LEVEL=2 python test.py
[DEBUG | DotProductAttention]: Running with config={'transformer_engine_version': '2.4.0.dev0+94bff099', 'compute_capability': 'sm90', 'flash_attn_version': '2.7.3', 'flash_attn_3_version': 'not installed', 'cudnn_version': '9.5.0', 'qkv_type': <class 'torch.Tensor'>, 'qkv_dtype': torch.float16, 'qkv_layout': 'sbhd_sbhd_sbhd', 'batch_size': 2, 'num_heads': 16, 'num_gqa_groups': 16, 'max_seqlen_q': 1024, 'max_seqlen_kv': 1024, 'head_dim_qk': 192, 'head_dim_v': 128, 'attn_mask_type': 'causal', 'window_size': (-1, 0), 'alibi_slopes_shape': None, 'core_attention_bias_type': 'no_bias', 'core_attention_bias_shape': None, 'core_attention_bias_requires_grad': False, 'pad_between_seqs': False, 'attention_dropout': 0.0, 'context_parallel': False, 'deterministic': False, 'is_training': True, 'fp8': False, 'fp8_meta': {'fp8_checkpoint': False, 'fp8_group': None}, 'inference_params': None}
[DEBUG | DotProductAttention]: Disabling FlashAttention 2 due to NVTE_FLASH_ATTN=0
[DEBUG | DotProductAttention]: Available backends = {FlashAttention=False, FusedAttention=True (sub-backend 1), UnfusedDotProductAttention=True}
[DEBUG | DotProductAttention]: Selected backend = FusedAttention (sub-backend 1)
[INFO | DotProductAttention]: Running with FusedAttention backend (sub-backend 1)
torch.Size([1024, 2, 2048])
@cyanguwa What we should probably do here, though, is extend our support check: if the backward pass is going to be used (because the inputs require gradients), we should check that both fwd and bwd are supported before choosing fused attention (see the sketch below).
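A self-contained sketch of that idea; the helper below is hypothetical and only encodes the single constraint from this issue (bprop with head dim > 128 needs cuDNN 9.5+), not Transformer Engine's real support matrix.

```python
# Hypothetical sketch: only pick the fused (cuDNN) backend when every kernel
# that will actually run is supported. The head-dim limits are illustrative.
from dataclasses import dataclass

@dataclass
class AttnCase:
    head_dim_qk: int
    head_dim_v: int
    cudnn_version: tuple   # e.g. (9, 1, 0)
    is_training: bool      # True when inputs require grad, i.e. bwd will run

def pick_backend(case: AttnCase) -> str:
    fwd_ok = max(case.head_dim_qk, case.head_dim_v) <= 256
    # bprop with head dim > 128 is only available on cuDNN 9.5+
    bwd_ok = (max(case.head_dim_qk, case.head_dim_v) <= 128
              or case.cudnn_version >= (9, 5, 0))
    if fwd_ok and (bwd_ok or not case.is_training):
        return "FusedAttention"
    return "UnfusedDotProductAttention"

# The configuration from this issue: fwd is fine, bwd is not on cuDNN 9.1.
print(pick_backend(AttnCase(192, 128, (9, 1, 0), is_training=True)))  # Unfused...
print(pick_backend(AttnCase(192, 128, (9, 5, 0), is_training=True)))  # FusedAttention
```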
That’s so cool—this has been a huge help to me!
I've added is_training as a parameter when checking backend support, so we can distinguish training and inference.
Anyway, hope the issue has been resolved! Closing the ticket for now. Thanks.
Running with config={'transformer_engine_version': '2.2.0+d0c452cc', 'compute_capability': 'sm90', 'flash_attn_version': 'not installed', 'flash_attn_3_version': 'not installed', 'cudnn_version': '9.8.0', 'qkv_type': <class 'torch.Tensor'>, 'qkv_dtype': torch.bfloat16, 'qkv_layout': 'thd_thd_thd', 'batch_size': 2, 'num_heads': 1, 'num_gqa_groups': 1, 'max_seqlen_q': 2176, 'max_seqlen_kv': 2176, 'head_dim_qk': 192, 'head_dim_v': 128, 'attn_mask_type': 'padding_causal', 'window_size': (-1, 0), 'alibi_slopes_shape': None, 'core_attention_bias_type': 'no_bias', 'core_attention_bias_shape': None, 'core_attention_bias_requires_grad': False, 'pad_between_seqs': False, 'attention_dropout': 0.0, 'context_parallel': False, 'deterministic': False, 'is_training': True, 'fp8': False, 'fp8_meta': {'fp8_checkpoint': False, 'fp8_group': None}, 'inference_params': None}
DEBUG: Disabling UnfusedDotProductAttention for qkv_format = thd
DEBUG: Disabling FusedAttention as no backend supports the provided input
DEBUG: Available backends = {FlashAttention=False, FusedAttention=False, UnfusedDotProductAttention=False}
DEBUG: Selected backend = NoBackend
My cuDNN version is already 9.8, but there's still an error. What's going on?