How to debug `tex.fused_attn_bwd` getting `cuDNN Error: [cudnn_frontend] Error: No execution plans support the graph`
Describe the bug
The fused attention backward pass raises a RuntimeError with no informative message. Setting CUDNN_LOGERR_DBG=1 and CUDNN_LOGDEST_DBG=stderr does not help.
Error Msg
Traceback (most recent call last):
File "./test.py", line 25, in <module>
output_fused.backward(out_grad)
File "/xxx/.venv/lib/python3.12/site-packages/torch/_tensor.py", line 626, in backward
torch.autograd.backward(
File "/xxx/.venv/lib/python3.12/site-packages/torch/autograd/__init__.py", line 347, in backward
_engine_run_backward(
File "/xxx/.venv/lib/python3.12/site-packages/torch/autograd/graph.py", line 823, in _engine_run_backward
return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/xxx/.venv/lib/python3.12/site-packages/torch/autograd/function.py", line 307, in apply
return user_fn(self, *args)
^^^^^^^^^^^^^^^^^^^^
File "/xxx/.venv/lib/python3.12/site-packages/transformer_engine/pytorch/attention.py", line 6340, in backward
dq, dk, dv, *rest = fused_attn_bwd(
^^^^^^^^^^^^^^^
File "/xxx/.venv/lib/python3.12/site-packages/transformer_engine/pytorch/cpp_extensions/fused_attn.py", line 451, in fused_attn_bwd
output_tensors = tex.fused_attn_bwd(
^^^^^^^^^^^^^^^^^^^
RuntimeError: /xxxx/TransformerEngine/transformer_engine/common/fused_attn/fused_attn_f16_arbitrary_seqlen.cu:771 in function operator(): cuDNN Error: [cudnn_frontend] Error: No execution plans support the graph.. For more information, enable cuDNN error logging by setting CUDNN_LOGERR_DBG=1 and CUDNN_LOGDEST_DBG=stderr in the environment.
Steps/Code to reproduce bug
`test.py`
import os
import torch
from transformer_engine.pytorch.attention import DotProductAttention, _attention_backends
seqlen, batch_size, heads, kv_channels = 1024, 2, 16, 192
q, k = [torch.randn(seqlen, batch_size, heads, kv_channels, dtype=torch.float16, device="cuda", requires_grad=True) for _ in range(2)]
v = torch.randn(seqlen, batch_size, heads, 128, dtype=torch.float16, device="cuda", requires_grad=True)
cu_seqlens_q = cu_seqlens_kv = torch.tensor([0, 1024, 2048], device="cuda", dtype=torch.int32)
attention_kernel = DotProductAttention(heads, (192, 128))
os.environ["NVTE_FUSED_ATTN"] = "1"
os.environ["NVTE_FLASH_ATTN"] = "0"
_attention_backends["backend_selection_requires_update"] = True
output_fused = attention_kernel(q, k, v, qkv_format='sbhd', attn_mask_type='causal', cu_seqlens_q=cu_seqlens_q, cu_seqlens_kv=cu_seqlens_kv)
print(output_fused.shape)
out_grad = 0.001 * torch.randint(0, 200, (1024, 2, 2048), device="cuda")
output_fused.backward(out_grad)
CUDNN_LOGERR_DBG=1 CUDNN_LOGDEST_DBG=stderr NVTE_DEBUG=1 NVTE_DEBUG_LEVEL=1 python test.py
Expected behavior
The backward pass is expected to complete without error.
Environment overview (please complete the following information)
H100, cuDNN 9.1.0, CUDA 12.3, Python 3.12.6, PyTorch 2.6.0+cu124
TE installed by compiling from source with uv, at https://github.com/NVIDIA/TransformerEngine/commit/8eb17125d36d1886f4c3fb14ca4184f0239b7c06
compilation script
CMAKE_BUILD_WITH_INSTALL_RPATH=ON \
CMAKE_INSTALL_RPATH_USE_LINK_PATH=ON \
CMAKE_SKIP_BUILD_RPATH=FALSE \
CMAKE_BUILD_WITH_INSTALL_RPATH=TRUE \
CMAKE_INSTALL_RPATH="/xxx/.venv/lib/python3.12/site-packages/nvidia/cudnn/lib/" \
CUDNN_PATH=/xxx/.venv/lib/python3.12/site-packages/nvidia/cudnn/ \
CUDACXX=/usr/local/cuda-12.3/bin/nvcc \
CMAKE_CUDA_COMPILER=/usr/local/cuda-12.3/bin/nvcc \
CUDA_HOME=/usr/local/cuda-12.3 \
NVTE_FRAMEWORK=pytorch \
MAX_JOBS=96 \
CC=gcc \
CXX=g++ \
CMAKE_GENERATOR="Unix Makefiles" \
SKBUILD_CMAKE_ARGS="-DCMAKE_BUILD_WITH_INSTALL_RPATH=ON -DCMAKE_INSTALL_RPATH_USE_LINK_PATH=ON" \
uv pip install -v "." --no-build-isolation --no-cache-dir
I encountered the same problem. It seems that fused attention does not support cases where the q/k and v head dimensions differ. My workaround is to pad v to 192 (see the sketch below) so that fused or flash attention can be used. I hope TE can support attention with different v head dimensions in the future.
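A minimal sketch of that padding workaround, reusing the shapes from the repro above. The commented-out lines reuse names from the repro script (`attention_kernel`, `q`, `k`), and the final reshape assumes the attention output is flattened to `heads * head_dim_v` in its last dimension, as in the repro's printed shape; treat it as an illustration, not TE-endorsed code.

```python
# Sketch of the padding workaround: pad v's head dim (128) up to q/k's (192)
# with zeros so that fused/flash attention accepts it, then slice the extra
# channels back off the output. Zero channels in v contribute nothing to the
# attention result, and autograd routes their gradients into the padding,
# so the gradient of the original v is unchanged.
import torch
import torch.nn.functional as F

seqlen, batch, heads, d_qk, d_v = 1024, 2, 16, 192, 128
v = torch.randn(seqlen, batch, heads, d_v, dtype=torch.float16,
                device="cuda", requires_grad=True)

v_padded = F.pad(v, (0, d_qk - d_v))           # last dim: 128 -> 192
# out = attention_kernel(q, k, v_padded, ...)  # output last dim: heads * 192
# out = out.view(seqlen, batch, heads, d_qk)[..., :d_v].reshape(seqlen, batch, heads * d_v)
```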
Adding @cyanguwa.
Fused attention, i.e. cuDNN attention, does support different kv head dimensions. The error here is due to the lack of support for bprop with head dimension > 128. Please upgrade to cuDNN 9.5+ in order to use this feature.
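For reference, a quick way to see which cuDNN version PyTorch links against is shown below; note that Transformer Engine may load a different libcudnn (e.g. the pip-installed nvidia-cudnn package from the build script above), and the `cudnn_version` field in the NVTE_DEBUG config dump that follows is the one TE actually sees.

```python
# Print the cuDNN version PyTorch links against (e.g. 90100 for 9.1.0,
# 90500 for 9.5.0). TE reports its own detected version in the
# 'cudnn_version' field of the NVTE_DEBUG=1 NVTE_DEBUG_LEVEL=2 dump.
import torch
print(torch.backends.cudnn.version())
```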
NVTE_DEBUG=1 NVTE_DEBUG_LEVEL=2 python test.py
[DEBUG | DotProductAttention]: Running with config={'transformer_engine_version': '2.4.0.dev0+94bff099', 'compute_capability': 'sm90', 'flash_attn_version': '2.7.3', 'flash_attn_3_version': 'not installed', 'cudnn_version': '9.5.0', 'qkv_type': <class 'torch.Tensor'>, 'qkv_dtype': torch.float16, 'qkv_layout': 'sbhd_sbhd_sbhd', 'batch_size': 2, 'num_heads': 16, 'num_gqa_groups': 16, 'max_seqlen_q': 1024, 'max_seqlen_kv': 1024, 'head_dim_qk': 192, 'head_dim_v': 128, 'attn_mask_type': 'causal', 'window_size': (-1, 0), 'alibi_slopes_shape': None, 'core_attention_bias_type': 'no_bias', 'core_attention_bias_shape': None, 'core_attention_bias_requires_grad': False, 'pad_between_seqs': False, 'attention_dropout': 0.0, 'context_parallel': False, 'deterministic': False, 'is_training': True, 'fp8': False, 'fp8_meta': {'fp8_checkpoint': False, 'fp8_group': None}, 'inference_params': None}
[DEBUG | DotProductAttention]: Disabling FlashAttention 2 due to NVTE_FLASH_ATTN=0
[DEBUG | DotProductAttention]: Available backends = {FlashAttention=False, FusedAttention=True (sub-backend 1), UnfusedDotProductAttention=True}
[DEBUG | DotProductAttention]: Selected backend = FusedAttention (sub-backend 1)
[INFO | DotProductAttention]: Running with FusedAttention backend (sub-backend 1)
torch.Size([1024, 2, 2048])
@cyanguwa What we should probably do here, though, is extend our support check: if the backward pass is going to be used (because the inputs require gradients), we should check that both fwd and bwd are supported before choosing fused attention (see the sketch below).
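A self-contained sketch of that idea; the helper below is hypothetical and only encodes the single constraint from this issue (bprop with head dim > 128 needs cuDNN 9.5+), not Transformer Engine's real support matrix.

```python
# Hypothetical sketch: only pick the fused (cuDNN) backend when every kernel
# that will actually run is supported. The head-dim limits are illustrative.
from dataclasses import dataclass

@dataclass
class AttnCase:
    head_dim_qk: int
    head_dim_v: int
    cudnn_version: tuple   # e.g. (9, 1, 0)
    is_training: bool      # True when inputs require grad, i.e. bwd will run

def pick_backend(case: AttnCase) -> str:
    fwd_ok = max(case.head_dim_qk, case.head_dim_v) <= 256
    # bprop with head dim > 128 is only available on cuDNN 9.5+
    bwd_ok = (max(case.head_dim_qk, case.head_dim_v) <= 128
              or case.cudnn_version >= (9, 5, 0))
    if fwd_ok and (bwd_ok or not case.is_training):
        return "FusedAttention"
    return "UnfusedDotProductAttention"

# The configuration from this issue: fwd is fine, bwd is not on cuDNN 9.1.
print(pick_backend(AttnCase(192, 128, (9, 1, 0), is_training=True)))  # Unfused...
print(pick_backend(AttnCase(192, 128, (9, 5, 0), is_training=True)))  # FusedAttention
```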
That’s so cool—this has been a huge help to me!
I've added is_training as a parameter when checking backend support, so we can distinguish training and inference.
Anyway, hope the issue has been resolved! Closing the ticket for now. Thanks.
Running with config={'transformer_engine_version': '2.2.0+d0c452cc', 'compute_capability': 'sm90', 'flash_attn_version': 'not installed', 'flash_attn_3_version': 'not installed', 'cudnn_version': '9.8.0', 'qkv_type': <class 'torch.Tensor'>, 'qkv_dtype': torch.bfloat16, 'qkv_layout': 'thd_thd_thd', 'batch_size': 2, 'num_heads': 1, 'num_gqa_groups': 1, 'max_seqlen_q': 2176, 'max_seqlen_kv': 2176, 'head_dim_qk': 192, 'head_dim_v': 128, 'attn_mask_type': 'padding_causal', 'window_size': (-1, 0), 'alibi_slopes_shape': None, 'core_attention_bias_type': 'no_bias', 'core_attention_bias_shape': None, 'core_attention_bias_requires_grad': False, 'pad_between_seqs': False, 'attention_dropout': 0.0, 'context_parallel': False, 'deterministic': False, 'is_training': True, 'fp8': False, 'fp8_meta': {'fp8_checkpoint': False, 'fp8_group': None}, 'inference_params': None}
DEBUG: Disabling UnfusedDotProductAttention for qkv_format = thd
DEBUG: Disabling FusedAttention as no backend supports the provided input
DEBUG: Available backends = {FlashAttention=False, FusedAttention=False, UnfusedDotProductAttention=False}
DEBUG: Selected backend = NoBackend
My cuDNN version is already 9.8, but there's still an error. What's going on?