
How to debug CUDNN_STATUS_EXECUTION_FAILED?

Open vedantroy opened this issue 1 year ago • 7 comments

I'm running my code with:

env CUDNN_LOGERR_DBG=1  CUDNN_LOGDEST_DBG=stderr torchrun --standalone --nproc_per_node=8 -m extra_scripts.model_playground_train

and getting errors like:

[rank5]: RuntimeError: /home/ved/TransformerEngine/transformer_engine/common/fused_attn/fused_attn_f16_arbitrary_seqlen.cu:358 in function fused_attn_arbitrary_seqlen_fwd_impl: cuDNN Error: execute(handle, plan->get_raw_desc(), variant_pack_descriptor.get_ptr()) failed with code: CUDNN_STATUS_EXECUTION_FAILED, and message: form_kernel_args(rtc, kernelParamFlatBuf.data(), var, arg_ptrs, stream). For more information, enable cuDNN error logging by setting CUDNN_LOGERR_DBG=1 and CUDNN_LOGDEST_DBG=stderr in the environment.

I'm using a pretty standard DotProductAttention:

        self.te_attn = te.DotProductAttention(
            num_attention_heads=24,
            kv_channels=self.head_dim,  # 128
            qkv_format="thd",  # tokens, heads, dim
            attn_mask_type="padding",
        )

and I'm also calling it in a pretty standard way (all the assertions pass):

                assert qkv.shape == (total, 3, self.num_heads, self.head_dim)
                q, k, v = torch.unbind(qkv, dim=1)

                assert q.shape == k.shape == v.shape
                assert q.shape == (total, self.num_heads, self.head_dim)
                assert cu_seqlens.shape[0] == B + 1

                xy: torch.Tensor = self.te_attn(
                    q, k, v,
                    cu_seqlens_q=cu_seqlens,
                    cu_seqlens_kv=cu_seqlens,
                    max_seqlen_q=max_seqlen_in_batch,
                    max_seqlen_kv=max_seqlen_in_batch,
                )

I'm kind of stuck on how to debug this. It seems like something is wrong with how the inputs are being read, but I'm not sure. How should I proceed with debugging this?

vedantroy avatar Aug 15 '24 23:08 vedantroy

Is there some chance that I need to use a specific stride? I know my shapes are correct, but it's definitely possible my stride is wrong.
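
As a quick sanity check, something like the sketch below would print the strides and force contiguous layouts before the call (q/k/v as in my snippet above; torch.unbind returns views, which are typically non-contiguous):

# Sketch: inspect strides and force dense layouts before the attention call.
print("q:", q.shape, q.stride(), q.is_contiguous())
print("k:", k.shape, k.stride(), k.is_contiguous())
print("v:", v.shape, v.stride(), v.is_contiguous())
q, k, v = q.contiguous(), k.contiguous(), v.contiguous()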

vedantroy avatar Aug 16 '24 00:08 vedantroy

@vedantroy Could you post more information about your environment - most importantly TE, CUDA and cuDNN versions. Also, could you try the failing case with CUDNN_LOGLEVEL_DBG=3 rather than CUDNN_LOGERR_DBG=1 and post a snippet of the log before the error? It should list the cuDNN call it is trying to execute, including the shapes and strides.
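
For completeness, those versions can also be pulled from inside Python; a minimal sketch using standard torch/TE attributes:

import torch
import transformer_engine

print("TE:", transformer_engine.__version__)
print("CUDA (torch build):", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())  # an int, e.g. 90100 for 9.1.0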

ptrendx avatar Aug 16 '24 16:08 ptrendx

CUDA version:

my-compute-node:~/training/replay$ /usr/local/cuda/bin/nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Aug_15_22:02:13_PDT_2023
Cuda compilation tools, release 12.2, V12.2.140
Build cuda_12.2.r12.2/compiler.33191640_0

cuDNN + Transformer Engine versions:

transformer_engine            1.8.0+3ec998e
nvidia-cudnn-cu12             9.1.0.70

More logs, using the command:

env CUDNN_LOGERR_DBG=3 CUDNN_LOGDEST_DBG=stderr torchrun --standalone --nproc_per_node=8 -m extra_scripts.model_playground_train 2>log.txt

E! CuDNN (v90100 70) function cudnnBackendExecute() called:
e!         Error: CUDNN_STATUS_EXECUTION_FAILED; Reason: form_kernel_args(rtc, kernelParamFlatBuf.data(), var, arg_ptrs, stream)
e!         Error: CUDNN_STATUS_EXECUTION_FAILED; Reason: plan.getEnginePtr()->execute(vars, handle->streamId)
e! Time: 2024-08-16T23:10:17.196169 (0d+0h+0m+3s since start)
e! Process=2088716; Thread=2088716; GPU=NULL; Handle=NULL; StreamId=NULL.

vedantroy avatar Aug 16 '24 23:08 vedantroy

OK, further updates: it looks like it's failing on the backward pass only. And if I use only 2 layers in my model instead of 4, it doesn't fail. Is it possible I'm hitting CUDA OOM issues? (Seems unlikely, since I run this model with 48+ layers when using FA2.)
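
To rule OOM in or out, I can print memory stats right before the failing backward; a minimal sketch using standard torch.cuda counters:

import torch

# free/total device memory in bytes, as reported by the CUDA driver
free, total = torch.cuda.mem_get_info()
print(f"free: {free / 2**30:.2f} GiB of {total / 2**30:.2f} GiB")
print(f"torch allocated: {torch.cuda.memory_allocated() / 2**30:.2f} GiB")
print(f"torch peak:      {torch.cuda.max_memory_allocated() / 2**30:.2f} GiB")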

vedantroy avatar Aug 17 '24 01:08 vedantroy

Hi @vedantroy , I tried to reproduce your config, and it seemed to pass my tests.

Arch: Hopper
Container: nvcr.io/nvidia/pytorch:24.07-py3 (CUDA 12.5.1.007)
TE 1.8: https://github.com/NVIDIA/TransformerEngine/archive/refs/tags/v1.8.zip
cuDNN 9.1: https://developer.download.nvidia.com/compute/cudnn/redist/cudnn/linux-x86_64/cudnn-linux-x86_64-9.1.0.70_cuda12-archive.tar.xz
tests:
model_configs_layout_thd = {
    #       test:             b,  h, hg,   d,   sq,  skv,   p,             mask,             bias
    "layout_0_1": ModelConfig(1, 24, 24, 128, 128, 128, 0.0, "padding", "no_bias"),
}
pytest -s -v tests/pytorch/fused_attn/test_fused_attn.py::test_dpa_qkv_layout_thd

Could you extract a small reproducer code with just the DotProductAttention calls from your application? Maybe we can have a look at how that's different from my tests.
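
For reference, a standalone reproducer along those lines could be as small as the sketch below (the random thd inputs, the two-sequence batch, and the bf16 dtype are my assumptions, not your actual code):

# Hypothetical minimal reproducer: one thd-format DotProductAttention call
# with a padding mask, exercising both forward and backward.
import torch
import transformer_engine.pytorch as te

num_heads, head_dim = 24, 128
seqlens = torch.tensor([128, 96], dtype=torch.int32, device="cuda")  # 2 sequences
cu_seqlens = torch.zeros(len(seqlens) + 1, dtype=torch.int32, device="cuda")
cu_seqlens[1:] = torch.cumsum(seqlens, dim=0)
total = int(seqlens.sum())

q, k, v = (
    torch.randn(total, num_heads, head_dim, dtype=torch.bfloat16,
                device="cuda", requires_grad=True)
    for _ in range(3)
)

attn = te.DotProductAttention(
    num_attention_heads=num_heads,
    kv_channels=head_dim,
    qkv_format="thd",
    attn_mask_type="padding",
)

out = attn(q, k, v,
           cu_seqlens_q=cu_seqlens, cu_seqlens_kv=cu_seqlens,
           max_seqlen_q=int(seqlens.max()), max_seqlen_kv=int(seqlens.max()))
out.sum().backward()  # the failure above was reported on the backward pass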

Thanks, Charlene

cyanguwa avatar Aug 19 '24 21:08 cyanguwa

@cyanguwa -- I'll try to make a minimal reproduction soon. For now, a few more details:

  • It only happens with FSDP enabled on multiple ranks.
  • It does not happen if the fused attention backend is disabled:

os.environ["NVTE_FUSED_ATTN"] = "0"
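
More generally, here's a sketch of the env toggles for isolating TE's attention backends (NVTE_FUSED_ATTN and NVTE_FLASH_ATTN; they need to be set before the first attention call, ideally before importing transformer_engine):

import os

os.environ["NVTE_FUSED_ATTN"] = "0"  # disable the cuDNN fused-attention backend
os.environ["NVTE_FLASH_ATTN"] = "1"  # allow the FlashAttention backend
# With both set to "0", TE falls back to its unfused PyTorch implementation.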

vedantroy avatar Aug 19 '24 23:08 vedantroy

I'm also facing this error.

I am using Transformer Engine 1.10.0+08a85d3:

[rank13]:   File "/usr/local/lib/python3.10/dist-packages/transformer_engine/pytorch/attention.py", line 6915, in forward
[rank13]:     return self.fused_attention(
[rank13]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank13]:     return self._call_impl(*args, **kwargs)
[rank13]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank13]:     return forward_call(*args, **kwargs)
[rank13]:   File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/eval_frame.py", line 600, in _fn
[rank13]:     return fn(*args, **kwargs)
[rank13]:   File "/usr/local/lib/python3.10/dist-packages/transformer_engine/pytorch/attention.py", line 6032, in forward
[rank13]:     output = FusedAttnFunc.apply(
[rank13]:   File "/usr/local/lib/python3.10/dist-packages/torch/autograd/function.py", line 574, in apply
[rank13]:     return super().apply(*args, **kwargs)  # type: ignore[misc]
[rank13]:   File "/usr/local/lib/python3.10/dist-packages/transformer_engine/pytorch/attention.py", line 5428, in forward
[rank13]:     out_ret, aux_ctx_tensors = fused_attn_fwd(
[rank13]:   File "/usr/local/lib/python3.10/dist-packages/transformer_engine/pytorch/cpp_extensions/fused_attn.py", line 1006, in fused_attn_fwd
[rank13]:     output_tensors = tex.fused_attn_fwd(
[rank13]: RuntimeError: /tmp/pip-req-build-yu5jl144/transformer_engine/common/fused_attn/fused_attn_f16_arbitrary_seqlen.cu:378 in function fused_attn_arbitrary_seqlen_fwd_impl: cuDNN Error: execute(handle, plan->get_raw_desc(), variant_pack_descriptor.get_ptr()) failed with message: , and code: CUDNN_STATUS_EXECUTION_FAILED. For more information, enable cuDNN error logging by setting CUDNN_LOGERR_DBG=1 and CUDNN_LOGDEST_DBG=stderr in the environment.

tgkul avatar Sep 26 '24 13:09 tgkul

Are you using the same data, or fixed data?

xuexiao1987 avatar Oct 24 '24 14:10 xuexiao1987

@tgkul @vedantroy

xuexiao1987 avatar Oct 24 '24 14:10 xuexiao1987