How to debug CUDNN_STATUS_EXECUTION_FAILED?
I'm running my code with:
env CUDNN_LOGERR_DBG=1 CUDNN_LOGDEST_DBG=stderr torchrun --standalone --nproc_per_node=8 -m extra_scripts.model_playground_train
and getting errors like:
[rank5]: RuntimeError: /home/ved/TransformerEngine/transformer_engine/common/fused_attn/fused_attn_f16_arbitrary_seqlen.cu:358 in function fused_attn_arbitrary_seqlen_fwd_impl: cuDNN Error: execute(handle, plan->get_raw_desc(), variant_pack_descriptor.get_ptr()) failed with code: CUDNN_STATUS_EXECUTION_FAILED, and message: form_kernel_args(rtc, kernelParamFlatBuf.data(), var, arg_ptrs, stream). For more information, enable cuDNN error logging by setting CUDNN_LOGERR_DBG=1 and CUDNN_LOGDEST_DBG=stderr in the environment.
I'm using a pretty standard DotProductAttention:
self.te_attn = te.DotProductAttention(
    num_attention_heads=24,
    kv_channels=self.head_dim,  # 128
    qkv_format="thd",  # tokens, head, dim
    attn_mask_type="padding",
)
and I'm also calling it in a pretty standard way (all the assertions pass):
assert qkv.shape == (total, 3, self.num_heads, self.head_dim)
q, k, v = torch.unbind(qkv, dim=1)
assert q.shape == k.shape == v.shape
assert q.shape == (total, self.num_heads, self.head_dim)
assert cu_seqlens.shape[0] == B + 1
xy: torch.Tensor = self.te_attn(
    q, k, v,
    cu_seqlens_q=cu_seqlens,
    cu_seqlens_kv=cu_seqlens,
    max_seqlen_q=max_seqlen_in_batch,
    max_seqlen_kv=max_seqlen_in_batch,
)
I'm kind of stuck on how to debug this. It seems like something is going wrong when the kernel reads its inputs, but I'm not sure. How should I proceed with debugging this?
Is there some chance that I need to use a specific stride? I know my shapes are correct, but it's definitely possible my strides are wrong.
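In case it helps, this is the kind of check I'm adding to rule out stride problems (just a debugging sketch; the .contiguous() calls are a guess, not a known fix):

```python
# Debugging sketch: torch.unbind(qkv, dim=1) returns views into qkv,
# so q/k/v have correct shapes but are not contiguous.
print(q.shape, q.stride(), q.is_contiguous())
print(k.shape, k.stride(), k.is_contiguous())
print(v.shape, v.stride(), v.is_contiguous())

# Guess: force contiguous copies before handing the tensors to te_attn,
# to see whether the failure is stride-related.
q, k, v = (t.contiguous() for t in (q, k, v))
```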
@vedantroy Could you post more information about your environment - most importantly TE, CUDA and cuDNN versions. Also, could you try the failing case with CUDNN_LOGLEVEL_DBG=3 rather than CUDNN_LOGERR_DBG=1 and post a snippet of the log before the error? It should list the cuDNN call it is trying to execute, including the shapes and strides.
CUDA version:
my-compute-node:~/training/replay$ /usr/local/cuda/bin/nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Aug_15_22:02:13_PDT_2023
Cuda compilation tools, release 12.2, V12.2.140
Build cuda_12.2.r12.2/compiler.33191640_0
cuDNN + Transformer Engine versions:
transformer_engine 1.8.0+3ec998e
nvidia-cudnn-cu12 9.1.0.70
More logs using the command env CUDNN_LOGERR_DBG=3 CUDNN_LOGDEST_DBG=stderr torchrun --standalone --nproc_per_node=8 -m extra_scripts.model_playground_train 2>log.txt
E! CuDNN (v90100 70) function cudnnBackendExecute() called:
e! Error: CUDNN_STATUS_EXECUTION_FAILED; Reason: form_kernel_args(rtc, kernelParamFlatBuf.data(), var, arg_ptrs, stream)
e! Error: CUDNN_STATUS_EXECUTION_FAILED; Reason: plan.getEnginePtr()->execute(vars, handle->streamId)
e! Time: 2024-08-16T23:10:17.196169 (0d+0h+0m+3s since start)
e! Process=2088716; Thread=2088716; GPU=NULL; Handle=NULL; StreamId=NULL.
OK, further updates. It looks like it's failing on the backward pass only. And ... if I use only 2 layers in my model instead of 4, it doesn't fail. Is it possible I'm hitting CUDA OOM issues? (Seems unlikely, since I run this model with 48+ layers when using FA2.)
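One thing I can do to rule out memory pressure is to log GPU memory right before the failing backward pass (a sketch using standard PyTorch CUDA memory queries; where exactly to call it depends on the training loop):

```python
import torch

# Sketch: log allocator stats and free device memory per rank just before
# the backward pass that fails, to see whether we're close to the limit.
free_bytes, total_bytes = torch.cuda.mem_get_info()
print(
    f"allocated={torch.cuda.memory_allocated() / 2**30:.2f} GiB, "
    f"reserved={torch.cuda.memory_reserved() / 2**30:.2f} GiB, "
    f"free={free_bytes / 2**30:.2f} GiB of {total_bytes / 2**30:.2f} GiB"
)
```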
Hi @vedantroy, I tried to reproduce your config, and it seemed to pass my tests.
Arch: Hopper
Container: nvcr.io/nvidia/pytorch:24.07-py3 (CUDA 12.5.1.007)
TE 1.8: https://github.com/NVIDIA/TransformerEngine/archive/refs/tags/v1.8.zip
cuDNN 9.1: https://developer.download.nvidia.com/compute/cudnn/redist/cudnn/linux-x86_64/cudnn-linux-x86_64-9.1.0.70_cuda12-archive.tar.xz
tests:
model_configs_layout_thd = {
    # test: b, h, hg, d, sq, skv, p, mask, bias
    "layout_0_1": ModelConfig(1, 24, 24, 128, 128, 128, 0.0, "padding", "no_bias"),
}
pytest -s -v tests/pytorch/fused_attn/test_fused_attn.py::test_dpa_qkv_layout_thd
Could you extract a small reproducer with just the DotProductAttention calls from your application? Maybe we can have a look at how it differs from my tests.
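Something along these lines would be enough (a rough skeleton only; the dtype, sequence lengths, and cu_seqlens construction are placeholders):

```python
import torch
import transformer_engine.pytorch as te

num_heads, head_dim = 24, 128
seqlens = torch.tensor([128, 96, 64], dtype=torch.int32, device="cuda")  # placeholder lengths
cu_seqlens = torch.nn.functional.pad(torch.cumsum(seqlens, dim=0, dtype=torch.int32), (1, 0))
total = int(seqlens.sum())

attn = te.DotProductAttention(
    num_attention_heads=num_heads,
    kv_channels=head_dim,
    qkv_format="thd",
    attn_mask_type="padding",
)

q, k, v = (
    torch.randn(total, num_heads, head_dim, dtype=torch.bfloat16, device="cuda", requires_grad=True)
    for _ in range(3)
)
out = attn(
    q, k, v,
    cu_seqlens_q=cu_seqlens,
    cu_seqlens_kv=cu_seqlens,
    max_seqlen_q=int(seqlens.max()),
    max_seqlen_kv=int(seqlens.max()),
)
out.sum().backward()  # the failure reportedly shows up in the backward pass
```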
Thanks, Charlene
@cyanguwa -- I'll try to make a minimal reproduction soon. For now, a few more details:
- It only happens with FSDP enabled on multiple ranks.
- It does not happen if I set os.environ["NVTE_FUSED_ATTN"] = "0" (quick sketch of that workaround below).
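For completeness, this is roughly how I'm disabling the fused backend while debugging (a workaround sketch, not a fix; I'm assuming the variable needs to be set before Transformer Engine picks an attention backend):

```python
import os

# Workaround sketch (assumption: set the env var before TE selects an
# attention backend, i.e. before importing/constructing the model).
os.environ["NVTE_FUSED_ATTN"] = "0"  # disable the cuDNN fused attention path

import transformer_engine.pytorch as te  # imported only after the env var is set
```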
I'm also facing this error.
I am using Transformer Engine 1.10.0+08a85d3.
[rank13]: File "/usr/local/lib/python3.10/dist-packages/transformer_engine/pytorch/attention.py", line 6915, in forward
[rank13]: return self.fused_attention(
[rank13]: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank13]: return self._call_impl(*args, **kwargs)
[rank13]: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank13]: return forward_call(*args, **kwargs)
[rank13]: File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/eval_frame.py", line 600, in _fn
[rank13]: return fn(*args, **kwargs)
[rank13]: File "/usr/local/lib/python3.10/dist-packages/transformer_engine/pytorch/attention.py", line 6032, in forward
[rank13]: output = FusedAttnFunc.apply(
[rank13]: File "/usr/local/lib/python3.10/dist-packages/torch/autograd/function.py", line 574, in apply
[rank13]: return super().apply(*args, **kwargs) # type: ignore[misc]
[rank13]: File "/usr/local/lib/python3.10/dist-packages/transformer_engine/pytorch/attention.py", line 5428, in forward
[rank13]: out_ret, aux_ctx_tensors = fused_attn_fwd(
[rank13]: File "/usr/local/lib/python3.10/dist-packages/transformer_engine/pytorch/cpp_extensions/fused_attn.py", line 1006, in fused_attn_fwd
[rank13]: output_tensors = tex.fused_attn_fwd(
[rank13]: RuntimeError: /tmp/pip-req-build-yu5jl144/transformer_engine/common/fused_attn/fused_attn_f16_arbitrary_seqlen.cu:378 in function fused_attn_arbitrary_seqlen_fwd_impl: cuDNN Error: execute(handle, plan->get_raw_desc(), variant_pack_descriptor.get_ptr()) failed with message: , and code: CUDNN_STATUS_EXECUTION_FAILED. For more information, enable cuDNN error logging by setting CUDNN_LOGERR_DBG=1 and CUDNN_LOGDEST_DBG=stderr in the environment.
Are you using the same data, or fixed data?
@tgkul @vedantroy