TransformerEngine
TransformerEngine v1.2.1 throws CuDNN frontend error on H100 GPU (AWS p5.48xlarge instance)
Hi, we are running into a TransformerEngine-related error when training a GPT model on H100 GPUs (AWS p5.48xlarge). The error log is below.
Error:
RuntimeError: /fsx/sbuasai/test_te/deps/TransformerEngine/transformer_engine/common/fused_attn/fused_attn_f16_arbitrary_seqlen.cu:227 in function operator(): cuDNN Error: [cudnn_frontend] Error: No execution plans built successfully.. For more information, enable cuDNN error logging by setting CUDNN_LOGERR_DBG=1 and CUDNN_LOGDEST_DBG=stderr in the environment.
output_tensors = tex.fused_attn_fwd(
return fn(*args, **kwargs)
File "/home/ubuntu/.conda/envs/megatron_bench/lib/python3.10/site-packages/transformer_engine/pytorch/attention.py", line 1835, in forward
RuntimeError: /fsx/sbuasai/test_te/deps/TransformerEngine/transformer_engine/common/fused_attn/fused_attn_f16_arbitrary_seqlen.cu:227 in function operator(): cuDNN Error: [cudnn_frontend] Error: No execution plans built successfully.. For more information, enable cuDNN error logging by setting CUDNN_LOGERR_DBG=1 and CUDNN_LOGDEST_DBG=stderr in the environment.
out, aux_ctx_tensors = fused_attn_fwd(
File "/home/ubuntu/.conda/envs/megatron_bench/lib/python3.10/site-packages/transformer_engine/pytorch/cpp_extensions/fused_attn.py", line 811, in fused_attn_fwd
output = FusedAttnFunc.apply(
File "/home/ubuntu/.conda/envs/megatron_bench/lib/python3.10/site-packages/torch/autograd/function.py", line 539, in apply
output_tensors = tex.fused_attn_fwd(
RuntimeError: /fsx/sbuasai/test_te/deps/TransformerEngine/transformer_engine/common/fused_attn/fused_attn_f16_arbitrary_seqlen.cu:227 in function operator(): cuDNN Error: [cudnn_frontend] Error: No execution plans built successfully.. For more information, enable cuDNN error logging by setting CUDNN_LOGERR_DBG=1 and CUDNN_LOGDEST_DBG=stderr in the environment.
return super().apply(*args, **kwargs) # type: ignore[misc]
File "/home/ubuntu/.conda/envs/megatron_bench/lib/python3.10/site-packages/transformer_engine/pytorch/attention.py", line 1616, in forward
output_tensors = tex.fused_attn_fwd(
RuntimeError: /fsx/sbuasai/test_te/deps/TransformerEngine/transformer_engine/common/fused_attn/fused_attn_f16_arbitrary_seqlen.cu:227 in function operator(): cuDNN Error: [cudnn_frontend] Error: No execution plans built successfully.. For more information, enable cuDNN error logging by setting CUDNN_LOGERR_DBG=1 and CUDNN_LOGDEST_DBG=stderr in the environment.
out, aux_ctx_tensors = fused_attn_fwd(
File "/home/ubuntu/.conda/envs/megatron_bench/lib/python3.10/site-packages/transformer_engine/pytorch/cpp_extensions/fused_attn.py", line 811, in fused_attn_fwd
output_tensors = tex.fused_attn_fwd(
RuntimeError: /fsx/sbuasai/test_te/deps/TransformerEngine/transformer_engine/common/fused_attn/fused_attn_f16_arbitrary_seqlen.cu:227 in function operator(): cuDNN Error: [cudnn_frontend] Error: No execution plans built successfully.. For more information, enable cuDNN error logging by setting CUDNN_LOGERR_DBG=1 and CUDNN_LOGDEST_DBG=stderr in the environment.
[2024-02-02 01:08:10,718] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 24721 closing signal SIGTERM
[2024-02-02 01:08:10,750] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 24718) of binary: /home/ubuntu/.conda/envs/megatron_bench/bin/python
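The error message itself suggests enabling cuDNN error logging. As a minimal sketch (the helper name `with_cudnn_error_logging` is hypothetical; only the two variable names and values come from the message above), the environment for a rerun could be prepared like this:

```python
import os


def with_cudnn_error_logging(base_env=None):
    """Return a copy of the environment with cuDNN error logging enabled.

    The variable names and values are taken verbatim from the error message;
    the helper itself is only an illustration.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["CUDNN_LOGERR_DBG"] = "1"
    env["CUDNN_LOGDEST_DBG"] = "stderr"
    return env


# Example (not executed here): rerun the failing job with logging enabled.
# import subprocess
# subprocess.run(["bash", "train.sh"], env=with_cudnn_error_logging(), check=True)
```

The helper copies the environment rather than mutating it, so the extra logging applies only to the rerun.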
Steps to reproduce:
- Create the conda env with conda env create -f megatron_bench.yml and activate it with conda activate megatron_bench.
- Install flash-attention, TransformerEngine, apex, and Megatron-LM from source as declared in install_deps.sh.
- Update the path to the data in train.sh.
- Run training with train.sh.
megatron_bench.yml:
name: megatron_bench
channels:
- pytorch
- nvidia
dependencies:
- python=3.10
- pip
- conda:
- python=3.10
- pytorch=2.1.2
- pytorch-cuda=12.1
- torchvision
- torchaudio
install_deps.sh:
#!/bin/bash
set -e
# ===================================================
# Set dependencies pin
# ===================================================
FLASH_ATTN_BRANCH='v2.0.4'
TE_BRANCH='v1.2.1'
APEX_HASH='6c8f384b40a596bbed960f5e8d9a808ebd0e93d8'
MEGATRON_LM_HASH='2c3468a49ed51324ae9b442e0d88416f1b29422b'
# ===================================================
# Install megatron python dependencies
# ===================================================
conda install -y regex astunparse ninja pyyaml mkl mkl-include setuptools cmake cffi typing_extensions future six requests libcurl dataclasses packaging
pip install six regex tensorboardX daal4py deepspeed pyarrow pybind11 numpy==1.23.5
# ===================================================
# Install flash-attention
# ===================================================
cd $DEPS_DIR
if [ ! -d "$DEPS_DIR/flash-attention" ]; then
    git clone -b ${FLASH_ATTN_BRANCH} https://github.com/Dao-AILab/flash-attention.git
    cd flash-attention
    python setup.py install
fi
# ===================================================
# Install TransformerEngine
# ===================================================
cd $DEPS_DIR
if [ ! -d "$DEPS_DIR/TransformerEngine" ]; then
    git clone --branch stable --recursive https://github.com/NVIDIA/TransformerEngine.git
    cd TransformerEngine
    git checkout ${TE_BRANCH}
    git submodule update --init --recursive
    export NVTE_FRAMEWORK="pytorch"
    export CUDNN_PATH=/usr/local/cuda-12.1
    export CUDNN_INCLUDE_DIR=/usr/local/cuda-12.1/include
    pip install .
fi
# ===================================================
# Install apex
# ===================================================
cd $DEPS_DIR
if [ ! -d "$DEPS_DIR/apex" ]; then
    git clone https://github.com/NVIDIA/apex.git
    cd apex
    git checkout ${APEX_HASH}
    pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation \
        --config-settings "--global-option=--cpp_ext" \
        --config-settings "--global-option=--cuda_ext" \
        ./
fi
# ===================================================
# Clone Megatron-LM for scripts
# ===================================================
cd $DEPS_DIR
if [ ! -d "$DEPS_DIR/Megatron-LM" ]; then
    git clone https://github.com/NVIDIA/Megatron-LM.git
    cd Megatron-LM
    git checkout ${MEGATRON_LM_HASH}
    cd $DEPS_DIR
fi
train.sh:
#!/bin/bash
# Exported so that install_deps.sh, which runs in a child shell, can see it
export DEPS_DIR="$(pwd)/deps"
DATASET_DIR="<DECLARE YOUR DATASET DIRECTORY HERE>"
bash install_deps.sh
export GPT_HOME="${DATASET_DIR}"
export DATASET="${DATASET_DIR}/my-gpt2_text_document/my-gpt2_text_document"
export CHECKPOINT_PATH="${DATASET_DIR}/checkpoints/gptmodel"
export VOCAB_FILE="${DATASET_DIR}/gpt2-vocab.json"
export MERGES_FILE="${DATASET_DIR}/gpt2-merges.txt"
export DATA_PATH="${DATASET_DIR}/my-gpt2_text_document/my-gpt2_text_document"
export NVTE_BIAS_GELU_NVFUSION=0
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512
export CUDA_DEVICE_MAX_CONNECTIONS=1
export NCCL_DEBUG=INFO
export NCCL_PROTO=LL,LL128,Simple
export FI_PROVIDER=efa
export FI_EFA_USE_DEVICE_RDMA=1
export RDMAV_FORK_SAFE=1
# remove previous checkpoints
rm -rf ${DATASET_DIR}/checkpoints/
torchrun --nproc-per-node 8 --nnodes 1 \
${DEPS_DIR}/Megatron-LM/pretrain_gpt.py \
--tensor-model-parallel-size 1 \
--pipeline-model-parallel-size 1 \
--sequence-parallel \
--num-layers 24 \
--hidden-size 1024 \
--num-attention-heads 16 \
--micro-batch-size 1 \
--global-batch-size 8 \
--seq-length 2048 \
--max-position-embeddings 2048 \
--train-iters 1200 \
--lr-decay-iters 320000 \
--save ${CHECKPOINT_PATH} \
--load ${CHECKPOINT_PATH} \
--data-path ${DATA_PATH} \
--vocab-file ${VOCAB_FILE} \
--merge-file ${MERGES_FILE} \
--split 949,50,1 \
--distributed-backend nccl \
--lr 0.00015 \
--lr-decay-style cosine \
--min-lr 1.0e-5 \
--weight-decay 1e-2 \
--clip-grad 1.0 \
--lr-warmup-fraction .01 \
--log-interval 100 \
--save-interval 10000 \
--eval-interval 1000 \
--eval-iters 10 \
--adam-beta1 0.9 \
--adam-beta2 0.95 \
--init-method-std 0.006 \
--bf16 \
--transformer-impl transformer_engine \
--attention-softmax-in-fp32
Hi @sirutBuasai, what is the cuDNN version you are using?
The cuDNN installed with torch==2.1.2 is 8.9.2:
(megatron_bench) ubuntu@ip-10-0-0-88:~$ python -c "import torch;print(torch.backends.cudnn.version())"
8902
Hi @sirutBuasai , could you try upgrading to cuDNN 8.9.7+ please?
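To compare the reported integer against the suggested 8.9.7 minimum, a small decoder can help. This is a sketch, assuming the integer follows cuDNN's `CUDNN_VERSION` encoding (major*1000 + minor*100 + patch up to cuDNN 8, major*10000 + minor*100 + patch from cuDNN 9); the function names are hypothetical.

```python
def decode_cudnn_version(raw):
    """Decode an integer such as 8902 into a (major, minor, patch) tuple."""
    if raw >= 90000:
        # cuDNN 9 and later: major*10000 + minor*100 + patch
        return raw // 10000, (raw % 10000) // 100, raw % 100
    # cuDNN 8 and earlier: major*1000 + minor*100 + patch
    return raw // 1000, (raw % 1000) // 100, raw % 100


def meets_minimum(raw, minimum=(8, 9, 7)):
    """Return True if the decoded version is at least `minimum`."""
    return decode_cudnn_version(raw) >= minimum


print(decode_cudnn_version(8902))  # (8, 9, 2)
print(meets_minimum(8902))         # False
```

In a live environment the input would be `torch.backends.cudnn.version()`. Note that this reports the cuDNN that PyTorch loads, which may differ from the one TE was built against via CUDNN_PATH.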
Will do. In the meantime, is there a TE version that is built with cuDNN 8.9.2?
I think it's probably v0.10, but I'd rather you roll forward with cuDNN than backward with TE. There's been a lot of development in the last year or so. If it's easier, you can use the NGC pytorch container, which has the latest TE (1.3) and cuDNN (9.0): nvcr.io/nvidia/pytorch:24.01-py3
@cyanguwa I think we still should catch this error from cuDNN Frontend and just disable cuDNN's implementation of attention in this case.
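The fallback idea can be sketched generically as follows. This is not TE's actual code: `fused_fn` and `unfused_fn` are hypothetical callables standing in for the cuDNN fused attention path and any other attention backend, and the match string is taken from the error message in this issue.

```python
def attention_with_fallback(fused_fn, unfused_fn, *args, **kwargs):
    """Try the fused path; fall back if cuDNN cannot build an execution plan."""
    try:
        return fused_fn(*args, **kwargs)
    except RuntimeError as err:
        # Only swallow the specific cuDNN frontend failure; re-raise anything else.
        if "No execution plans built successfully" not in str(err):
            raise
        return unfused_fn(*args, **kwargs)
```

Matching on the message text is brittle; a real implementation would more likely check backend support up front, before the first forward pass.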
@sirutBuasai Was your problem solved? Could you share the solution? I am hitting the same problem.
@liu21yd, we ended up using TE v0.10, but it is pretty old. I haven't tried upgrading cuDNN and TE together, but that would be a place to start.
We recently observed similar issues with every combination of TE 1.4/1.7 and cuDNN 8.9.4/8.9.7. In our case, the fused_attn test in this repository also fails, and the frontend toolkit (Megatron-LM) does not work either.
Note that our operating system is Rocky Linux, not a Debian-based one.
As a workaround, we eventually set NVTE_FUSED_ATTN=0 to disable the fused attention kernels, and the issue went away.
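If the job is launched from Python rather than a shell script, the same workaround can be applied in-process. This is a minimal sketch, assuming the variable is set before transformer_engine is imported (the conservative choice, since this thread does not state exactly when TE reads it):

```python
import os

# Disable TE's fused attention kernels, as in the workaround above.
# Set this before importing transformer_engine or building the model.
os.environ["NVTE_FUSED_ATTN"] = "0"
```

With the train.sh shown earlier, the equivalent is an extra export NVTE_FUSED_ATTN=0 line next to the other exports.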