
[PyTorch] Unable to run FP8 example on 5090

Open · TidalPaladin opened this issue 7 months ago

Describe the bug

Running the PyTorch example code from the Transformer Engine documentation fails on an RTX 5090.

Traceback (most recent call last):
  File "/workspace/fp8.py", line 22, in <module>
    loss.backward()
  File "/usr/local/lib/python3.12/dist-packages/torch/_tensor.py", line 648, in backward
    torch.autograd.backward(
  File "/usr/local/lib/python3.12/dist-packages/torch/autograd/__init__.py", line 353, in backward
    _engine_run_backward(
  File "/usr/local/lib/python3.12/dist-packages/torch/autograd/graph.py", line 824, in _engine_run_backward
    return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/autograd/function.py", line 307, in apply
    return user_fn(self, *args)
           ^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/transformer_engine/pytorch/module/linear.py", line 552, in backward
    wgrad, grad_bias_, _, rs_out = general_gemm(
                                   ^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/transformer_engine/pytorch/cpp_extensions/gemm.py", line 141, in general_gemm
    out, bias_grad, gelu_input, extra_output = tex.generic_gemm(*args, **kwargs)
                                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: /workspace/transformerengine/transformer_engine/common/gemm/cublaslt_gemm.cu:395 in function cublas_gemm: Assertion failed: status != CUBLAS_STATUS_NOT_SUPPORTED. Unable to find suitable cuBLAS GEMM algorithm

Steps/Code to reproduce bug

Run the following example code in the PyTorch container

import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

# Set dimensions.
in_features = 768
out_features = 3072
hidden_size = 2048

# Initialize model and inputs.
model = te.Linear(in_features, out_features, bias=True)
inp = torch.randn(hidden_size, in_features, device="cuda")

# Create an FP8 recipe. Note: All input args are optional.
fp8_recipe = recipe.DelayedScaling(margin=0, fp8_format=recipe.Format.E4M3)

# Enable autocasting for the forward pass
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    out = model(inp)

loss = out.sum()
loss.backward()

Note that the error happens both with the PyTorch container and on bare metal with a pip install.
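As a quick diagnostic, the snippet below prints what the installed stack reports for the device and for FP8 support. This is a minimal sketch: it assumes check_fp8_support() is exposed under transformer_engine.pytorch.fp8 in the installed release (the helper may be named differently in other versions), so the import is guarded.

import torch

# The RTX 5090 (consumer Blackwell) reports compute capability (12, 0).
print(torch.cuda.get_device_name(0), torch.cuda.get_device_capability(0))

# Ask Transformer Engine whether it believes FP8 is usable on this device.
# Assumption: check_fp8_support() exists in this release and returns a
# (supported, reason) tuple; guard the import in case it does not.
try:
    from transformer_engine.pytorch.fp8 import check_fp8_support
    supported, reason = check_fp8_support()
    print("FP8 supported:", supported, "-", reason)
except ImportError:
    print("check_fp8_support is not exposed by this Transformer Engine version")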

Environment overview (please complete the following information)

Host machine:

  • OS Ubuntu 24.04.2 LTS x86_64
  • Driver 570.124.06
  • CUDA 12.8
  • Docker version 26.1.3, build 26.1.3-0ubuntu1~24.04.1
  • nvidia-container-toolkit/unknown,now 1.17.5-1 amd64

Docker command:

sudo docker run -it --gpus all -v $(pwd):/workspace nvcr.io/nvidia/pytorch:25.03-py3 python3 fp8.py

where fp8.py is the example above.

Device details

  • 2x RTX5090

TidalPaladin · Apr 08 '25 13:04

I have the same problem.

crinard · Apr 22 '25 22:04

Hey guys, I think @sudhakarsingh27 merged a fix for this in #1659, but building from that PR fails miserably for me. When will this be available in a stable release?
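For anyone who wants to try the fix before a release, a source install along these lines should pick it up once it is merged. This is only a sketch: NVTE_FRAMEWORK limits the build to the PyTorch bindings, and you would point pip at whichever branch or commit actually contains the fix rather than assuming it is on main.

# Build only the PyTorch bindings from a given branch/commit (sketch).
NVTE_FRAMEWORK=pytorch pip install --no-build-isolation git+https://github.com/NVIDIA/TransformerEngine.git@main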

rajputs37 · Apr 27 '25 11:04

Hi! I'm also unable to use the FP8 version of a transformer for img2vid. I get this cuDNN error.

Error: (screenshot attached)

Specs: (screenshot attached)

I hope this helps! Thank you for all the hard work.

austinyearlykim · Jun 14 '25 17:06

Hi, I encounter the same issue on an RTX 6000 Pro GPU with nvcr.io/nvidia/pytorch:25.04-py3 and the te1_blackwell_ea branch.

0: [rank0]: Traceback (most recent call last):
0: [rank0]:   File "/workspace/bert/run_pretraining.py", line 2225, in <module>
0: [rank0]:     args, final_loss, train_time_raw = main()
0: [rank0]:                                        ^^^^^^
0: [rank0]:   File "/workspace/bert/run_pretraining.py", line 1582, in main
0: [rank0]:     model = fwd_loss_bwd_trainer.capture_bert_model_segment_graph(model, use_cuda_graph, graph_capture_large_batch)
0: [rank0]:             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
0: [rank0]:   File "/workspace/bert/fwd_loss_bwd_trainer.py", line 142, in capture_bert_model_segment_graph
0: [rank0]:     bert_model_segment.bert.encoder = make_graphed_callables(
0: [rank0]:                                       ^^^^^^^^^^^^^^^^^^^^^^^
0: [rank0]:   File "/usr/local/lib/python3.12/dist-packages/transformer_engine/pytorch/graph.py", line 612, in make_graphed_callables
0: [rank0]:     graphed_callables = _make_graphed_callables(
0: [rank0]:                         ^^^^^^^^^^^^^^^^^^^^^^^^
0: [rank0]:   File "/usr/local/lib/python3.12/dist-packages/transformer_engine/pytorch/graph.py", line 209, in _make_graphed_callables
0: [rank0]:     grad_inputs = torch.autograd.grad(
0: [rank0]:                   ^^^^^^^^^^^^^^^^^^^^
0: [rank0]:   File "/usr/local/lib/python3.12/dist-packages/torch/autograd/__init__.py", line 502, in grad
0: [rank0]:     result = _engine_run_backward(
0: [rank0]:              ^^^^^^^^^^^^^^^^^^^^^
0: [rank0]:   File "/usr/local/lib/python3.12/dist-packages/torch/autograd/graph.py", line 824, in _engine_run_backward
0: [rank0]:     return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
0: [rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
0: [rank0]:   File "/usr/local/lib/python3.12/dist-packages/torch/autograd/function.py", line 307, in apply
0: [rank0]:     return user_fn(self, *args)
0: [rank0]:            ^^^^^^^^^^^^^^^^^^^^
0: [rank0]:   File "/workspace/bert/te_layers.py", line 815, in backward
0: [rank0]:     fc2_dgrad, _ = ext.fp8_gemm(
0: [rank0]:                    ^^^^^^^^^^^^^
0: [rank0]:   File "/usr/local/lib/python3.12/dist-packages/transformer_engine/pytorch/cpp_extensions/gemm.py", line 289, in fp8_gemm
0: [rank0]:     _ = fn(*args)
0: [rank0]:         ^^^^^^^^^
0: [rank0]:   File "/usr/local/lib/python3.12/dist-packages/torch/_ops.py", line 1158, in __call__
0: [rank0]:     return self._op(*args, **(kwargs or {}))
0: [rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
0: [rank0]: RuntimeError: /workspace/transformerengine/transformer_engine/common/gemm/cublaslt_gemm.cu:280 in function cublas_gemm: Assertion failed: status != CUBLAS_STATUS_NOT_SUPPORTED. Unable to find suitable cuBLAS GEMM algorithm

Has this issue been fixed?

William12github · Jul 06 '25 13:07