                        LSTM model converted to TensorRT is slower than PyTorch on RTX 4090
System Information
- OS: Ubuntu 22.04
- GPU: NVIDIA RTX 4090
- TensorRT Version: 10.11.0.33
- PyTorch Version: 2.7.0
- ONNX Opset: 14
🧠 Problem Summary
I converted a very basic bidirectional LSTM model from PyTorch to ONNX, and then to TensorRT using trtexec. However, inference with the TensorRT engine is slower than PyTorch, which is unexpected.
- PyTorch: ~0.5 ms per forward pass
- TensorRT: ~1.0 ms per forward pass
📦 Model Description
```python
import torch
import torch.nn as nn

# Model config
INPUT_SIZE   = 293
HIDDEN_SIZE  = 128
NUM_LAYERS   = 2
BIDIRECTION  = True
BATCH_FIRST  = True
DROPOUT      = 0.0

# Model instantiation
lstm = nn.LSTM(INPUT_SIZE, HIDDEN_SIZE, NUM_LAYERS,
               bidirectional=BIDIRECTION,
               batch_first=BATCH_FIRST,
               dropout=DROPOUT)
```
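For reference, the expected I/O shapes (these follow directly from PyTorch's LSTM semantics for a batch_first, bidirectional model):

```python
out, (h, c) = lstm(torch.randn(32, 60, INPUT_SIZE))
print(out.shape)  # (32, 60, 256): batch, time, 2 directions * HIDDEN_SIZE
print(h.shape)    # (4, 32, 128):  NUM_LAYERS * 2 directions, batch, HIDDEN_SIZE
print(c.shape)    # (4, 32, 128):  same layout as h
```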
🔄 Conversion Steps
- Export to ONNX:
```python
dummy = torch.randn(32, 60, INPUT_SIZE)
torch.onnx.export(
    lstm, dummy, "pyannet_lstm.onnx",
    opset_version=14,
    input_names=["input"],
    output_names=["output", "h_out", "c_out"],
    dynamic_axes={
        "input":  {0: "batch", 1: "time"},
        "output": {0: "batch", 1: "time"},
        "h_out":  {1: "batch"},
        "c_out":  {1: "batch"},
    },
)
```
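To rule out a decomposition happening already at export time, the ONNX graph can be inspected; at opset 14 an nn.LSTM should export as native ONNX LSTM nodes (a quick check, assuming the onnx package is installed):

```python
import onnx

model = onnx.load("pyannet_lstm.onnx")
onnx.checker.check_model(model)
print(sorted({node.op_type for node in model.graph.node}))
# 'LSTM' should appear here; a long list of MatMul/Sigmoid/Tanh nodes
# would mean the graph was unrolled before TensorRT ever saw it.
```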
- Build TensorRT engine:
```bash
trtexec \
  --onnx=pyannet_lstm.onnx \
  --minShapes=input:1x60x293 \
  --optShapes=input:32x60x293 \
  --maxShapes=input:32x60x293 \
  --saveEngine=lstm.engine
```
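trtexec can also report per-layer timings for the built engine, which makes the kernel breakdown visible without a full Nsight capture (standard trtexec flags):

```bash
trtexec \
  --loadEngine=lstm.engine \
  --shapes=input:32x60x293 \
  --dumpProfile --separateProfileRun
```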
🧪 Performance Benchmark
PyTorch benchmark code (with explicit synchronization, so the clock covers the queued GPU work rather than just the kernel launches):

```python
import time, torch

x = torch.randn(32, 60, 293).cuda()
lstm.to("cuda").eval()
with torch.no_grad():
    torch.cuda.synchronize(); t0 = time.perf_counter()
    for _ in range(1000):
        output, (h, c) = lstm(x)
    torch.cuda.synchronize()  # flush queued kernels before stopping the clock
print(f"{time.perf_counter() - t0:.3f} s / 1000 batches")
```
- Average PyTorch time per batch: ~0.5 ms
- Average TensorRT time per batch: ~1.0 ms (an equivalent engine-side timing loop is sketched below)
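A minimal sketch of that engine-side timing loop, using the TensorRT 10 Python API (tensor names follow the ONNX export above; treat the exact buffer handling as an assumption, not the only way to run the engine):

```python
import time
import tensorrt as trt
import torch

logger = trt.Logger(trt.Logger.WARNING)
with open("lstm.engine", "rb") as f:
    engine = trt.Runtime(logger).deserialize_cuda_engine(f.read())
context = engine.create_execution_context()

x   = torch.randn(32, 60, 293, device="cuda")
out = torch.empty(32, 60, 256, device="cuda")  # 2 directions * HIDDEN_SIZE
h   = torch.empty(4, 32, 128, device="cuda")   # NUM_LAYERS * 2 directions
c   = torch.empty(4, 32, 128, device="cuda")

context.set_input_shape("input", tuple(x.shape))
for name, buf in (("input", x), ("output", out), ("h_out", h), ("c_out", c)):
    context.set_tensor_address(name, buf.data_ptr())

stream = torch.cuda.current_stream().cuda_stream
torch.cuda.synchronize(); t0 = time.perf_counter()
for _ in range(1000):
    context.execute_async_v3(stream)
torch.cuda.synchronize()
print(f"{time.perf_counter() - t0:.3f} s / 1000 batches")
```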
🔍 Profiling Observation
Using NVIDIA Nsight Systems, I observed:
- PyTorch runs the whole LSTM in a single fused (persistent) kernel: RNN_blockPersist_fp_LSTM
- TensorRT decomposes the model into many small kernels instead of launching one fused LSTM kernel
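For reproduction, a standard Nsight Systems capture along these lines is enough to see the kernel names (script names are placeholders):

```bash
nsys profile --trace=cuda,osrt -o lstm_pytorch python bench_pytorch.py
nsys profile --trace=cuda,osrt -o lstm_trt python bench_trt.py
```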
❓ Questions
- Is it expected that TensorRT does not fuse the LSTM into a single kernel like RNN_blockPersist_fp_LSTM?
- Are there flags or version requirements to enable such fusion?
- Is this a known limitation with ONNX -> TensorRT conversion for LSTM?