
LSTM model converted to TensorRT is slower than PyTorch on RTX 4090

Open jds250 opened this issue 4 months ago • 6 comments

System Information

  • OS: Ubuntu 22.04
  • GPU: NVIDIA RTX 4090
  • TensorRT Version: 10.11.0.33
  • PyTorch Version: 2.7.0
  • ONNX Opset: 14

🧠 Problem Summary

I exported a very basic bidirectional LSTM model from PyTorch to ONNX, and then built a TensorRT engine with trtexec. However, inference with the TensorRT engine is roughly 2x slower than eager PyTorch, which is unexpected.

  • PyTorch: ~0.5ms per forward pass
  • TensorRT: ~1ms per forward pass

📦 Model Description

import torch
import torch.nn as nn

# Model config
INPUT_SIZE   = 293
HIDDEN_SIZE  = 128
NUM_LAYERS   = 2
BIDIRECTION  = True
BATCH_FIRST  = True
DROPOUT      = 0.0

# Model instantiation
lstm = nn.LSTM(INPUT_SIZE, HIDDEN_SIZE, NUM_LAYERS,
               bidirectional=BIDIRECTION,
               batch_first=BATCH_FIRST,
               dropout=DROPOUT)
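
For reference, a quick shape check (standard nn.LSTM semantics: bidirection doubles the output width, and the state tensors carry NUM_LAYERS * 2 directions):

out, (h, c) = lstm(torch.randn(32, 60, INPUT_SIZE))
print(out.shape)  # torch.Size([32, 60, 256]): batch, time, 2 * HIDDEN_SIZE
print(h.shape)    # torch.Size([4, 32, 128]):  NUM_LAYERS * 2 directions, batch, HIDDEN_SIZE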

🔄 Conversion Steps

  1. Export to ONNX:
dummy = torch.randn(32, 60, INPUT_SIZE)

torch.onnx.export(
    lstm, dummy, "pyannet_lstm.onnx",
    opset_version=14,
    input_names=["input"],
    output_names=["output", "h_out", "c_out"],
    dynamic_axes={
        "input":  {0: "batch", 1: "time"},
        "output": {0: "batch", 1: "time"},
        "h_out":  {1: "batch"},
        "c_out":  {1: "batch"}
    }
)
  2. Build TensorRT engine:
trtexec \
  --onnx=pyannet_lstm.onnx \
  --minShapes=input:1x60x293 \
  --optShapes=input:32x60x293 \
  --maxShapes=input:32x60x293 \
  --saveEngine=lstm.engine
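
As a cross-check on the TensorRT side, the saved engine can also be timed with trtexec itself (--warmUp takes milliseconds):

trtexec \
  --loadEngine=lstm.engine \
  --shapes=input:32x60x293 \
  --warmUp=200 \
  --iterations=1000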

🧪 Performance Benchmark

PyTorch benchmark code:

import time

x = torch.randn(32, 60, 293).cuda()
lstm.to("cuda").eval()

with torch.no_grad():
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(1000):
        output, (h, c) = lstm(x)
    torch.cuda.synchronize()  # wait for queued GPU work before stopping the timer
print(f"{time.perf_counter() - t0:.3f} ms per batch")  # total seconds over 1000 iters == ms/iter
  • Average PyTorch time per batch: ~0.5ms
  • Average TensorRT time per batch: ~1.0ms
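
For reference, a minimal sketch of how the engine can be timed with the TensorRT 10 Python runtime (tensor names and shapes follow the ONNX export above; buffers are plain fp32 torch tensors):

import time
import tensorrt as trt
import torch

logger = trt.Logger(trt.Logger.WARNING)
with open("lstm.engine", "rb") as f:
    engine = trt.Runtime(logger).deserialize_cuda_engine(f.read())
ctx = engine.create_execution_context()

bufs = {
    "input":  torch.randn(32, 60, 293, device="cuda"),
    "output": torch.empty(32, 60, 256, device="cuda"),  # 2 directions * HIDDEN_SIZE
    "h_out":  torch.empty(4, 32, 128, device="cuda"),   # NUM_LAYERS * 2, batch, HIDDEN_SIZE
    "c_out":  torch.empty(4, 32, 128, device="cuda"),
}
ctx.set_input_shape("input", (32, 60, 293))
for name, t in bufs.items():
    ctx.set_tensor_address(name, t.data_ptr())

stream = torch.cuda.current_stream().cuda_stream
for _ in range(10):  # warm-up
    ctx.execute_async_v3(stream)
torch.cuda.synchronize()
t0 = time.perf_counter()
for _ in range(1000):
    ctx.execute_async_v3(stream)
torch.cuda.synchronize()
print(f"{time.perf_counter() - t0:.3f} ms per batch")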

🔍 Profiling Observation

Using NVIDIA Nsight Systems, I observed:

  • PyTorch uses a fused kernel: RNN_blockPersist_fp_LSTM
  • TensorRT appears to decompose the LSTM into many small kernels instead of launching a single fused LSTM kernel
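
To see the decomposition without Nsight, the engine's layer list can be dumped with the engine inspector (TensorRT >= 8.2; a minimal sketch):

import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
with open("lstm.engine", "rb") as f:
    engine = trt.Runtime(logger).deserialize_cuda_engine(f.read())

# Per-layer information; building with --profilingVerbosity=detailed
# makes the output include full layer names and fusion details.
inspector = engine.create_engine_inspector()
print(inspector.get_engine_information(trt.LayerInformationFormat.JSON))

trtexec also has a --dumpLayerInfo flag that prints similar information from the command line.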

❓ Questions

  • Is it expected that TensorRT does not fuse the LSTM into a single kernel like RNN_blockPersist_fp_LSTM?
  • Are there flags or version requirements to enable such fusion?
  • Is this a known limitation with ONNX -> TensorRT conversion for LSTM?

jds250 · Jun 16 '25 11:06