                        LSTM model converted to TensorRT is slower than PyTorch on RTX 4090
System Information
- OS: Ubuntu 22.04
- GPU: NVIDIA RTX 4090
- TensorRT Version: 10.11.0.33
- PyTorch Version: 2.7.0
- ONNX Opset: 14
🧠 Problem Summary
I converted a very basic bidirectional LSTM model from PyTorch to ONNX, and then to TensorRT using trtexec. However, inference with the TensorRT engine is slower than PyTorch, which is unexpected.
- PyTorch: ~0.5 ms per forward pass
- TensorRT: ~1.0 ms per forward pass
📦 Model Description
```python
import torch
import torch.nn as nn

# Model config
INPUT_SIZE   = 293
HIDDEN_SIZE  = 128
NUM_LAYERS   = 2
BIDIRECTION  = True
BATCH_FIRST  = True
DROPOUT      = 0.0

# Model instantiation
lstm = nn.LSTM(INPUT_SIZE, HIDDEN_SIZE, NUM_LAYERS,
               bidirectional=BIDIRECTION,
               batch_first=BATCH_FIRST,
               dropout=DROPOUT)
```
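For reference, the expected I/O shapes (these follow directly from PyTorch's LSTM semantics for a batch_first, bidirectional model):

```python
out, (h, c) = lstm(torch.randn(32, 60, INPUT_SIZE))
print(out.shape)  # (32, 60, 256): batch, time, 2 directions * HIDDEN_SIZE
print(h.shape)    # (4, 32, 128):  NUM_LAYERS * 2 directions, batch, HIDDEN_SIZE
print(c.shape)    # (4, 32, 128):  same layout as h
```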
🔄 Conversion Steps
- Export to ONNX:
```python
dummy = torch.randn(32, 60, INPUT_SIZE)
torch.onnx.export(
    lstm, dummy, "pyannet_lstm.onnx",
    opset_version=14,
    input_names=["input"],
    output_names=["output", "h_out", "c_out"],
    dynamic_axes={
        "input":  {0: "batch", 1: "time"},
        "output": {0: "batch", 1: "time"},
        "h_out":  {1: "batch"},
        "c_out":  {1: "batch"},
    },
)
```
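To rule out a decomposition happening already at export time, the ONNX graph can be inspected; at opset 14 an nn.LSTM should export as native ONNX LSTM nodes (a quick check, assuming the onnx package is installed):

```python
import onnx

model = onnx.load("pyannet_lstm.onnx")
onnx.checker.check_model(model)
print(sorted({node.op_type for node in model.graph.node}))
# 'LSTM' should appear here; a long list of MatMul/Sigmoid/Tanh nodes
# would mean the graph was unrolled before TensorRT ever saw it.
```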
- Build TensorRT engine:
```bash
trtexec \
  --onnx=pyannet_lstm.onnx \
  --minShapes=input:1x60x293 \
  --optShapes=input:32x60x293 \
  --maxShapes=input:32x60x293 \
  --saveEngine=lstm.engine
```
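trtexec can also report per-layer timings for the built engine, which makes the kernel breakdown visible without a full Nsight capture (standard trtexec flags):

```bash
trtexec \
  --loadEngine=lstm.engine \
  --shapes=input:32x60x293 \
  --dumpProfile --separateProfileRun
```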
🧪 Performance Benchmark
PyTorch benchmark code (with explicit synchronization, so the clock covers the queued GPU work rather than just the kernel launches):

```python
import time, torch

x = torch.randn(32, 60, 293).cuda()
lstm.to("cuda").eval()
with torch.no_grad():
    torch.cuda.synchronize(); t0 = time.perf_counter()
    for _ in range(1000):
        output, (h, c) = lstm(x)
    torch.cuda.synchronize()  # flush queued kernels before stopping the clock
print(f"{time.perf_counter() - t0:.3f} s / 1000 batches")
```
- Average PyTorch time per batch: ~0.5 ms
- Average TensorRT time per batch: ~1.0 ms (an equivalent engine-side timing loop is sketched below)
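A minimal sketch of that engine-side timing loop, using the TensorRT 10 Python API (tensor names follow the ONNX export above; treat the exact buffer handling as an assumption, not the only way to run the engine):

```python
import time
import tensorrt as trt
import torch

logger = trt.Logger(trt.Logger.WARNING)
with open("lstm.engine", "rb") as f:
    engine = trt.Runtime(logger).deserialize_cuda_engine(f.read())
context = engine.create_execution_context()

x   = torch.randn(32, 60, 293, device="cuda")
out = torch.empty(32, 60, 256, device="cuda")  # 2 directions * HIDDEN_SIZE
h   = torch.empty(4, 32, 128, device="cuda")   # NUM_LAYERS * 2 directions
c   = torch.empty(4, 32, 128, device="cuda")

context.set_input_shape("input", tuple(x.shape))
for name, buf in (("input", x), ("output", out), ("h_out", h), ("c_out", c)):
    context.set_tensor_address(name, buf.data_ptr())

stream = torch.cuda.current_stream().cuda_stream
torch.cuda.synchronize(); t0 = time.perf_counter()
for _ in range(1000):
    context.execute_async_v3(stream)
torch.cuda.synchronize()
print(f"{time.perf_counter() - t0:.3f} s / 1000 batches")
```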
🔍 Profiling Observation
Using NVIDIA Nsight Systems, I observed:
- PyTorch runs the whole LSTM in a single fused (persistent) kernel: RNN_blockPersist_fp_LSTM
- TensorRT decomposes the model into many small kernels instead of launching one fused LSTM kernel
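For reproduction, a standard Nsight Systems capture along these lines is enough to see the kernel names (script names are placeholders):

```bash
nsys profile --trace=cuda,osrt -o lstm_pytorch python bench_pytorch.py
nsys profile --trace=cuda,osrt -o lstm_trt python bench_trt.py
```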
❓ Questions
- Is it expected that TensorRT does not fuse the LSTM into a single kernel like RNN_blockPersist_fp_LSTM?
- Are there flags or version requirements to enable such fusion?
- Is this a known limitation with ONNX -> TensorRT conversion for LSTM?