
❓ [Question] compiled ExportedProgram is slower than uncompiled model

Open • Qi-Zha0 opened this issue 1 year ago • 1 comment

❓ Question

I tried compiling a few models with torch_tensorrt.compile(model, inputs, ir='dynamo', ...), and each of them was slower than the corresponding uncompiled model. Am I using torch_tensorrt incorrectly?

What you have already tried

A minimum example:

import torch
import torch_tensorrt
import time

model = torch.hub.load('pytorch/vision:v0.10.0', 'mobilenet_v2', pretrained=True)
model.eval().cuda()

inputs = [
    torch_tensorrt.Input(
        shape=torch.Size((1, 3, 480, 640)),
        dtype=torch.float,
    )
]
trt_model = torch_tensorrt.compile(
    model,
    inputs=inputs,
    ir='dynamo',
    truncate_long_and_double=True,
    enabled_precisions={torch.half},
    opt_level='max',
)

The inference time was measured as below:

x = torch.rand((1, 3, 480, 640)).cuda() - 0.5

# warm up 
for _ in range(10):
  trt_model(x)

total_time = 0
for _ in range(20):
  start = time.time()
  out = trt_model(x)
  total_time += time.time() - start
print(total_time / 20)
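
(A side note on measurement: time.time() around an asynchronous CUDA launch can mis-count, since the kernel may still be running when the call returns. A sketch of device-synchronized timing using standard PyTorch CUDA events, reusing trt_model and x from above:)

starter = torch.cuda.Event(enable_timing=True)
ender = torch.cuda.Event(enable_timing=True)

times = []
with torch.no_grad():
    for _ in range(20):
        starter.record()
        trt_model(x)
        ender.record()
        torch.cuda.synchronize()  # wait for all queued kernels to finish
        times.append(starter.elapsed_time(ender))  # elapsed time in milliseconds
print(sum(times) / len(times))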

On average, the uncompiled model's inference time is about 4 ms, while the compiled model takes about 9 ms.
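
(One plausible cause of a slowdown like this, an assumption rather than something confirmed from logs, is graph fragmentation: if some ops are unsupported and fall back to PyTorch, execution crosses the TensorRT/PyTorch boundary repeatedly. A sketch of how one might probe this; min_block_size is a documented dynamo setting, and the debug flag is assumed to be available in this version:)

# sketch: discourage tiny TensorRT segments and print partitioning details
trt_model = torch_tensorrt.compile(
    model,
    inputs=inputs,
    ir='dynamo',
    truncate_long_and_double=True,
    enabled_precisions={torch.half},
    min_block_size=5,  # minimum ops per TensorRT segment; raise to avoid tiny segments
    debug=True,        # assumed flag: log how the graph was partitioned and converted
)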

Environment

Build information about Torch-TensorRT can be found by turning on debug messages

  • PyTorch Version (e.g., 1.0): 2.2.1
  • CPU Architecture: x86_64
  • OS (e.g., Linux): Linux
  • How you installed PyTorch (conda, pip, libtorch, source): pip install torch torch_tensorrt
  • Build command you used (if compiling from source):
  • Are you using local sources or building from archives:
  • Python version: 3.11
  • CUDA version: 12.3
  • GPU models and configuration: NVIDIA GeForce RTX 4050
  • Any other relevant information:

Additional context

Qi-Zha0 · Mar 28 '24

Update: In comparison, trt_model = torch.compile(model, backend='tensorrt', options={'truncate_long_and_double': True, 'enabled_precisions': {torch.half}}) reduces the inference time to 0.3 ms. JIT compilation is infeasible for my use case, though, since recompiling on every run would take too long.
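
(If recompilation time is the blocker, compiling once ahead of time and serializing the result avoids it. This is a sketch assuming a torch_tensorrt version that ships torch_tensorrt.save, which newer releases document; torch.export.load is standard PyTorch:)

# sketch: compile once, save the serialized program, reload without recompiling
example_input = torch.rand((1, 3, 480, 640)).cuda()
trt_ep = torch_tensorrt.compile(model, inputs=inputs, ir='dynamo',
                                enabled_precisions={torch.half})
torch_tensorrt.save(trt_ep, 'trt_model.ep', inputs=[example_input])  # assumed API

loaded = torch.export.load('trt_model.ep').module()  # no recompilation on load
out = loaded(example_input)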

Update 2: Using the dynamo IR with output_format='fx' does accelerate the model. Was I using the ExportedProgram incorrectly?
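
(Concretely, this is the invocation the update refers to; output_format='fx' is taken verbatim from above, and which output formats a given version accepts is not verified here:)

trt_model = torch_tensorrt.compile(
    model,
    inputs=inputs,
    ir='dynamo',
    truncate_long_and_double=True,
    enabled_precisions={torch.half},
    output_format='fx',  # returns an FX GraphModule instead of an ExportedProgram
)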

Qi-Zha0 · Mar 28 '24