TensorRT ❓ [Question] How do you properly deploy a quantized model with tensorrt

❓ Question

I have a PTQ model and a QAT model, both produced with the official PyTorch quantization API by following the quantization tutorial, and I want to deploy them on TensorRT for inference. The model is MetaFormer-like and uses convolution layers as the token mixer. One part of the quantized model looks like this:

What you have already tried

I have tried different ways to make things work:

The torch2trt package: there is a major problem with dynamic input. The dataset consists of different inputs (B, C, H, W) where H and W are not necessarily the same. There is a torch2trt-dynamic package, but I think there are bugs in its plugins. The code basically looks like this:

  model_trt = torch2trt(
      model_fp32,
      [torch.randn(1, 11, 64, 64).to('cuda')],
      max_batch_size=batch_size,
      fp16_mode=False,
      int8_mode=True,
      calibrator=trainLoader,
      input_shapes=[(None, 11, None, None)]
  )
torch.compile() with backend="tensorrt". Compiling the PTQ model fails with:

  RuntimeError: quantized::conv2d (ONEDNN): data type of input should be QUint8.

Compiling the QAT model fails with:

  W1029 14:21:17.640402 139903289382080 torch/_dynamo/utils.py:1195] [2/0] Unsupported: quantized nyi in meta tensors with fake tensor propagation.

Here's the code I used:

  trt_gm = torch.compile(model, dynamic=True, backend="tensorrt")
Converting the torch model to an ONNX model, then converting that into a TRT engine. There are several problems in this case:

The ONNX model runs unusually slowly with ONNX Runtime. Furthermore, the computed loss is extremely high. Here's an example:
I tried to visualize the quantized ONNX model with Netron because converting the quantized ONNX model to a TRT engine always raises an error. This is the problematic part of the graph: the rightmost DequantizeLinear node is causing the problem. I checked its x input and found that it is an int32 constant array, while x_scale is a float32 constant array. The output of this node turned out to be the bias passed into the Conv layer. Something seems to be wrong in the conversion. When quantizing with the PyTorch API, only activations and weights were observed by the defined observers, so I expected only the leftmost and the middle DequantizeLinear nodes, with the bias stored in fp32 and passed directly into the Conv layer. Using onnx_simplified does not get rid of the node.

Because of this incompatibility in converting the quantized torch model to ONNX, I am not able to convert the model further into a TRT engine. I have considered using the ONNX API for quantization, but the performance drop from the unquantized original torch model to the ONNX model is quite concerning. The conversion code looks like this:

  torch.onnx.export(
      quantized_model,
      dummy_input,
      args.onnx_export_path,
      input_names=["input"],
      output_names=["output"],
      opset_version=13,
      export_params=True,
      keep_initializers_as_inputs=False,
      dynamic_axes={
          'input': {0: 'batch_size', 2: "h", 3: "w"},
          'output': {0: 'batch_size', 2: "h", 3: "w"}
      }
  )
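One possible reading of the int32 DequantizeLinear node: a common convention for quantized convolutions is to store the bias as int32 with scale x_scale * w_scale and zero point 0, and to dequantize it again right before the Conv. Whether the PyTorch exporter did exactly this for the model in question is an assumption, and the scales and bias values below are made up for illustration. A minimal sketch of the arithmetic in plain Python (the helper names are hypothetical):

```python
def quantize_bias(bias, x_scale, w_scale):
    """Quantize a float bias to integers with scale = x_scale * w_scale, zero point 0."""
    bias_scale = x_scale * w_scale
    return [round(b / bias_scale) for b in bias], bias_scale


def dequantize_linear(x, x_scale, x_zero_point=0):
    """ONNX DequantizeLinear semantics: y = (x - x_zero_point) * x_scale."""
    return [(q - x_zero_point) * x_scale for q in x]


# Illustrative values only; they are not taken from the model in this thread.
x_scale, w_scale = 0.5, 0.25
bias_fp32 = [1.0, -0.375, 0.3]

q_bias, bias_scale = quantize_bias(bias_fp32, x_scale, w_scale)
bias_back = dequantize_linear(q_bias, bias_scale)

print(q_bias)     # integer bias as it would be stored in the graph
print(bias_back)  # float bias that reaches the Conv node
```

If the graph follows this pattern, the node's output is the float bias rounded to multiples of x_scale * w_scale, which would be consistent with the observation that it feeds the Conv bias input.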

Environment

Build information about Torch-TensorRT can be found by turning on debug messages

PyTorch Version: 2.3.1
CPU Architecture: x86_64
OS: Ubuntu 20.04.4 LTS
How you installed PyTorch (conda, pip, libtorch, source): conda
Are you using local sources or building from archives: No
Python version: 3.9.19
CUDA version: 12.1
GPU models and configuration:
Torch_TensorRT: 2.3.0
torch2trt: 0.5.0
onnx:1.16.1

Additional context

Personally, I think the torch.compile() API is the most likely to let me convert the quantized model successfully, since there's no performance drop. Does anyone have relevant experience with handling quantized models?

Oct 29 '24 15:10 Urania880519

Did you follow this tutorial? https://pytorch.org/TensorRT/tutorials/_rendered_examples/dynamo/vgg16_ptq.html

Oct 29 '24 18:10 narendasan

@narendasan I've followed both the tutorial you provided and this one: https://pytorch.org/TensorRT/user_guide/dynamic_shapes.html#dynamic-shapes However, there's this error after finishing calibration (the calibration seemed successful and the loss was quite low). This is the code I used:

  # Imports assumed to match the vgg16_ptq tutorial linked above.
  import torch
  import torch_tensorrt as torchtrt
  import modelopt.torch.quantization as mtq
  from modelopt.torch.quantization.utils import export_torch_mode
  from torch.export._trace import _export

  # model, calibrate_loop, and channels are defined elsewhere.
  quant_cfg = mtq.INT8_DEFAULT_CFG
  mtq.quantize(model, quant_cfg, forward_loop=calibrate_loop)
  with torch.no_grad():
      with export_torch_mode():
          input_tensor = torch.randn((1, channels, 35, 35), dtype=torch.float32).to('cuda')
          # Height and width are dynamic within [25, 64].
          height_dim = torch.export.Dim("height_dim", min=25, max=64)
          width_dim = torch.export.Dim("width_dim", min=25, max=64)
          dynamic_shapes = ({2: height_dim, 3: width_dim},)
          exp_program = _export(model, (input_tensor,), dynamic_shapes=dynamic_shapes)
          trt_Qmodel = torchtrt.dynamo.compile(
              exp_program,
              inputs=[input_tensor],
              enabled_precisions={torch.int8},
              min_block_size=1,
              debug=False,
              assume_dynamic_shape_support=True,
          )

Oct 30 '24 14:10 Urania880519

@lanluo-nvidia or @peri044 can you provide additional guidance here?

Nov 01 '24 19:11 narendasan

@Urania880519
If you could paste the full code, I can try to reproduce it on my side to find the exact issue you are facing. Also, INT8 quantization support was introduced in version 2.5.0, so please try with PyTorch 2.5.0 and torch_tensorrt 2.5.0 if you can.

In terms of dynamic shape support in torch_tensorrt, if you have custom dynamic shape constraints, please refer to this tutorial, which goes through torch.export.export(): https://pytorch.org/TensorRT/user_guide/dynamic_shapes.html

Nov 03 '24 20:11 lanluo-nvidia

@lanluo-nvidia Thanks for helping!! The thing finally worked with the code below:

quant_cfg = mtq.INT8_DEFAULT_CFG
mtq.quantize(model, quant_cfg, forward_loop=calibrate_loop)
with torch.no_grad():
    with export_torch_mode():
        # The dynamic shape settings in my case
        inputs = [
            torchtrt.Input(
                min_shape=(1, channels, 25, 25),
                opt_shape=(16, channels, 35, 35),
                max_shape=(32, channels, 64, 64),
                dtype=torch.float32,
            )
        ]
        trt_Qmodel = torchtrt.compile(model, ir="dynamo", inputs=inputs)

However, to deploy the model with torch_tensorrt, I have to run mtq.quantize() every time. I'm not sure why, so there are still unsolved issues if I want to deploy quantized torch models on TensorRT.
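An engine built from a min_shape/opt_shape/max_shape range like the one above only accepts inputs whose every dimension lies between the min and max values. A small plain-Python check can catch out-of-range batches before they reach the engine. This is only a sketch: the helper name `shape_in_profile` is hypothetical, and `channels = 11` is an assumption borrowed from the torch2trt attempt earlier in the thread.

```python
def shape_in_profile(shape, min_shape, max_shape):
    """Return True if every dimension of `shape` lies within [min, max]."""
    if not (len(shape) == len(min_shape) == len(max_shape)):
        return False
    return all(lo <= d <= hi for d, lo, hi in zip(shape, min_shape, max_shape))


channels = 11  # assumed value, for illustration only
min_shape = (1, channels, 25, 25)
max_shape = (32, channels, 64, 64)

# A batch of 8 inputs of size 40x30 fits the range.
ok = shape_in_profile((8, channels, 40, 30), min_shape, max_shape)
# Height 80 exceeds the maximum of 64.
too_large = shape_in_profile((8, channels, 80, 30), min_shape, max_shape)
```

Inputs that fail the check would need to be resized or padded, or the range widened and the model recompiled.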

Mar 03 '25 06:03 Urania880519

You should be able to save the quantized model at whatever point is most convenient to you and resume the process from there. So you could save after quantization or you could save after compilation.
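As a rough sketch of the "save after compilation" option: Torch-TensorRT provides torch_tensorrt.save(), and a module saved in its default exported-program format can be reloaded with torch.export.load(). The helper names, the default file name, and the example inputs below are hypothetical, and whether this round-trips cleanly for the dynamic-shape INT8 model in this thread has not been verified here.

```python
def save_compiled(trt_module, example_inputs, path="trt_qmodel.ep"):
    """Serialize a Torch-TensorRT compiled module as an ExportedProgram."""
    # Imported lazily: needs a working CUDA + TensorRT installation.
    import torch_tensorrt as torchtrt

    # example_inputs: a list of torch.Tensor used to trace the module for saving.
    torchtrt.save(trt_module, path, inputs=example_inputs)
    return path


def load_compiled(path="trt_qmodel.ep"):
    """Reload the saved program without re-running quantization or compilation."""
    import torch

    return torch.export.load(path).module()
```

With this, a deployment script would call load_compiled() and run inference directly, and mtq.quantize() would only be needed in the script that builds and saves the model.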

Mar 03 '25 22:03 narendasan
