
Inserting QDQ has severely impacted the performance of the unquantized Myelin part.

[Open] zsh4614 opened this issue on Dec 23, 2024 · 3 comments

Description

I am performing QAT quantization on a complex model. When I insert Q/DQ nodes into the ResNet portion that I want to quantize, following the recommended placement rules, TensorRT runs that part in INT8 after the engine is built. How can I ensure that the parts without Q/DQ nodes run at optimal performance in non-INT8 precision (FP16 + FP32)? I noticed that after inserting Q/DQ nodes into one part of the network, the performance of the unquantized parts degrades compared to a pure FP16 build.
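For context, this is roughly how I insert the Q/DQ nodes; a minimal sketch assuming NVIDIA's pytorch-quantization toolkit, where `build_backbone` and `MyComplexModel` are placeholders for my real network:

```python
# Sketch: insert fake-quant only into the ResNet backbone using NVIDIA's
# pytorch-quantization toolkit, then export with ONNX Q/DQ nodes.
# `build_backbone` and `MyComplexModel` are placeholders.
import torch
from pytorch_quantization import nn as quant_nn
from pytorch_quantization import quant_modules

# Monkey-patch torch.nn (Conv2d -> QuantConv2d, etc.) only while the backbone
# is constructed, so the other submodules keep their ordinary FP layers.
quant_modules.initialize()
backbone = build_backbone()        # ResNet part: gets quantized layers
quant_modules.deactivate()
model = MyComplexModel(backbone)   # rest of the model: no Q/DQ

# ... QAT fine-tuning / calibration happens here ...

# Export fake-quant as ONNX QuantizeLinear/DequantizeLinear (needs opset >= 13).
quant_nn.TensorQuantizer.use_fb_fake_quant = True
dummy = torch.randn(1, 3, 224, 224)
torch.onnx.export(model.eval(), dummy, "model_qdq.onnx", opset_version=13)
```

Keeping the `initialize()`/`deactivate()` window tight around the backbone construction is what limits quantization to that part of the network.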

I conducted an experiment where I inserted Q/DQ only before a single convolution layer and obtained the build result. [screenshot: layer graph of the engine built with the single Q/DQ pair]

The result of building the same network in FP16 mode: [screenshot: layer graph of the pure FP16 engine]

Why does the part within the green box perform differently?

Another question: even though the inputs and outputs of the Myelin region are exactly the same in the two built engines, its execution time differs significantly. [screenshot: per-layer profile of the Q/DQ engine]

FP16 mode: [screenshot: per-layer profile of the FP16 engine]
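To make the comparison concrete, I time the engines layer by layer. A minimal sketch using `trt.IProfiler` from the TensorRT Python API; it assumes a static-shape engine, and the engine path is a placeholder:

```python
# Sketch: per-layer timing via trt.IProfiler, to see where the two engines
# spend their time (in particular the Myelin region). Assumes a static-shape
# engine; "model_qdq.engine" is a placeholder path.
import numpy as np
import pycuda.autoinit  # noqa: F401  -- creates a CUDA context
import pycuda.driver as cuda
import tensorrt as trt

class LayerTimer(trt.IProfiler):
    """Accumulates per-layer execution time reported by TensorRT."""
    def __init__(self):
        super().__init__()
        self.ms = {}
    def report_layer_time(self, layer_name, ms):
        self.ms[layer_name] = self.ms.get(layer_name, 0.0) + ms

logger = trt.Logger(trt.Logger.WARNING)
with open("model_qdq.engine", "rb") as f, trt.Runtime(logger) as runtime:
    engine = runtime.deserialize_cuda_engine(f.read())

context = engine.create_execution_context()
timer = LayerTimer()
context.profiler = timer

# Allocate one device buffer per binding (TRT 8.5 bindings API).
buffers = []
for i in range(engine.num_bindings):
    dtype = trt.nptype(engine.get_binding_dtype(i))
    size = trt.volume(engine.get_binding_shape(i))
    buffers.append(cuda.mem_alloc(size * np.dtype(dtype).itemsize))

context.execute_v2([int(b) for b in buffers])  # triggers report_layer_time

# Print the 20 most expensive layers.
for name, t in sorted(timer.ms.items(), key=lambda kv: -kv[1])[:20]:
    print(f"{t:8.3f} ms  {name}")
```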

I'm confused about how to ensure that the unquantized parts of my model still run optimally in FP16 or FP32.
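For reference, the build configuration I would expect to need looks roughly like this; a sketch against the TensorRT 8.5 Python API, with placeholder file paths:

```python
# Sketch: build the Q/DQ ONNX with both INT8 and FP16 enabled, so layers
# fenced by Q/DQ run in INT8 while the remaining layers are free to pick
# FP16 kernels. File paths are placeholders.
import tensorrt as trt

logger = trt.Logger(trt.Logger.INFO)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)
with open("model_qdq.onnx", "rb") as f:
    assert parser.parse(f.read()), parser.get_error(0)

config = builder.create_builder_config()
config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 30)
config.set_flag(trt.BuilderFlag.INT8)   # honor the Q/DQ placement
config.set_flag(trt.BuilderFlag.FP16)   # allow FP16 for unquantized layers

# Optionally pin individual unquantized layers and force TRT to obey:
# config.set_flag(trt.BuilderFlag.OBEY_PRECISION_CONSTRAINTS)
# network.get_layer(i).precision = trt.float16

with open("model_qdq.engine", "wb") as f:
    f.write(builder.build_serialized_network(network, config))
```

I'm not sure whether per-layer precision constraints even reach layers that get fused into the Myelin region, which is part of my confusion.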

Environment

TensorRT Version: 8.5.2

NVIDIA GPU: orin / 3090

NVIDIA Driver Version:

CUDA Version: 11.4

CUDNN Version:

Operating System:

Python Version (if applicable):

Tensorflow Version (if applicable):

PyTorch Version (if applicable):

Baremetal or Container (if so, version):

Relevant Files

Model link:

Steps To Reproduce

Commands or scripts:

Have you tried the latest release?:

Can this model run on other frameworks? For example run ONNX model with ONNXRuntime (polygraphy run <model.onnx> --onnxrt):

zsh4614 · Dec 23 '24 07:12