Inserting Q/DQ nodes severely degrades the performance of the unquantized Myelin part.
Description
I am performing quantization-aware training (QAT) on a complex model. When I insert Q/DQ nodes into the ResNet portion I want to quantize, following the placement rules, TensorRT runs that part in INT8 after the engine is built. How can I ensure that the parts without Q/DQ nodes run at optimal performance in non-INT8 precision (FP16 + FP32)? I noticed that after inserting Q/DQ nodes into one part of the network, the performance of the unquantized parts decreases compared to a pure-FP16 build.
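For reference, I build the Q/DQ engine roughly like this (a minimal sketch of my build step; `model_qdq.onnx` is a placeholder for my actual export), with both the INT8 and FP16 flags set so that, in principle, TensorRT is free to run the layers without Q/DQ nodes in FP16:

```python
import tensorrt as trt

# Minimal sketch of the build step (TensorRT 8.5 Python API).
# "model_qdq.onnx" is a placeholder for the ONNX export with Q/DQ nodes.
logger = trt.Logger(trt.Logger.INFO)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)
with open("model_qdq.onnx", "rb") as f:
    assert parser.parse(f.read()), parser.get_error(0)

config = builder.create_builder_config()
# INT8 enables the Q/DQ-wrapped layers; FP16 should leave TensorRT free
# to run every layer without Q/DQ nodes in FP16 rather than FP32.
config.set_flag(trt.BuilderFlag.INT8)
config.set_flag(trt.BuilderFlag.FP16)
engine_bytes = builder.build_serialized_network(network, config)
```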
I ran an experiment in which I inserted Q/DQ nodes only before a single convolution layer and compared the build results (the set-up is sketched after this comparison).
For comparison, here is the result of building the same network in pure FP16 mode.
Why does the part inside the green box perform differently between the two builds?
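The single-conv experiment was set up roughly as follows. This is a sketch using pytorch-quantization on a stand-in resnet18; my real model, the layer I quantized, and the amax values differ (dummy amax values stand in for the real calibration/QAT statistics just so the sketch is self-contained):

```python
import torch
from torchvision.models import resnet18
from pytorch_quantization import nn as quant_nn
from pytorch_quantization.tensor_quant import QuantDescriptor

# Stand-in network; only one convolution is replaced with its quantized
# counterpart, so Q/DQ nodes appear only in front of that layer.
model = resnet18().eval()
model.layer1[0].conv1 = quant_nn.QuantConv2d(
    64, 64, kernel_size=3, padding=1, bias=False,
    quant_desc_input=QuantDescriptor(num_bits=8, amax=4.0),   # dummy amax
    quant_desc_weight=QuantDescriptor(num_bits=8, amax=1.0),  # dummy amax
)

# Export real QuantizeLinear/DequantizeLinear nodes instead of
# fake-quant ops, so the ONNX graph carries Q/DQ for this one conv only.
quant_nn.TensorQuantizer.use_fb_fake_quant = True
torch.onnx.export(model, torch.randn(1, 3, 224, 224), "model_qdq.onnx",
                  opset_version=13)
```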
Another question: even though the inputs and outputs of the Myelin region are exactly the same in the two exported engines, its execution time differs significantly.
FP16 mode:
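For the timing comparison, per-layer times can be collected by attaching a profiler to each engine's execution context (a minimal sketch using the TensorRT Python API; engine deserialization and input/output setup are omitted):

```python
import tensorrt as trt

class LayerTimer(trt.IProfiler):
    """Accumulates per-layer times so the two engines can be compared."""
    def __init__(self):
        super().__init__()
        self.times = {}

    def report_layer_time(self, layer_name, ms):
        self.times[layer_name] = self.times.get(layer_name, 0.0) + ms

# Attach to the execution context of each engine, run inference, then
# diff the two LayerTimer.times dicts layer by layer:
# context.profiler = LayerTimer()
```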
I'm still unclear on how to guarantee that the unquantized parts of my model run at full FP16/FP32 performance.
Environment
TensorRT Version: 8.5.2
NVIDIA GPU: Orin / RTX 3090
NVIDIA Driver Version:
CUDA Version: 11.4
CUDNN Version:
Operating System:
Python Version (if applicable):
Tensorflow Version (if applicable):
PyTorch Version (if applicable):
Baremetal or Container (if so, version):
Relevant Files
Model link:
Steps To Reproduce
Commands or scripts:
Have you tried the latest release?:
Can this model run on other frameworks? For example run ONNX model with ONNXRuntime (polygraphy run <model.onnx> --onnxrt):