TensorRT
Explicit quantization is slower than implicit quantization and produces invalid results
Description
Since implicit quantization is deprecated, I started migrating my model pipeline to explicit quantization. However, I encountered some issues:
- Different behaviour with concat:
With implicit quantization the graph looks like this:
A(fp16:linear) -> Concat
B(fp16:linear) -> Concat
C(fp16:linear) -> Concat -> Quantize+Reformat -> Conv
In effect, Concat is replaced with a plain copy, since all of its inputs are already aligned.
However, with explicit quantization the graph becomes:
A(fp16:linear) -> Quantize -> Concat
B(fp16:linear) -> Quantize -> Concat
C(fp16:linear) -> Quantize -> Concat -> Reformat -> Conv
TRT moved the Quantize nodes ahead of the Concat, which produces a suboptimal graph that is ~30% slower. No matter what I tried, I could not reproduce the implicit-quantization plan with an explicitly quantized model; the placement I am aiming for is sketched below.
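For reference, this is the pattern I want TRT to plan efficiently: a single Q/DQ pair after the Concat instead of one per input. A minimal ONNX sketch (all names, shapes, and scale values here are made up for illustration, not taken from my real model):

```python
import numpy as np
import onnx
from onnx import TensorProto, helper

# Three FP16 inputs feeding a Concat, then one shared Q/DQ pair, then Conv.
inputs = [
    helper.make_tensor_value_info(n, TensorProto.FLOAT16, [1, 16, 32, 32])
    for n in ("A", "B", "C")
]
output = helper.make_tensor_value_info("out", TensorProto.FLOAT16, [1, 48, 32, 32])

# Placeholder quantization parameters; a real model uses calibrated scales.
scale = helper.make_tensor("scale", TensorProto.FLOAT16, [],
                           np.float16(0.02).tobytes(), raw=True)
zero = helper.make_tensor("zero", TensorProto.INT8, [], [0])
weight = helper.make_tensor("W", TensorProto.FLOAT16, [48, 48, 1, 1],
                            np.zeros((48, 48, 1, 1), np.float16).tobytes(), raw=True)

nodes = [
    helper.make_node("Concat", ["A", "B", "C"], ["cat"], axis=1),
    # One Q/DQ pair on the concatenated tensor instead of one per input.
    helper.make_node("QuantizeLinear", ["cat", "scale", "zero"], ["cat_q"]),
    helper.make_node("DequantizeLinear", ["cat_q", "scale", "zero"], ["cat_dq"]),
    helper.make_node("Conv", ["cat_dq", "W"], ["out"]),
]

graph = helper.make_graph(nodes, "concat_single_qdq", inputs, [output],
                          initializer=[scale, zero, weight])
# Opset 19 is needed for FP16 QuantizeLinear/DequantizeLinear.
model = helper.make_model(graph, opset_imports=[helper.make_opsetid("", 19)])
onnx.checker.check_model(model)
onnx.save(model, "concat_single_qdq.onnx")
```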
- Q/DQ placement with ConvTranspose:
With implicit quantization, TRT fuses ConvTranspose with the following activation, and all the published recommendations say Q/DQ nodes should be placed like this:
input -> Q -> DQ -> ConvTranspose -> Activation -> Q -> DQ -> output
However, with this placement TRT fails to fuse ConvTranspose and the activation, and the result is invalid output. I am forced to place the nodes like this instead (see the GraphSurgeon sketch below):
input -> Q -> DQ -> ConvTranspose -> Q -> DQ -> Activation -> Q -> DQ -> output
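For concreteness, here is roughly how the extra Q/DQ pair can be spliced in with ONNX GraphSurgeon. This is a sketch, assuming each ConvTranspose feeds exactly one activation node; the file names and the scale/zero-point constants are placeholders, not calibrated values:

```python
import numpy as np
import onnx
import onnx_graphsurgeon as gs

graph = gs.import_onnx(onnx.load("model_qdq.onnx"))

for node in graph.nodes:
    if node.op != "ConvTranspose":
        continue
    consumers = node.outputs[0].outputs
    if len(consumers) != 1 or consumers[0].op not in ("Relu", "LeakyRelu"):
        continue
    act = consumers[0]
    # Placeholder quantization parameters for the spliced Q/DQ pair.
    scale = gs.Constant(f"{node.name}_s", np.array(0.02, dtype=np.float16))
    zero = gs.Constant(f"{node.name}_z", np.array(0, dtype=np.int8))
    q_out = gs.Variable(f"{node.name}_q", dtype=np.int8)
    dq_out = gs.Variable(f"{node.name}_dq", dtype=np.float16)
    q = gs.Node("QuantizeLinear", inputs=[node.outputs[0], scale, zero],
                outputs=[q_out])
    dq = gs.Node("DequantizeLinear", inputs=[q_out, scale, zero],
                 outputs=[dq_out])
    # Rewire the activation to read the dequantized tensor instead.
    act.inputs[0] = dq_out
    graph.nodes.extend([q, dq])

graph.cleanup().toposort()
onnx.save(gs.export_onnx(graph), "model_qdq_fixed.onnx")
```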
- Explicitly quantized convolutions are slower than implicitly quantized ones:
Profiling the explicitly quantized model is consistently ~5% slower, and the gap seems to come down mostly to tactic selection. IAlgorithmSelector is deprecated, and I cannot work out how to use the editable timing cache for the CaskConvolution nodes because the verbose logs contain no cache keys at all. My build setup is sketched below.
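This is roughly what my build looks like (a sketch: network population is elided and the cache file name is a placeholder; I am relying on BuilderFlag.EDITABLE_TIMING_CACHE from the TRT 10 Python API):

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.VERBOSE)
builder = trt.Builder(logger)
network = builder.create_network(0)
# ... populate the network here, e.g. with trt.OnnxParser ...

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)
config.set_flag(trt.BuilderFlag.EDITABLE_TIMING_CACHE)

# Start from an empty cache; on later builds, load the serialized file instead.
cache = config.create_timing_cache(b"")
config.set_timing_cache(cache, ignore_mismatch=False)

engine_bytes = builder.build_serialized_network(network, config)

# Persist the cache so its entries can be inspected/edited and fed back into
# a rebuild to pin specific tactics.
with open("timing.cache", "wb") as f:
    f.write(config.get_timing_cache().serialize())
```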
Additional issue: since my network has FP16 inputs, I expect TRT to consume them directly without any reformats. However, without the DIRECT_IO flag TRT always converts FP16 to FP32 and then back to FP16. DIRECT_IO is deprecated; what should I use as an alternative?
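Is setting the I/O tensor dtype and allowed formats on the network the intended replacement? Something like this sketch (builder flow abbreviated; the network population is elided):

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(0)
# ... populate the network here ...

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)

# Pin every network input/output to FP16 with a linear layout so TRT has no
# reason to insert FP32 casts or vectorized reformats at the bindings.
for i in range(network.num_inputs):
    t = network.get_input(i)
    t.dtype = trt.float16
    t.allowed_formats = 1 << int(trt.TensorFormat.LINEAR)
for i in range(network.num_outputs):
    t = network.get_output(i)
    t.dtype = trt.float16
    t.allowed_formats = 1 << int(trt.TensorFormat.LINEAR)

engine_bytes = builder.build_serialized_network(network, config)
```

Or is strongly typed mode (trt.NetworkDefinitionCreationFlag.STRONGLY_TYPED) the recommended way to keep the ONNX dtypes end to end?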
Environment
TensorRT Version: 10.8.0.43
NVIDIA GPU: RTX 3060 Ti
NVIDIA Driver Version: 572.47
CUDA Version: 12.8.0
CUDNN Version: 9.7.1.26
Operating System: Windows 11