TensorRT
Explicit quantization is slower than implicit quantization and produces invalid results
Description
Since implicit quantization is deprecated, I started migrating my model pipeline to explicit quantization. However, I encountered some issues:
- Different behaviour with concat:
With implicit quantization the graph looks like this:
A(fp16:linear) -> Concat
B(fp16:linear) -> Concat
C(fp16:linear) -> Concat -> Quantize+Reformat -> Conv
In effect, Concat is replaced with a plain copy, since all of its inputs are already aligned.
However, with explicit quantization the graph becomes:
A(fp16:linear) -> Quantize -> Concat
B(fp16:linear) -> Quantize -> Concat
C(fp16:linear) -> Quantize -> Concat -> Reformat -> Conv
TRT moved the Quantize nodes ahead of the Concat, which produces a suboptimal graph that is ~30% slower. No matter what I tried, I could not reproduce the implicit-quantization plan with an explicitly quantized model; the placement I am aiming for is sketched below.
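For reference, this is the pattern I want TRT to plan efficiently: a single Q/DQ pair after the Concat instead of one per input. A minimal ONNX sketch (all names, shapes, and scale values here are made up for illustration, not taken from my real model):

```python
import numpy as np
import onnx
from onnx import TensorProto, helper

# Three FP16 inputs feeding a Concat, then one shared Q/DQ pair, then Conv.
inputs = [
    helper.make_tensor_value_info(n, TensorProto.FLOAT16, [1, 16, 32, 32])
    for n in ("A", "B", "C")
]
output = helper.make_tensor_value_info("out", TensorProto.FLOAT16, [1, 48, 32, 32])

# Placeholder quantization parameters; a real model uses calibrated scales.
scale = helper.make_tensor("scale", TensorProto.FLOAT16, [],
                           np.float16(0.02).tobytes(), raw=True)
zero = helper.make_tensor("zero", TensorProto.INT8, [], [0])
weight = helper.make_tensor("W", TensorProto.FLOAT16, [48, 48, 1, 1],
                            np.zeros((48, 48, 1, 1), np.float16).tobytes(), raw=True)

nodes = [
    helper.make_node("Concat", ["A", "B", "C"], ["cat"], axis=1),
    # One Q/DQ pair on the concatenated tensor instead of one per input.
    helper.make_node("QuantizeLinear", ["cat", "scale", "zero"], ["cat_q"]),
    helper.make_node("DequantizeLinear", ["cat_q", "scale", "zero"], ["cat_dq"]),
    helper.make_node("Conv", ["cat_dq", "W"], ["out"]),
]

graph = helper.make_graph(nodes, "concat_single_qdq", inputs, [output],
                          initializer=[scale, zero, weight])
# Opset 19 is needed for FP16 QuantizeLinear/DequantizeLinear.
model = helper.make_model(graph, opset_imports=[helper.make_opsetid("", 19)])
onnx.checker.check_model(model)
onnx.save(model, "concat_single_qdq.onnx")
```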
- Q/DQ placement with ConvTranspose:
With implicit quantization, TRT fuses ConvTranspose with the following activation, and all the published recommendations say Q/DQ nodes should be placed like this:
input -> Q -> DQ -> ConvTranspose -> Activation -> Q -> DQ -> output
However, with this placement TRT fails to fuse ConvTranspose and the activation, and the result is invalid output. I am forced to place the nodes like this instead (see the GraphSurgeon sketch below):
input -> Q -> DQ -> ConvTranspose -> Q -> DQ -> Activation -> Q -> DQ -> output
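For concreteness, here is roughly how the extra Q/DQ pair can be spliced in with ONNX GraphSurgeon. This is a sketch, assuming each ConvTranspose feeds exactly one activation node; the file names and the scale/zero-point constants are placeholders, not calibrated values:

```python
import numpy as np
import onnx
import onnx_graphsurgeon as gs

graph = gs.import_onnx(onnx.load("model_qdq.onnx"))

for node in graph.nodes:
    if node.op != "ConvTranspose":
        continue
    consumers = node.outputs[0].outputs
    if len(consumers) != 1 or consumers[0].op not in ("Relu", "LeakyRelu"):
        continue
    act = consumers[0]
    # Placeholder quantization parameters for the spliced Q/DQ pair.
    scale = gs.Constant(f"{node.name}_s", np.array(0.02, dtype=np.float16))
    zero = gs.Constant(f"{node.name}_z", np.array(0, dtype=np.int8))
    q_out = gs.Variable(f"{node.name}_q", dtype=np.int8)
    dq_out = gs.Variable(f"{node.name}_dq", dtype=np.float16)
    q = gs.Node("QuantizeLinear", inputs=[node.outputs[0], scale, zero],
                outputs=[q_out])
    dq = gs.Node("DequantizeLinear", inputs=[q_out, scale, zero],
                 outputs=[dq_out])
    # Rewire the activation to read the dequantized tensor instead.
    act.inputs[0] = dq_out
    graph.nodes.extend([q, dq])

graph.cleanup().toposort()
onnx.save(gs.export_onnx(graph), "model_qdq_fixed.onnx")
```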
- Explicitly quantized convolutions are slower than implicitly quantized ones:
Profiling the explicitly quantized model is consistently ~5% slower, and the gap seems to come down mostly to tactic selection. IAlgorithmSelector is deprecated, and I cannot work out how to use the editable timing cache for the CaskConvolution nodes because the verbose logs contain no cache keys at all. My build setup is sketched below.
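This is roughly what my build looks like (a sketch: network population is elided and the cache file name is a placeholder; I am relying on BuilderFlag.EDITABLE_TIMING_CACHE from the TRT 10 Python API):

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.VERBOSE)
builder = trt.Builder(logger)
network = builder.create_network(0)
# ... populate the network here, e.g. with trt.OnnxParser ...

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)
config.set_flag(trt.BuilderFlag.EDITABLE_TIMING_CACHE)

# Start from an empty cache; on later builds, load the serialized file instead.
cache = config.create_timing_cache(b"")
config.set_timing_cache(cache, ignore_mismatch=False)

engine_bytes = builder.build_serialized_network(network, config)

# Persist the cache so its entries can be inspected/edited and fed back into
# a rebuild to pin specific tactics.
with open("timing.cache", "wb") as f:
    f.write(config.get_timing_cache().serialize())
```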
Additional issue: since my network has FP16 inputs, I expect TRT to consume them directly without any reformats. However, without the DIRECT_IO flag TRT always converts FP16 to FP32 and then back to FP16. DIRECT_IO is deprecated; what should I use as an alternative?
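Is setting the I/O tensor dtype and allowed formats on the network the intended replacement? Something like this sketch (builder flow abbreviated; the network population is elided):

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(0)
# ... populate the network here ...

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)

# Pin every network input/output to FP16 with a linear layout so TRT has no
# reason to insert FP32 casts or vectorized reformats at the bindings.
for i in range(network.num_inputs):
    t = network.get_input(i)
    t.dtype = trt.float16
    t.allowed_formats = 1 << int(trt.TensorFormat.LINEAR)
for i in range(network.num_outputs):
    t = network.get_output(i)
    t.dtype = trt.float16
    t.allowed_formats = 1 << int(trt.TensorFormat.LINEAR)

engine_bytes = builder.build_serialized_network(network, config)
```

Or is strongly typed mode (trt.NetworkDefinitionCreationFlag.STRONGLY_TYPED) the recommended way to keep the ONNX dtypes end to end?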
Environment
TensorRT Version: 10.8.0.43
NVIDIA GPU: RTX 3060 Ti
NVIDIA Driver Version: 572.47
CUDA Version: 12.8.0
CUDNN Version: 9.7.1.26
Operating System: Windows 11