TensorRT GPU Latency failure for FP16, INT8, mixed precision (FP16+INT8) models of TensorRT 8.6 when running trtexec on GPU A100

Description

Hello, I'm converting a model via torch -> onnx -> trt in three variants: fp16, int8, and mixed precision (fp16 + int8). After the conversion, the fp16 model has the lowest latency, i.e. it is faster than both the int8 and the mixed-precision models. Why is that?

Environment

TensorRT Version: 8.6

NVIDIA GPU: A100

NVIDIA Driver Version: 530.30.02

CUDA Version: 12.1

CUDNN Version:

Operating System: Ubuntu 22.04

Python Version (if applicable): 3.10

PyTorch Version (if applicable): 2.1

Baremetal or Container (if so, version): nvcr.io/nvidia/pytorch:23.08-py3

Relevant Files

Model link: "vit_base_patch32_224_clip_laion2b" model from timm.models

Steps To Reproduce

Using the pytorch_quantization library we do:

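from pytorch_quantization import nn as quant_nn
from pytorch_quantization import quant_modules
from pytorch_quantization.tensor_quant import QuantDescriptor

# Monkey-patch torch.nn layers with their quantized counterparts.
quant_modules.initialize()

# Default descriptor for inputs and weights of Conv2d and Linear layers.
quant_desc = QuantDescriptor(num_bits=16)
quant_nn.QuantConv2d.set_default_quant_desc_input(quant_desc)
quant_nn.QuantLinear.set_default_quant_desc_input(quant_desc)
quant_nn.QuantConv2d.set_default_quant_desc_weight(quant_desc)
quant_nn.QuantLinear.set_default_quant_desc_weight(quant_desc)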

Create a model object in Python (FakeQuant nodes are added automatically because of the quant_modules.initialize() call).

from timm.models import create_model

m_name = "vit_base_patch32_224_clip_laion2b"
qat_model = create_model(m_name, num_classes=8, exportable=True)

(Optional) If the target precision is int8 rather than fp16, set num_bits=8 in the first step, like this:

quant_desc = QuantDescriptor(num_bits=8)

(Optional) For mixed precision, we first create the descriptor with num_bits=16, then selectively switch the input_quantizer and weight_quantizer of individual layers to 8 bits, like this:

from pytorch_quantization.nn import TensorQuantizer
qat_model.patch_embed.proj._input_quantizer = TensorQuantizer(quant_desc=QuantDescriptor(num_bits=8))

We calibrate FakeQuant nodes and do QAT.
Do torch.onnx.export.
Simplify the onnx model:

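import os
import onnx
import onnxsim as onnx_simplifier  # assumed alias; the simplifier module is onnxsim

onnx_model = onnx.load(os.path.join(SAVE_PATH, "<model_name>.onnx"))
model_simp, check = onnx_simplifier.simplify(onnx_model, check_n=0)
onnx.save(model_simp, os.path.join(SAVE_PATH, "<model_name>.onnx"))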

Then we convert onnx to trt using the trtexec utility. For fp16 or int8 precision, the commands are as follows.

fp16:

trtexec\
     --onnx={os.path.join(SAVE_PATH, '<model_name>.onnx')} \
     --minShapes=input:1x3x224x224 \
     --optShapes=input:10x3x224x224 \
     --maxShapes=input:64x3x224x224 \
     --explicitBatch\
     --saveEngine={os.path.join(SAVE_PATH, '<model_name>.trt')} \
     --exportTimes={os.path.join(SAVE_PATH, 'timing_results.json')} \
     --inputIOFormats=fp16:chw --outputIOFormats=fp16:chw --fp16

int8:

trtexec\
     --onnx={os.path.join(SAVE_PATH, '<model_name>.onnx')} \
     --minShapes=input:1x3x224x224 \
     --optShapes=input:10x3x224x224 \
     --maxShapes=input:64x3x224x224 \
     --explicitBatch\
     --saveEngine={os.path.join(SAVE_PATH, '<model_name>.trt')} \
     --exportTimes={os.path.join(SAVE_PATH, 'timing_results.json')} \
     --inputIOFormats=fp16:chw --outputIOFormats=fp16:chw --int8
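
The fp16 and int8 commands above differ only in the final precision flag, so they can be assembled from one helper. The following is a minimal sketch, not the script actually used in this issue: build_trtexec_cmd and the file names model.onnx, model_fp16.trt and model_int8.trt are hypothetical, and only flags that already appear in the commands above are used.

```python
import shlex


def build_trtexec_cmd(onnx_path, engine_path, precision_flags,
                      input_name="input", min_batch=1, opt_batch=10, max_batch=64):
    """Return the trtexec argument list for one precision configuration.

    Hypothetical helper: it only mirrors the flags shown in the commands above.
    """
    chw = "3x224x224"  # channel/height/width part of the input shape
    cmd = [
        "trtexec",
        f"--onnx={onnx_path}",
        f"--minShapes={input_name}:{min_batch}x{chw}",
        f"--optShapes={input_name}:{opt_batch}x{chw}",
        f"--maxShapes={input_name}:{max_batch}x{chw}",
        "--explicitBatch",
        f"--saveEngine={engine_path}",
        "--inputIOFormats=fp16:chw",
        "--outputIOFormats=fp16:chw",
    ]
    # Precision flags go last, e.g. ["--fp16"], ["--int8"] or ["--fp16", "--int8"].
    cmd.extend(precision_flags)
    return cmd


# Placeholder paths; substitute the real SAVE_PATH locations.
fp16_cmd = build_trtexec_cmd("model.onnx", "model_fp16.trt", ["--fp16"])
int8_cmd = build_trtexec_cmd("model.onnx", "model_int8.trt", ["--int8"])

# Print shell-quoted command lines; subprocess.run(fp16_cmd) would execute one.
print(shlex.join(fp16_cmd))
print(shlex.join(int8_cmd))
```

Writing each engine to its own file keeps the fp16 and int8 builds from overwriting each other when both are produced in one run.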

For mixed precision, we first build a string variable "LAYERS_PRECISION" that holds the precision of each layer, collected by iterating over the onnx layers of the model. The result looks something like: LAYERS_PRECISION="layer1:int8,layer2:int8,layer3:fp16,...,layerN:fp16," Then we execute the following command:

trtexec\
     --onnx={os.path.join(SAVE_PATH, '<model_name>.onnx')} \
     --fp16 --int8 \
     --precisionConstraints=obey --layerPrecisions={LAYERS_PRECISION} \
     --minShapes=input:1x3x224x224 \
     --optShapes=input:10x3x224x224 \
     --maxShapes=input:64x3x224x224 \
     --explicitBatch\
     --inputIOFormats=fp16:chw --outputIOFormats=fp16:chw \
     --saveEngine={os.path.join(SAVE_PATH, '<model_name>.trt')}
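
The LAYERS_PRECISION string described above could be assembled along these lines. This is a sketch under assumptions: build_layer_precisions is a hypothetical helper, the layer names are placeholders, and the set of int8 layers is chosen by hand. It joins the entries with commas, so unlike the example string above it does not end with a trailing comma.

```python
def build_layer_precisions(layer_names, int8_layers, default="fp16"):
    """Build a "name:precision,name:precision" string for --layerPrecisions.

    layer_names: layer names in graph order.
    int8_layers: names that should be constrained to int8.
    default:     precision assigned to every other layer.
    """
    int8_layers = set(int8_layers)
    pairs = []
    for name in layer_names:
        precision = "int8" if name in int8_layers else default
        pairs.append(f"{name}:{precision}")
    return ",".join(pairs)


# With the onnx package installed, the names could be read from the model:
#   names = [n.name for n in onnx.load(path).graph.node if n.name]
# Placeholder names are used here so the sketch runs without a model file.
names = ["layer1", "layer2", "layer3"]
LAYERS_PRECISION = build_layer_precisions(names, int8_layers={"layer1", "layer2"})
print(LAYERS_PRECISION)  # layer1:int8,layer2:int8,layer3:fp16
```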

After all of the above we get the trt files. Checked both with trtexec and with the model-analyzer utility for trt-server, the int8 and mixed-precision models turn out to be slower than the fp16 model.
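
To compare the engines on numbers rather than by eye, the JSON written by --exportTimes can be summarized per engine. The sketch below assumes that the file is a JSON list of per-iteration records and that each record carries a "computeMs" field; that field name, the helper names and the two file names in the comment are assumptions, so check them against the actual file. Note that the commands above write every run to the same timing_results.json, so each precision would need its own file.

```python
import json
import statistics


def summarize(records, key="computeMs"):
    """Summarize one timing field over a non-empty list of per-iteration records."""
    values = sorted(r[key] for r in records)
    p99_index = min(len(values) - 1, int(0.99 * len(values)))
    return {
        "count": len(values),
        "mean": statistics.fmean(values),
        "median": statistics.median(values),
        "p99": values[p99_index],
    }


def load_records(path):
    """Load the list of timing records from an --exportTimes JSON file."""
    with open(path) as f:
        return json.load(f)


def slowdown(baseline_records, candidate_records, key="computeMs"):
    """Ratio of median times; a value above 1.0 means the candidate is slower."""
    base = summarize(baseline_records, key)["median"]
    cand = summarize(candidate_records, key)["median"]
    return cand / base


# Example with hypothetical file names:
#   ratio = slowdown(load_records("timing_fp16.json"), load_records("timing_int8.json"))
```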

Commands or scripts: see above

Have you tried the latest release?: yes

Can this model run on other frameworks? For example run ONNX model with ONNXRuntime (polygraphy run <model.onnx> --onnxrt): N/A

Jan 12 '24 04:01 bcd8697

How is the perf of

trtexec\
     --onnx={os.path.join(SAVE_PATH, '<model_name>.onnx')} \
     --fp16 --int8 \
     --minShapes=input:1x3x224x224 \
     --optShapes=input:10x3x224x224 \
     --maxShapes=input:64x3x224x224 \
     --explicitBatch\
     --inputIOFormats=fp16:chw --outputIOFormats=fp16:chw \
     --saveEngine={os.path.join(SAVE_PATH, '<model_name>.trt')}

Jan 15 '24 14:01 zerollzeng

@zerollzeng I tried this but the latency is still higher than just fp16...

Jan 15 '24 18:01 bcd8697

Could you please share the onnx here? If it's a QAT model, --int8 should be required otherwise TRT will throw an error.

Jan 16 '24 01:01 zerollzeng

You may have hit a known issue in TRT 8.6 that is fixed in TRT 9.2. Could you please try the latest TRT 9.2? You can download it from the links below:

https://developer.nvidia.com/downloads/compute/machine-learning/tensorrt/9.2.0/tensorrt-9.2.0.5.linux.x86_64-gnu.cuda-11.8.tar.gz
https://developer.nvidia.com/downloads/compute/machine-learning/tensorrt/9.2.0/tensorrt-9.2.0.5.linux.x86_64-gnu.cuda-12.2.tar.gz
https://developer.nvidia.com/downloads/compute/machine-learning/tensorrt/9.2.0/tensorrt-9.2.0.5.ubuntu-22.04.aarch64-gnu.cuda-12.2.tar.gz

Jan 16 '24 01:01 zerollzeng

@zerollzeng Here is the link to a zip archive with two of my onnx models: FP16 and mixed precision (FP16-INT8), generated without the --precisionConstraints=obey --layerPrecisions={LAYERS_PRECISION} flags, as you proposed earlier. https://drive.google.com/file/d/1dfIufa2aOnLKg2z1zwxd491730mMZcMt/view?usp=sharing

After converting to trt files, FP16 turns out to be faster in execution speed than the mixed precision model.

BTW, what exactly is the issue in TRT 8.6 that is fixed in TRT 9.2?

Thanks

Jan 18 '24 08:01 bcd8697

You just hit a bug that is fixed in TRT 9.2 :-)

Jan 19 '24 09:01 zerollzeng

@zerollzeng Thanks, I have installed and tried TRT 9.2. It doesn't seem to help: the latency of FP16 is still lower than that of mixed precision (FP16 + INT8).

Do you have any other suggestions?

Jan 22 '24 19:01 bcd8697

Could you please share the onnx that can reproduce this issue?

Jan 27 '24 08:01 zerollzeng

@zerollzeng Yes, sure, here you are: https://drive.google.com/drive/folders/1DPb0HigtNiI9Z8TCn7z0HTL0PYsPYwJ4?usp=sharing

I'd also like to know when TRT v9.2 will be released in docker images.

Jan 27 '24 09:01 bcd8697

We didn't release it in the official docker image since it's a limited EA release, but you can build the docker image using https://github.com/NVIDIA/TensorRT/blob/release/9.2/docker/ubuntu-20.04.Dockerfile

Jan 27 '24 11:01 zerollzeng

https://drive.google.com/drive/folders/1DPb0HigtNiI9Z8TCn7z0HTL0PYsPYwJ4?usp=sharing

May I ask why I see 2 onnx models here?

Jan 27 '24 11:01 zerollzeng

@zerollzeng One onnx is for FP16 precision and the second one is for mixed precision (FP16 + INT8)

Jan 27 '24 23:01 bcd8697

That's weird, you should only need 1 onnx. What if you compare the perf using only 1 onnx? Just build it once as full fp16 and once as mixed precision.

Jan 28 '24 09:01 zerollzeng

@zerollzeng When FakeQuant nodes are created using

quant_modules.initialize()
quant_desc = QuantDescriptor(num_bits=16)

for further QAT, I must specify the num_bits parameter.

Shouldn't I pass only the --fp16 flag when all layers have num_bits=16, and both --fp16 and --int8 when I change some FakeQuant nodes to INT8?

I mean, how does trtexec know which layers I need to convert to int8 precision if I don't specify it anywhere?

I have two onnx files because for FP16 I specify num_bits=16, then do QAT, then export to ONNX, and then convert to a TRT file. For mixed precision, I specify num_bits=16 and then manually set num_bits=8 for the layers I want, followed by the same QAT, ONNX and TRT stages. If mixed precision needs to be done differently, then please tell me.
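
One way to see how the two exported onnx files actually differ is to count the QuantizeLinear and DequantizeLinear nodes in each. For an ONNX graph that contains such Q/DQ nodes, TensorRT derives the INT8 placement from them. The sketch below is only an inspection aid, not a fix: count_qdq and qdq_counts_from_file are hypothetical helper names, and the file-based variant assumes the onnx package is installed.

```python
from collections import Counter

# ONNX operator types that carry quantization information.
QDQ_OPS = ("QuantizeLinear", "DequantizeLinear")


def count_qdq(op_types):
    """Count Q/DQ operators in an iterable of ONNX op_type strings."""
    counts = Counter(op_types)
    return {op: counts.get(op, 0) for op in QDQ_OPS}


def qdq_counts_from_file(path):
    """Count Q/DQ nodes in an ONNX file (needs the third-party onnx package)."""
    import onnx  # imported here so count_qdq stays usable without onnx

    model = onnx.load(path)
    return count_qdq(node.op_type for node in model.graph.node)


# Small in-memory example that needs no model file.
example = count_qdq(["Conv", "QuantizeLinear", "DequantizeLinear", "QuantizeLinear"])
print(example)  # {'QuantizeLinear': 2, 'DequantizeLinear': 1}
```

Running qdq_counts_from_file on the FP16 export and on the mixed-precision export shows whether the two graphs contain different quantization nodes at all.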

Jan 28 '24 09:01 bcd8697

@ttyio for above questions.

Feb 01 '24 13:02 zerollzeng

@zerollzeng @ttyio Should I wait for any answer about the issue?

Feb 15 '24 08:02 bcd8697
