TensorRT
GPU Latency failure for FP16, INT8, mixed precision (FP16+INT8) models of TensorRT 8.6 when running trtexec on GPU A100
Description
Hello, I'm trying to do a torch -> onnx -> trt model conversion, producing fp16, int8 and mixed-precision (fp16 + int8) variants. However, after the conversion is completed, the fp16 model has the lowest latency, i.e. it is faster than both the int8 and the mixed-precision models. Why is that?
Environment
TensorRT Version: 8.6
NVIDIA GPU: A100
NVIDIA Driver Version: 530.30.02
CUDA Version: 12.1
CUDNN Version:
Operating System: Ubuntu 22.04
Python Version (if applicable): 3.10
PyTorch Version (if applicable): 2.1
Baremetal or Container (if so, version): nvcr.io/nvidia/pytorch:23.08-py3
Relevant Files
Model link: "vit_base_patch32_224_clip_laion2b" model from timm.models
Steps To Reproduce
- Using the pytorch_quantization library we do:
from pytorch_quantization import nn as quant_nn
from pytorch_quantization import quant_modules
from pytorch_quantization.tensor_quant import QuantDescriptor
quant_modules.initialize()
quant_desc = QuantDescriptor(num_bits=16)
quant_nn.QuantConv2d.set_default_quant_desc_input(quant_desc)
quant_nn.QuantLinear.set_default_quant_desc_input(quant_desc)
quant_nn.QuantConv2d.set_default_quant_desc_weight(quant_desc)
quant_nn.QuantLinear.set_default_quant_desc_weight(quant_desc)
- Create a model object in Python (FakeQuant nodes are added automatically because of the quant_modules.initialize() call).
m_name = "vit_base_patch32_224_clip_laion2b"
qat_model = create_model(m_name, num_classes=8, exportable=True)
(Optional) If the target precision is int8 rather than fp16, specify num_bits=8 in the first step, like this:
quant_desc = QuantDescriptor(num_bits=8)
(Optional) For mixed precision, we first create all quantizers with num_bits=16, then selectively switch the input_quantizer and weight_quantizer of individual layers to 8 bits, like this:
qat_model.patch_embed.proj._input_quantizer = TensorQuantizer(quant_desc=QuantDescriptor(num_bits=8))
- We calibrate FakeQuant nodes and do QAT.
- Do torch.onnx.export.
- Simplify the onnx model through:
onnx_model = onnx.load(os.path.join(SAVE_PATH, "<model_name>.onnx"))
model_simp, check = onnx_simplifier.simplify(onnx_model, check_n=0)
onnx.save(model_simp, os.path.join(SAVE_PATH, "<model_name>.onnx"))
- Then we convert onnx to trt using the trtexec utility. If it is fp16 or int8 precision, then as follows:
fp16:
trtexec \
--onnx={os.path.join(SAVE_PATH, '<model_name>.onnx')} \
--minShapes=input:1x3x224x224 \
--optShapes=input:10x3x224x224 \
--maxShapes=input:64x3x224x224 \
--explicitBatch \
--saveEngine={os.path.join(SAVE_PATH, '<model_name>.trt')} \
--exportTimes={os.path.join(SAVE_PATH, 'timing_results.json')} \
--inputIOFormats=fp16:chw --outputIOFormats=fp16:chw --fp16
int8:
trtexec \
--onnx={os.path.join(SAVE_PATH, '<model_name>.onnx')} \
--minShapes=input:1x3x224x224 \
--optShapes=input:10x3x224x224 \
--maxShapes=input:64x3x224x224 \
--explicitBatch \
--saveEngine={os.path.join(SAVE_PATH, '<model_name>.trt')} \
--exportTimes={os.path.join(SAVE_PATH, 'timing_results.json')} \
--inputIOFormats=fp16:chw --outputIOFormats=fp16:chw --int8
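The two commands above differ only in the precision flag, so they can be assembled from one helper. The following is a minimal sketch, assuming the command is launched from Python; `build_trtexec_cmd` and the file names are hypothetical, and only flags that already appear in the commands above are used. Passing an argument list instead of a backslash-continued string also avoids accidentally gluing two flags together.

```python
import shlex


def build_trtexec_cmd(onnx_path, engine_path, precision_flags, timing_path=None):
    """Assemble a trtexec argument list (hypothetical helper).

    Each flag is a separate list element, so no line continuations are needed.
    """
    cmd = [
        "trtexec",
        f"--onnx={onnx_path}",
        "--minShapes=input:1x3x224x224",
        "--optShapes=input:10x3x224x224",
        "--maxShapes=input:64x3x224x224",
        "--explicitBatch",
        f"--saveEngine={engine_path}",
        "--inputIOFormats=fp16:chw",
        "--outputIOFormats=fp16:chw",
    ]
    if timing_path is not None:
        cmd.append(f"--exportTimes={timing_path}")
    cmd.extend(precision_flags)
    return cmd


# Placeholder file names, for illustration only.
fp16_cmd = build_trtexec_cmd("model.onnx", "model_fp16.trt", ["--fp16"], "timing_fp16.json")
int8_cmd = build_trtexec_cmd("model.onnx", "model_int8.trt", ["--int8"], "timing_int8.json")

# Print the command as a single shell-quoted line for inspection.
print(shlex.join(fp16_cmd))
# To actually run it: subprocess.run(fp16_cmd, check=True)
```

Writing each engine and timing file to its own path keeps the fp16 and int8 results from overwriting each other.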
If this is mixed precision, then we first create the string variable LAYERS_PRECISION and collect the precision of each layer in it by iterating over the ONNX layers of the model. The result is something like:
LAYERS_PRECISION="layer1:int8,layer2:int8,layer3:fp16,...,layerN:fp16,"
And then we execute the following command:
trtexec \
--onnx={os.path.join(SAVE_PATH, '<model_name>.onnx')} \
--fp16 --int8 \
--precisionConstraints=obey --layerPrecisions={LAYERS_PRECISION} \
--minShapes=input:1x3x224x224 \
--optShapes=input:10x3x224x224 \
--maxShapes=input:64x3x224x224 \
--explicitBatch \
--inputIOFormats=fp16:chw --outputIOFormats=fp16:chw \
--saveEngine={os.path.join(SAVE_PATH, '<model_name>.trt')}
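The LAYERS_PRECISION string used in the command above could be built roughly as follows. This is a minimal sketch, not the exact script used here: `build_layer_precisions` is a hypothetical helper, the layer names are placeholders (in practice they would come from iterating over the nodes of the ONNX graph), and the entries are joined without a trailing comma.

```python
def build_layer_precisions(layer_names, int8_layers, default="fp16"):
    """Return a 'name:precision,name:precision' string (hypothetical helper).

    Layers listed in int8_layers get 'int8'; all others get the default.
    """
    int8_layers = set(int8_layers)
    specs = [
        f"{name}:{'int8' if name in int8_layers else default}"
        for name in layer_names
    ]
    return ",".join(specs)


# Placeholder layer names, for illustration only.
layer_names = ["layer1", "layer2", "layer3"]
LAYERS_PRECISION = build_layer_precisions(layer_names, int8_layers=["layer1", "layer2"])
print(LAYERS_PRECISION)  # layer1:int8,layer2:int8,layer3:fp16
```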
Having done all of the above, we get trt files. When checked both with trtexec and with the model-analyzer utility for trt-server, the int8 and mixed-precision models show higher latency than the fp16 model.
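To compare the engines on the same footing, the per-iteration timings written by --exportTimes can be summarized side by side. The sketch below assumes each timing file is a JSON list of records with a `latencyMs` field; that key name is an assumption and should be checked against the actual file. The inline numbers are synthetic placeholders, not measurements from this issue.

```python
import json
import statistics


def load_timings(path):
    """Load a timing file, assumed to be a JSON list of per-iteration records."""
    with open(path) as f:
        return json.load(f)


def summarize_latency(entries, key="latencyMs"):
    """Return mean/median/min/max of one timing field (key name is an assumption)."""
    values = sorted(float(entry[key]) for entry in entries)
    return {
        "mean": statistics.fmean(values),
        "median": statistics.median(values),
        "min": values[0],
        "max": values[-1],
    }


# Synthetic values for illustration only; in practice use
# load_timings("timing_fp16.json") and load_timings("timing_int8.json").
fp16_entries = [{"latencyMs": 2.0}, {"latencyMs": 2.2}, {"latencyMs": 2.1}]
int8_entries = [{"latencyMs": 2.5}, {"latencyMs": 2.4}, {"latencyMs": 2.6}]

fp16_stats = summarize_latency(fp16_entries)
int8_stats = summarize_latency(int8_entries)
ratio = int8_stats["median"] / fp16_stats["median"]
print(f"int8/fp16 median latency ratio: {ratio:.2f}")
```

A ratio above 1 means the int8 engine is slower than the fp16 engine for the measured shape.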
Commands or scripts: see above
Have you tried the latest release?: yes
Can this model run on other frameworks? For example run ONNX model with ONNXRuntime (polygraphy run <model.onnx> --onnxrt): N/A
How is the perf of
trtexec \
--onnx={os.path.join(SAVE_PATH, '<model_name>.onnx')} \
--fp16 --int8 \
--minShapes=input:1x3x224x224 \
--optShapes=input:10x3x224x224 \
--maxShapes=input:64x3x224x224 \
--explicitBatch \
--inputIOFormats=fp16:chw --outputIOFormats=fp16:chw \
--saveEngine={os.path.join(SAVE_PATH, '<model_name>.trt')}
@zerollzeng I tried this but the latency is still higher than just fp16...
Could you please share the onnx here? If it's a QAT model, --int8 is required; otherwise TRT will throw an error.
You may have hit a known issue in TRT 8.6 that is fixed in TRT 9.2. Could you please try the latest TRT 9.2? You can download it from the links below:
https://developer.nvidia.com/downloads/compute/machine-learning/tensorrt/9.2.0/tensorrt-9.2.0.5.linux.x86_64-gnu.cuda-11.8.tar.gz
https://developer.nvidia.com/downloads/compute/machine-learning/tensorrt/9.2.0/tensorrt-9.2.0.5.linux.x86_64-gnu.cuda-12.2.tar.gz
https://developer.nvidia.com/downloads/compute/machine-learning/tensorrt/9.2.0/tensorrt-9.2.0.5.ubuntu-22.04.aarch64-gnu.cuda-12.2.tar.gz
@zerollzeng
Here is the link to a zip archive with my two ONNX models, FP16 and mixed precision (FP16-INT8), generated without the --precisionConstraints=obey --layerPrecisions={LAYERS_PRECISION} flags, as you proposed earlier.
https://drive.google.com/file/d/1dfIufa2aOnLKg2z1zwxd491730mMZcMt/view?usp=sharing
After converting to trt files, FP16 turns out to be faster in execution speed than the mixed precision model.
BTW, what exactly is the issue in TRT 8.6 that is fixed in TRT 9.2?
Thanks
You just hit a bug that is fixed in TRT 9.2 :-)
@zerollzeng Thanks, I have installed and tried TRT 9.2. It doesn't seem to help: the latency of FP16 is still lower than that of mixed precision (FP16 + INT8).
Any other suggestions?
Could you please share the onnx that can reproduce this issue?
@zerollzeng yes, sure here you are https://drive.google.com/drive/folders/1DPb0HigtNiI9Z8TCn7z0HTL0PYsPYwJ4?usp=sharing
I'm also interested to know when TRT v9.2 will be released in docker images.
We didn't release it in the official docker image since it's a limited EA release, but you can build the docker image using https://github.com/NVIDIA/TensorRT/blob/release/9.2/docker/ubuntu-20.04.Dockerfile
https://drive.google.com/drive/folders/1DPb0HigtNiI9Z8TCn7z0HTL0PYsPYwJ4?usp=sharing
May I ask why I see 2 onnx models here?
@zerollzeng One onnx is for FP16 precision and the second one is for mixed precision (FP16 + INT8)
That's weird, you should only need 1 onnx. What if you compare the perf using only 1 onnx? Just build it once with full fp16 and once with mixed precision.
@zerollzeng When FakeQuant nodes are created using
quant_modules.initialize()
quant_desc = QuantDescriptor(num_bits=16)
for further QAT, then I must specify the num_bits parameter.
Shouldn't I make it so that when all layers have num_bits=16, then I only specify the --fp16 flag?
And when I change some FakeQuant nodes to INT8, then with trtexec I specify both --fp16 and --int8?
I mean, how does trtexec know which layers I need to convert to int8 precision if I don't specify it anywhere?
I have two onnx because to convert to FP16 I specify num_bits=16, then I do QAT, then I convert to ONNX, and then to a TRT file.
To convert to Mixed precision, I specify num_bits=16 and then manually specify the layers I want and specify num_bits=8 for them. Then also QAT, ONNX and TRT stages.
If Mixed precision needs to be done differently, then please tell me.
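For context on the question above: for QAT models exported with FakeQuant nodes, TensorRT generally takes INT8 placement from the QuantizeLinear/DequantizeLinear (Q/DQ) nodes present in the ONNX graph, so counting those nodes is one way to see how the two exported models differ. The following is a minimal sketch under that assumption; `count_qdq` is a hypothetical helper and the op list is a placeholder rather than output from the models in this issue.

```python
from collections import Counter

# ONNX operator names for quantize/dequantize nodes.
QDQ_OPS = ("QuantizeLinear", "DequantizeLinear")


def count_qdq(op_types):
    """Count Q/DQ operators in a sequence of ONNX op-type strings (hypothetical helper)."""
    counts = Counter(op_types)
    return {op: counts.get(op, 0) for op in QDQ_OPS}


# With the onnx package installed, the op types could be collected as:
#   import onnx
#   op_types = [node.op_type for node in onnx.load("model.onnx").graph.node]
# Placeholder list, for illustration only.
op_types = ["Conv", "QuantizeLinear", "DequantizeLinear", "MatMul", "Add"]
print(count_qdq(op_types))  # {'QuantizeLinear': 1, 'DequantizeLinear': 1}
```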
@ttyio for above questions.
@zerollzeng @ttyio Should I wait for any answer about the issue?