INT8 Quantization Performance Issue with BERT-like Model
I am currently working on INT8 quantization for a BERT-like embedding model. In the last issue I raised, you mentioned that TensorRT does not currently support INT8 calibration for BERT-like models and suggested that I use the model_opt tool instead.
My previous issue (which includes the specific model structure): https://github.com/NVIDIA/TensorRT/issues/4058
After optimizing the ONNX model with your model_opt tool, I tested inference with trtexec and found that the INT8 engine's execution-context memory is significantly larger than the FP16 engine's, and its QPS is slightly lower. My expectation was that the INT8 engine would need less context memory and deliver higher QPS than the FP16 engine.
Is there something wrong here?
Environment
GPU type: A100
NVIDIA driver version: 525.105.17
CUDA version: 12.5
Python version: 3.10.2
TensorRT version: 10.1.0.27
Docker image: nvcr.io/nvidia/tensorrt:24.06-py3
Workflow
- First, I used the model_opt tool to convert the model loaded from Hugging Face via AutoModel into a quantized model, and then exported it to an ONNX file. The key code is as follows (a sketch of the calibration loop follows the snippet):
import torch
import modelopt.torch.quantization as mtq
from transformers import AutoModel

self.model = AutoModel.from_pretrained(self.model_path)
...
# Calibrate with the SmoothQuant INT8 config, then export the fake-quantized model
config = mtq.INT8_SMOOTHQUANT_CFG
ptq_model = mtq.quantize(self.model, config, self.forward_loop)
torch.onnx.export(
    ptq_model,
    (self.calibration_data[0]['input_ids'], self.calibration_data[0]['attention_mask']),
    onnx_path,
    input_names=['input_ids', 'attention_mask'],  # must match the tensor names in the trtexec shape flags
    dynamic_axes={'input_ids': {0: 'batch', 1: 'seq'}, 'attention_mask': {0: 'batch', 1: 'seq'}},
    opset_version=17)
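For completeness, the self.forward_loop passed to mtq.quantize just runs the model over the calibration batches so the quantizers can collect amax statistics. A minimal sketch, assuming self.calibration_data holds tokenized batches (the batching details here are illustrative):

def forward_loop(self, model):
    # ModelOpt calls this with the model under calibration; running a few
    # representative batches lets each TensorQuantizer record its statistics.
    model.eval()
    with torch.no_grad():
        for batch in self.calibration_data:
            model(input_ids=batch['input_ids'],
                  attention_mask=batch['attention_mask'])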
- Then, I used trtexec to build the engines for the INT8 model and the FP16 model, respectively (a note on inspecting layer precisions follows the commands):
# for int8 model
trtexec --onnx=${model_path} --saveEngine=${engine_path} \
--builderOptimizationLevel=4 \
--minShapes=input_ids:1x1,attention_mask:1x1 \
--optShapes=input_ids:16x128,attention_mask:16x128 \
--maxShapes=input_ids:128x512,attention_mask:128x512 \
--stronglyTyped \
--verbose
# for fp16 model
trtexec --onnx=${model_path} --saveEngine=${engine_path} \
--minShapes=input_ids:1x1,attention_mask:1x1 \
--optShapes=input_ids:16x128,attention_mask:16x128 \
--maxShapes=input_ids:128x512,attention_mask:128x512 \
--fp16 \
--verbose
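For reference, the per-layer precisions the builder actually selected can be inspected with trtexec's layer-info flags, which helps verify that the INT8 engine really runs INT8 kernels. A diagnostic sketch (paths are placeholders):

# Dump per-layer information, including the precision chosen for each layer
trtexec --loadEngine=${engine_path} \
        --profilingVerbosity=detailed \
        --dumpLayerInfo \
        --exportLayerInfo=layer_info.json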
- I noticed in the logs that the performance of the INT8 engine is significantly worse than that of the FP16 engine.
- Relevant logs for the INT8 run:
[08/12/2024-09:31:26] [I] [TRT] Loaded engine size: 819 MiB
[08/12/2024-09:31:26] [V] [TRT] Deserialization required 133245 microseconds.
[08/12/2024-09:31:26] [I] Engine deserialized in 1.17933 sec.
[08/12/2024-09:31:26] [V] [TRT] Total per-runner device persistent memory is 0
[08/12/2024-09:31:26] [V] [TRT] Total per-runner host persistent memory is 32
[08/12/2024-09:31:26] [V] [TRT] Allocated device scratch memory of size 5889851904
[08/12/2024-09:31:26] [V] [TRT] - Runner scratch: 5889851904 bytes
[08/12/2024-09:31:26] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +5617, now: CPU 0, GPU 6433 (MiB)
[08/12/2024-09:31:26] [I] Setting persistentCacheLimit to 0 bytes.
[08/12/2024-09:31:26] [I] Set shape of input tensor input_ids to: 16x128
[08/12/2024-09:31:26] [I] Set shape of input tensor attention_mask to: 16x128
[08/12/2024-09:31:26] [I] Created execution context with device memory size: 5617 MiB
[08/12/2024-09:31:29] [I] === Performance summary ===
[08/12/2024-09:31:29] [I] Throughput: 255.879 qps
[08/12/2024-09:31:29] [I] Latency: min = 3.67041 ms, max = 8.69336 ms, mean = 4.1671 ms, median = 3.85425 ms, percentile(90%) = 5.19507 ms, percentile(95%) = 5.67383 ms, percentile(99%) = 6.3772 ms
[08/12/2024-09:31:29] [I] Enqueue Time: min = 2.13928 ms, max = 8.49683 ms, mean = 3.8633 ms, median = 3.54565 ms, percentile(90%) = 4.89307 ms, percentile(95%) = 5.32812 ms, percentile(99%) = 6.08032 ms
[08/12/2024-09:31:29] [I] H2D Latency: min = 0.0135498 ms, max = 0.0247803 ms, mean = 0.0151016 ms, median = 0.0150146 ms, percentile(90%) = 0.0161133 ms, percentile(95%) = 0.0164795 ms, percentile(99%) = 0.017334 ms
[08/12/2024-09:31:29] [I] GPU Compute Time: min = 3.40479 ms, max = 8.42749 ms, mean = 3.89944 ms, median = 3.58398 ms, percentile(90%) = 4.93066 ms, percentile(95%) = 5.40771 ms, percentile(99%) = 6.104 ms
[08/12/2024-09:31:29] [I] D2H Latency: min = 0.25 ms, max = 0.322266 ms, mean = 0.252564 ms, median = 0.251923 ms, percentile(90%) = 0.2547 ms, percentile(95%) = 0.256042 ms, percentile(99%) = 0.269775 ms
[08/12/2024-09:31:29] [I] Total Host Walltime: 3.01314 s
[08/12/2024-09:31:29] [I] Total GPU Compute Time: 3.00647 s
- Relevant logs for the FP16 run:
[08/12/2024-09:32:37] [I] [TRT] Loaded engine size: 532 MiB
[08/12/2024-09:32:37] [V] [TRT] Deserialization required 85787 microseconds.
[08/12/2024-09:32:37] [I] Engine deserialized in 0.99653 sec.
[08/12/2024-09:32:37] [V] [TRT] Total per-runner device persistent memory is 0
[08/12/2024-09:32:37] [V] [TRT] Total per-runner host persistent memory is 32
[08/12/2024-09:32:37] [V] [TRT] Allocated device scratch memory of size 806421504
[08/12/2024-09:32:37] [V] [TRT] - Runner scratch: 806421504 bytes
[08/12/2024-09:32:37] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +769, now: CPU 0, GPU 1299 (MiB)
[08/12/2024-09:32:37] [I] Setting persistentCacheLimit to 0 bytes.
[08/12/2024-09:32:37] [I] Set shape of input tensor input_ids to: 16x128
[08/12/2024-09:32:37] [I] Set shape of input tensor attention_mask to: 16x128
[08/12/2024-09:32:37] [I] Created execution context with device memory size: 769.063 MiB
[08/12/2024-09:32:41] [I] === Performance summary ===
[08/12/2024-09:32:41] [I] Throughput: 295.562 qps
[08/12/2024-09:32:41] [I] Latency: min = 3.19414 ms, max = 7.85419 ms, mean = 3.64302 ms, median = 3.31763 ms, percentile(90%) = 4.37964 ms, percentile(95%) = 4.95868 ms, percentile(99%) = 6.14673 ms
[08/12/2024-09:32:41] [I] Enqueue Time: min = 1.55664 ms, max = 7.58481 ms, mean = 3.34116 ms, median = 3.0379 ms, percentile(90%) = 4.06354 ms, percentile(95%) = 4.68469 ms, percentile(99%) = 5.7832 ms
[08/12/2024-09:32:41] [I] H2D Latency: min = 0.0131836 ms, max = 0.0334473 ms, mean = 0.0149584 ms, median = 0.0148315 ms, percentile(90%) = 0.0158691 ms, percentile(95%) = 0.0163574 ms, percentile(99%) = 0.0179443 ms
[08/12/2024-09:32:41] [I] GPU Compute Time: min = 2.92659 ms, max = 7.58682 ms, mean = 3.37506 ms, median = 3.05151 ms, percentile(90%) = 4.11133 ms, percentile(95%) = 4.68988 ms, percentile(99%) = 5.87964 ms
[08/12/2024-09:32:41] [I] D2H Latency: min = 0.249268 ms, max = 0.317627 ms, mean = 0.253002 ms, median = 0.25238 ms, percentile(90%) = 0.254822 ms, percentile(95%) = 0.255615 ms, percentile(99%) = 0.270508 ms
[08/12/2024-09:32:41] [I] Total Host Walltime: 3.00783 s
[08/12/2024-09:32:41] [I] Total GPU Compute Time: 3.00043 s
[08/12/2024-09:32:41] [I] Explanations of the performance metrics are printed in the verbose logs.
Do you have any idea what might be going wrong here?
XLMRobertaModel(
(embeddings): XLMRobertaEmbeddings(
(word_embeddings): Embedding(250002, 768, padding_idx=1)
(position_embeddings): Embedding(514, 768, padding_idx=1)
(token_type_embeddings): Embedding(1, 768)
(LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
(encoder): XLMRobertaEncoder(
(layer): ModuleList(
(0): XLMRobertaLayer(
(attention): XLMRobertaAttention(
(self): XLMRobertaSelfAttention(
(query): QuantLinear(
in_features=768, out_features=768, bias=True
(input_quantizer): TensorQuantizer(8 bit fake per-tensor amax=1.0000 pre_quant_scale calibrator=MaxCalibrator quant)
(output_quantizer): TensorQuantizer(disabled)
(weight_quantizer): TensorQuantizer(8 bit fake axis=0 amax=[0.1968, 1.3766](768) calibrator=MaxCalibrator quant)
)
(key): QuantLinear(
in_features=768, out_features=768, bias=True
(input_quantizer): TensorQuantizer(8 bit fake per-tensor amax=1.0000 pre_quant_scale calibrator=MaxCalibrator quant)
(output_quantizer): TensorQuantizer(disabled)
(weight_quantizer): TensorQuantizer(8 bit fake axis=0 amax=[0.2104, 2.3946](768) calibrator=MaxCalibrator quant)
)
(value): QuantLinear(
in_features=768, out_features=768, bias=True
(input_quantizer): TensorQuantizer(8 bit fake per-tensor amax=1.0000 pre_quant_scale calibrator=MaxCalibrator quant)
(output_quantizer): TensorQuantizer(disabled)
(weight_quantizer): TensorQuantizer(8 bit fake axis=0 amax=[0.0462, 0.7225](768) calibrator=MaxCalibrator quant)
)
(dropout): Dropout(p=0.1, inplace=False)
)
(output): XLMRobertaSelfOutput(
(dense): QuantLinear(
in_features=768, out_features=768, bias=True
(input_quantizer): TensorQuantizer(8 bit fake per-tensor amax=1.0000 pre_quant_scale calibrator=MaxCalibrator quant)
(output_quantizer): TensorQuantizer(disabled)
(weight_quantizer): TensorQuantizer(8 bit fake axis=0 amax=[0.0195, 1.9462](768) calibrator=MaxCalibrator quant)
)
(LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
)
(intermediate): XLMRobertaIntermediate(
(dense): QuantLinear(
in_features=768, out_features=3072, bias=True
(input_quantizer): TensorQuantizer(8 bit fake per-tensor amax=1.0000 pre_quant_scale calibrator=MaxCalibrator quant)
(output_quantizer): TensorQuantizer(disabled)
(weight_quantizer): TensorQuantizer(8 bit fake axis=0 amax=[0.1366, 11.6769](3072) calibrator=MaxCalibrator quant)
)
(intermediate_act_fn): GELUActivation()
)
(output): XLMRobertaOutput(
(dense): QuantLinear(
in_features=3072, out_features=768, bias=True
(input_quantizer): TensorQuantizer(8 bit fake per-tensor amax=1.0000 pre_quant_scale calibrator=MaxCalibrator quant)
(output_quantizer): TensorQuantizer(disabled)
(weight_quantizer): TensorQuantizer(8 bit fake axis=0 amax=[0.2325, 24.0686](768) calibrator=MaxCalibrator quant)
)
(LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
)
(1): XLMRobertaLayer(
(attention): XLMRobertaAttention(
(self): XLMRobertaSelfAttention(
(query): QuantLinear(
in_features=768, out_features=768, bias=True
(input_quantizer): TensorQuantizer(8 bit fake per-tensor amax=1.0000 pre_quant_scale calibrator=MaxCalibrator quant)
(output_quantizer): TensorQuantizer(disabled)
(weight_quantizer): TensorQuantizer(8 bit fake axis=0 amax=[0.3190, 5.9233](768) calibrator=MaxCalibrator quant)
)
(key): QuantLinear(
in_features=768, out_features=768, bias=True
(input_quantizer): TensorQuantizer(8 bit fake per-tensor amax=1.0000 pre_quant_scale calibrator=MaxCalibrator quant)
(output_quantizer): TensorQuantizer(disabled)
(weight_quantizer): TensorQuantizer(8 bit fake axis=0 amax=[0.2640, 5.9660](768) calibrator=MaxCalibrator quant)
)
(value): QuantLinear(
in_features=768, out_features=768, bias=True
(input_quantizer): TensorQuantizer(8 bit fake per-tensor amax=1.0000 pre_quant_scale calibrator=MaxCalibrator quant)
(output_quantizer): TensorQuantizer(disabled)
(weight_quantizer): TensorQuantizer(8 bit fake axis=0 amax=[0.0770, 2.7291](768) calibrator=MaxCalibrator quant)
)
(dropout): Dropout(p=0.1, inplace=False)
)
(output): XLMRobertaSelfOutput(
(dense): QuantLinear(
in_features=768, out_features=768, bias=True
(input_quantizer): TensorQuantizer(8 bit fake per-tensor amax=1.0000 pre_quant_scale calibrator=MaxCalibrator quant)
(output_quantizer): TensorQuantizer(disabled)
(weight_quantizer): TensorQuantizer(8 bit fake axis=0 amax=[0.0634, 0.7967](768) calibrator=MaxCalibrator quant)
)
(LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
)
(intermediate): XLMRobertaIntermediate(
(dense): QuantLinear(
in_features=768, out_features=3072, bias=True
(input_quantizer): TensorQuantizer(8 bit fake per-tensor amax=1.0000 pre_quant_scale calibrator=MaxCalibrator quant)
(output_quantizer): TensorQuantizer(disabled)
(weight_quantizer): TensorQuantizer(8 bit fake axis=0 amax=[0.1784, 17.2099](3072) calibrator=MaxCalibrator quant)
)
(intermediate_act_fn): GELUActivation()
)
(output): XLMRobertaOutput(
(dense): QuantLinear(
in_features=3072, out_features=768, bias=True
(input_quantizer): TensorQuantizer(8 bit fake per-tensor amax=1.0000 pre_quant_scale calibrator=MaxCalibrator quant)
(output_quantizer): TensorQuantizer(disabled)
(weight_quantizer): TensorQuantizer(8 bit fake axis=0 amax=[0.2270, 45.4570](768) calibrator=MaxCalibrator quant)
)
(LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
)
...
)
(pooler): XLMRobertaPooler(
(dense): QuantLinear(
in_features=768, out_features=768, bias=True
(input_quantizer): TensorQuantizer(8 bit fake per-tensor amax=1.0000 pre_quant_scale calibrator=MaxCalibrator quant)
(output_quantizer): TensorQuantizer(disabled)
(weight_quantizer): TensorQuantizer(8 bit fake axis=0 amax=[0.0380, 0.1123](768) calibrator=MaxCalibrator quant)
)
(activation): Tanh()
)
)
The model structure after being optimized by modelopt. @nvpohanh
Based on prior discussion, it seems you may instead want to try ONNX quantization. Could you try using the modelopt.onnx.quantization package to see if it resolves your issue?
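A minimal sketch of what that might look like, assuming the quantize() entry point from the ModelOpt ONNX PTQ docs (argument names and the calibration-data format should be double-checked against the installed modelopt version; paths, shapes, and values are placeholders):

import numpy as np
from modelopt.onnx.quantization import quantize

# Calibration inputs keyed by ONNX input name (placeholder shapes/values)
calib = {
    "input_ids": np.random.randint(0, 250002, size=(16, 128), dtype=np.int64),
    "attention_mask": np.ones((16, 128), dtype=np.int64),
}

quantize(
    onnx_path="model_fp32.onnx",        # placeholder: the unquantized ONNX export
    quantize_mode="int8",
    calibration_data=calib,
    output_path="model_int8_qdq.onnx",  # placeholder: Q/DQ model to feed trtexec
)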
@renne444 as per our policy, I am going to close this issue as it's older than 21 days. If you'd like to follow up, please open another issue, thank you.