INT8 Quantization Performance Issue with BERT-like Model
I am currently working on INT8 quantization for a BERT-like embedding model. In the last issue I raised, you mentioned that TensorRT does not currently support INT8 calibration for BERT-like models and suggested that I use the model_opt tool instead.
My previous issue (which includes the specific model structure): https://github.com/NVIDIA/TensorRT/issues/4058
After optimizing the ONNX model with your model_opt tool, I tested inference with trtexec and found that the INT8 engine's execution-context memory is significantly larger than the FP16 engine's, and its QPS is slightly lower. My expectation was that the INT8 engine would need less context memory and deliver higher QPS than the FP16 engine.
Is there something wrong here?
Environment
GPU type: A100
NVIDIA driver version: 525.105.17
CUDA version: 12.5
Python version: 3.10.2
TensorRT version: 10.1.0.27
Docker image: nvcr.io/nvidia/tensorrt:24.06-py3
Workflow
- First, I used the model_opt tool to convert the model loaded from Hugging Face via AutoModel into a quantized model, and then exported it to an ONNX file. The key code is as follows (a sketch of the calibration loop follows the snippet):
import torch
import modelopt.torch.quantization as mtq
from transformers import AutoModel

self.model = AutoModel.from_pretrained(self.model_path)
...
# Calibrate with the SmoothQuant INT8 config, then export the fake-quantized model
config = mtq.INT8_SMOOTHQUANT_CFG
ptq_model = mtq.quantize(self.model, config, self.forward_loop)
torch.onnx.export(
    ptq_model,
    (self.calibration_data[0]['input_ids'], self.calibration_data[0]['attention_mask']),
    onnx_path,
    input_names=['input_ids', 'attention_mask'],  # must match the tensor names in the trtexec shape flags
    dynamic_axes={'input_ids': {0: 'batch', 1: 'seq'}, 'attention_mask': {0: 'batch', 1: 'seq'}},
    opset_version=17)
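For completeness, the self.forward_loop passed to mtq.quantize just runs the model over the calibration batches so the quantizers can collect amax statistics. A minimal sketch, assuming self.calibration_data holds tokenized batches (the batching details here are illustrative):

def forward_loop(self, model):
    # ModelOpt calls this with the model under calibration; running a few
    # representative batches lets each TensorQuantizer record its statistics.
    model.eval()
    with torch.no_grad():
        for batch in self.calibration_data:
            model(input_ids=batch['input_ids'],
                  attention_mask=batch['attention_mask'])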
- Then, I used trtexec to build the engines for the INT8 model and the FP16 model, respectively (a note on inspecting layer precisions follows the commands):
# for int8 model
trtexec --onnx=${model_path} --saveEngine=${engine_path} \
--builderOptimizationLevel=4 \
--minShapes=input_ids:1x1,attention_mask:1x1 \
--optShapes=input_ids:16x128,attention_mask:16x128 \
--maxShapes=input_ids:128x512,attention_mask:128x512 \
--stronglyTyped \
--verbose
# for fp16 model
trtexec --onnx=${model_path} --saveEngine=${engine_path} \
--minShapes=input_ids:1x1,attention_mask:1x1 \
--optShapes=input_ids:16x128,attention_mask:16x128 \
--maxShapes=input_ids:128x512,attention_mask:128x512 \
--fp16 \
--verbose
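For reference, the per-layer precisions the builder actually selected can be inspected with trtexec's layer-info flags, which helps verify that the INT8 engine really runs INT8 kernels. A diagnostic sketch (paths are placeholders):

# Dump per-layer information, including the precision chosen for each layer
trtexec --loadEngine=${engine_path} \
        --profilingVerbosity=detailed \
        --dumpLayerInfo \
        --exportLayerInfo=layer_info.json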
- I noticed in the logs that the performance of the INT8 engine is significantly worse than that of the FP16 engine.
- Relevant logs for the INT8 run:
[08/12/2024-09:31:26] [I] [TRT] Loaded engine size: 819 MiB
[08/12/2024-09:31:26] [V] [TRT] Deserialization required 133245 microseconds.
[08/12/2024-09:31:26] [I] Engine deserialized in 1.17933 sec.
[08/12/2024-09:31:26] [V] [TRT] Total per-runner device persistent memory is 0
[08/12/2024-09:31:26] [V] [TRT] Total per-runner host persistent memory is 32
[08/12/2024-09:31:26] [V] [TRT] Allocated device scratch memory of size 5889851904
[08/12/2024-09:31:26] [V] [TRT] - Runner scratch: 5889851904 bytes
[08/12/2024-09:31:26] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +5617, now: CPU 0, GPU 6433 (MiB)
[08/12/2024-09:31:26] [I] Setting persistentCacheLimit to 0 bytes.
[08/12/2024-09:31:26] [I] Set shape of input tensor input_ids to: 16x128
[08/12/2024-09:31:26] [I] Set shape of input tensor attention_mask to: 16x128
[08/12/2024-09:31:26] [I] Created execution context with device memory size: 5617 MiB
[08/12/2024-09:31:29] [I] === Performance summary ===
[08/12/2024-09:31:29] [I] Throughput: 255.879 qps
[08/12/2024-09:31:29] [I] Latency: min = 3.67041 ms, max = 8.69336 ms, mean = 4.1671 ms, median = 3.85425 ms, percentile(90%) = 5.19507 ms, percentile(95%) = 5.67383 ms, percentile(99%) = 6.3772 ms
[08/12/2024-09:31:29] [I] Enqueue Time: min = 2.13928 ms, max = 8.49683 ms, mean = 3.8633 ms, median = 3.54565 ms, percentile(90%) = 4.89307 ms, percentile(95%) = 5.32812 ms, percentile(99%) = 6.08032 ms
[08/12/2024-09:31:29] [I] H2D Latency: min = 0.0135498 ms, max = 0.0247803 ms, mean = 0.0151016 ms, median = 0.0150146 ms, percentile(90%) = 0.0161133 ms, percentile(95%) = 0.0164795 ms, percentile(99%) = 0.017334 ms
[08/12/2024-09:31:29] [I] GPU Compute Time: min = 3.40479 ms, max = 8.42749 ms, mean = 3.89944 ms, median = 3.58398 ms, percentile(90%) = 4.93066 ms, percentile(95%) = 5.40771 ms, percentile(99%) = 6.104 ms
[08/12/2024-09:31:29] [I] D2H Latency: min = 0.25 ms, max = 0.322266 ms, mean = 0.252564 ms, median = 0.251923 ms, percentile(90%) = 0.2547 ms, percentile(95%) = 0.256042 ms, percentile(99%) = 0.269775 ms
[08/12/2024-09:31:29] [I] Total Host Walltime: 3.01314 s
[08/12/2024-09:31:29] [I] Total GPU Compute Time: 3.00647 s
- Relevant logs for the FP16 run:
[08/12/2024-09:32:37] [I] [TRT] Loaded engine size: 532 MiB
[08/12/2024-09:32:37] [V] [TRT] Deserialization required 85787 microseconds.
[08/12/2024-09:32:37] [I] Engine deserialized in 0.99653 sec.
[08/12/2024-09:32:37] [V] [TRT] Total per-runner device persistent memory is 0
[08/12/2024-09:32:37] [V] [TRT] Total per-runner host persistent memory is 32
[08/12/2024-09:32:37] [V] [TRT] Allocated device scratch memory of size 806421504
[08/12/2024-09:32:37] [V] [TRT] - Runner scratch: 806421504 bytes
[08/12/2024-09:32:37] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +769, now: CPU 0, GPU 1299 (MiB)
[08/12/2024-09:32:37] [I] Setting persistentCacheLimit to 0 bytes.
[08/12/2024-09:32:37] [I] Set shape of input tensor input_ids to: 16x128
[08/12/2024-09:32:37] [I] Set shape of input tensor attention_mask to: 16x128
[08/12/2024-09:32:37] [I] Created execution context with device memory size: 769.063 MiB
[08/12/2024-09:32:41] [I] === Performance summary ===
[08/12/2024-09:32:41] [I] Throughput: 295.562 qps
[08/12/2024-09:32:41] [I] Latency: min = 3.19414 ms, max = 7.85419 ms, mean = 3.64302 ms, median = 3.31763 ms, percentile(90%) = 4.37964 ms, percentile(95%) = 4.95868 ms, percentile(99%) = 6.14673 ms
[08/12/2024-09:32:41] [I] Enqueue Time: min = 1.55664 ms, max = 7.58481 ms, mean = 3.34116 ms, median = 3.0379 ms, percentile(90%) = 4.06354 ms, percentile(95%) = 4.68469 ms, percentile(99%) = 5.7832 ms
[08/12/2024-09:32:41] [I] H2D Latency: min = 0.0131836 ms, max = 0.0334473 ms, mean = 0.0149584 ms, median = 0.0148315 ms, percentile(90%) = 0.0158691 ms, percentile(95%) = 0.0163574 ms, percentile(99%) = 0.0179443 ms
[08/12/2024-09:32:41] [I] GPU Compute Time: min = 2.92659 ms, max = 7.58682 ms, mean = 3.37506 ms, median = 3.05151 ms, percentile(90%) = 4.11133 ms, percentile(95%) = 4.68988 ms, percentile(99%) = 5.87964 ms
[08/12/2024-09:32:41] [I] D2H Latency: min = 0.249268 ms, max = 0.317627 ms, mean = 0.253002 ms, median = 0.25238 ms, percentile(90%) = 0.254822 ms, percentile(95%) = 0.255615 ms, percentile(99%) = 0.270508 ms
[08/12/2024-09:32:41] [I] Total Host Walltime: 3.00783 s
[08/12/2024-09:32:41] [I] Total GPU Compute Time: 3.00043 s
[08/12/2024-09:32:41] [I] Explanations of the performance metrics are printed in the verbose logs.
Do you have any idea what might be going wrong here?
XLMRobertaModel(
(embeddings): XLMRobertaEmbeddings(
(word_embeddings): Embedding(250002, 768, padding_idx=1)
(position_embeddings): Embedding(514, 768, padding_idx=1)
(token_type_embeddings): Embedding(1, 768)
(LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
(encoder): XLMRobertaEncoder(
(layer): ModuleList(
(0): XLMRobertaLayer(
(attention): XLMRobertaAttention(
(self): XLMRobertaSelfAttention(
(query): QuantLinear(
in_features=768, out_features=768, bias=True
(input_quantizer): TensorQuantizer(8 bit fake per-tensor amax=1.0000 pre_quant_scale calibrator=MaxCalibrator quant)
(output_quantizer): TensorQuantizer(disabled)
(weight_quantizer): TensorQuantizer(8 bit fake axis=0 amax=[0.1968, 1.3766](768) calibrator=MaxCalibrator quant)
)
(key): QuantLinear(
in_features=768, out_features=768, bias=True
(input_quantizer): TensorQuantizer(8 bit fake per-tensor amax=1.0000 pre_quant_scale calibrator=MaxCalibrator quant)
(output_quantizer): TensorQuantizer(disabled)
(weight_quantizer): TensorQuantizer(8 bit fake axis=0 amax=[0.2104, 2.3946](768) calibrator=MaxCalibrator quant)
)
(value): QuantLinear(
in_features=768, out_features=768, bias=True
(input_quantizer): TensorQuantizer(8 bit fake per-tensor amax=1.0000 pre_quant_scale calibrator=MaxCalibrator quant)
(output_quantizer): TensorQuantizer(disabled)
(weight_quantizer): TensorQuantizer(8 bit fake axis=0 amax=[0.0462, 0.7225](768) calibrator=MaxCalibrator quant)
)
(dropout): Dropout(p=0.1, inplace=False)
)
(output): XLMRobertaSelfOutput(
(dense): QuantLinear(
in_features=768, out_features=768, bias=True
(input_quantizer): TensorQuantizer(8 bit fake per-tensor amax=1.0000 pre_quant_scale calibrator=MaxCalibrator quant)
(output_quantizer): TensorQuantizer(disabled)
(weight_quantizer): TensorQuantizer(8 bit fake axis=0 amax=[0.0195, 1.9462](768) calibrator=MaxCalibrator quant)
)
(LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
)
(intermediate): XLMRobertaIntermediate(
(dense): QuantLinear(
in_features=768, out_features=3072, bias=True
(input_quantizer): TensorQuantizer(8 bit fake per-tensor amax=1.0000 pre_quant_scale calibrator=MaxCalibrator quant)
(output_quantizer): TensorQuantizer(disabled)
(weight_quantizer): TensorQuantizer(8 bit fake axis=0 amax=[0.1366, 11.6769](3072) calibrator=MaxCalibrator quant)
)
(intermediate_act_fn): GELUActivation()
)
(output): XLMRobertaOutput(
(dense): QuantLinear(
in_features=3072, out_features=768, bias=True
(input_quantizer): TensorQuantizer(8 bit fake per-tensor amax=1.0000 pre_quant_scale calibrator=MaxCalibrator quant)
(output_quantizer): TensorQuantizer(disabled)
(weight_quantizer): TensorQuantizer(8 bit fake axis=0 amax=[0.2325, 24.0686](768) calibrator=MaxCalibrator quant)
)
(LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
)
(1): XLMRobertaLayer(
(attention): XLMRobertaAttention(
(self): XLMRobertaSelfAttention(
(query): QuantLinear(
in_features=768, out_features=768, bias=True
(input_quantizer): TensorQuantizer(8 bit fake per-tensor amax=1.0000 pre_quant_scale calibrator=MaxCalibrator quant)
(output_quantizer): TensorQuantizer(disabled)
(weight_quantizer): TensorQuantizer(8 bit fake axis=0 amax=[0.3190, 5.9233](768) calibrator=MaxCalibrator quant)
)
(key): QuantLinear(
in_features=768, out_features=768, bias=True
(input_quantizer): TensorQuantizer(8 bit fake per-tensor amax=1.0000 pre_quant_scale calibrator=MaxCalibrator quant)
(output_quantizer): TensorQuantizer(disabled)
(weight_quantizer): TensorQuantizer(8 bit fake axis=0 amax=[0.2640, 5.9660](768) calibrator=MaxCalibrator quant)
)
(value): QuantLinear(
in_features=768, out_features=768, bias=True
(input_quantizer): TensorQuantizer(8 bit fake per-tensor amax=1.0000 pre_quant_scale calibrator=MaxCalibrator quant)
(output_quantizer): TensorQuantizer(disabled)
(weight_quantizer): TensorQuantizer(8 bit fake axis=0 amax=[0.0770, 2.7291](768) calibrator=MaxCalibrator quant)
)
(dropout): Dropout(p=0.1, inplace=False)
)
(output): XLMRobertaSelfOutput(
(dense): QuantLinear(
in_features=768, out_features=768, bias=True
(input_quantizer): TensorQuantizer(8 bit fake per-tensor amax=1.0000 pre_quant_scale calibrator=MaxCalibrator quant)
(output_quantizer): TensorQuantizer(disabled)
(weight_quantizer): TensorQuantizer(8 bit fake axis=0 amax=[0.0634, 0.7967](768) calibrator=MaxCalibrator quant)
)
(LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
)
(intermediate): XLMRobertaIntermediate(
(dense): QuantLinear(
in_features=768, out_features=3072, bias=True
(input_quantizer): TensorQuantizer(8 bit fake per-tensor amax=1.0000 pre_quant_scale calibrator=MaxCalibrator quant)
(output_quantizer): TensorQuantizer(disabled)
(weight_quantizer): TensorQuantizer(8 bit fake axis=0 amax=[0.1784, 17.2099](3072) calibrator=MaxCalibrator quant)
)
(intermediate_act_fn): GELUActivation()
)
(output): XLMRobertaOutput(
(dense): QuantLinear(
in_features=3072, out_features=768, bias=True
(input_quantizer): TensorQuantizer(8 bit fake per-tensor amax=1.0000 pre_quant_scale calibrator=MaxCalibrator quant)
(output_quantizer): TensorQuantizer(disabled)
(weight_quantizer): TensorQuantizer(8 bit fake axis=0 amax=[0.2270, 45.4570](768) calibrator=MaxCalibrator quant)
)
(LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
)
...
)
(pooler): XLMRobertaPooler(
(dense): QuantLinear(
in_features=768, out_features=768, bias=True
(input_quantizer): TensorQuantizer(8 bit fake per-tensor amax=1.0000 pre_quant_scale calibrator=MaxCalibrator quant)
(output_quantizer): TensorQuantizer(disabled)
(weight_quantizer): TensorQuantizer(8 bit fake axis=0 amax=[0.0380, 0.1123](768) calibrator=MaxCalibrator quant)
)
(activation): Tanh()
)
)
The model structure after being optimized by modelopt. @nvpohanh
Based on prior discussion, it seems you may instead want to try ONNX quantization. Could you try using the modelopt.onnx.quantization package to see if it resolves your issue?
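A minimal sketch of what that might look like, assuming the quantize() entry point from the ModelOpt ONNX PTQ docs (argument names and the calibration-data format should be double-checked against the installed modelopt version; paths, shapes, and values are placeholders):

import numpy as np
from modelopt.onnx.quantization import quantize

# Calibration inputs keyed by ONNX input name (placeholder shapes/values)
calib = {
    "input_ids": np.random.randint(0, 250002, size=(16, 128), dtype=np.int64),
    "attention_mask": np.ones((16, 128), dtype=np.int64),
}

quantize(
    onnx_path="model_fp32.onnx",        # placeholder: the unquantized ONNX export
    quantize_mode="int8",
    calibration_data=calib,
    output_path="model_int8_qdq.onnx",  # placeholder: Q/DQ model to feed trtexec
)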
@renne444 as per our policy, I am going to close this issue as it's older than 21 days. If you'd like to follow up, please open another issue, thank you.