TensorRT fp16 onnx -> fp16 tensorrt mismatched outputs

Description

The outputs of a TensorRT engine built from the fp16 ONNX model differ from the outputs of the same fp16 ONNX model run with ONNX Runtime.

Environment

TensorRT Version: 8.4.1.5
NVIDIA GPU: RTX 3090
NVIDIA Driver Version: 510.60.02
CUDA Version: 11.6
CUDNN Version:
Operating System: Ubuntu 20.04
Python Version (if applicable): 3.8
Tensorflow Version (if applicable):
PyTorch Version (if applicable): 1.12.1
Baremetal or Container (if so, version): nvcr.io/nvidia/tensorrt:22.07-py3

Relevant Files

https://drive.google.com/drive/folders/1EPETChB3jFtWrdLDEZYUe296ZzOMPWxP?usp=sharing

Steps To Reproduce

convert model to onnx

import torch

# Model is the user's own nn.Module subclass; its definition is not shown here.
model = Model()
device = torch.device('cuda:0')
model.load_state_dict(torch.load('model.pt'))
model = model.to(device)
model.eval()

fp = 16  # set to 32 to export the fp32 model

if fp == 16:
    model = model.half()
with torch.no_grad():
    batch_size = 1
    seq_length = 128

    # Dummy int32 inputs, used only to trace the graph.
    input_ids = torch.ones(batch_size, seq_length).type(torch.int32).to(device)
    attention_mask = torch.zeros(batch_size, seq_length).type(torch.int32).to(device)
    rel_masks = torch.zeros(batch_size, seq_length).type(torch.int32).to(device)

    # Export the model loaded above to model_16.onnx or model_32.onnx.
    torch.onnx.export(model, (input_ids, attention_mask, rel_masks), f'model_{fp}.onnx',
                      input_names=['input_ids', 'attention_mask', 'rel_mask'], output_names=['outputs'],
                      export_params=True)

fp32 onnx to fp32 tensorrt

polygraphy run model_32.onnx --trt --onnxrt --save-engine=model_32.plan --pool-limit workspace:1G

[I] trt-runner-N0-09/19/22-05:25:05     | Activating and starting inference
[09/19/2022-05:25:07] [TRT] [W] parsers/onnx/onnx2trt_utils.cpp:367: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[I]     Configuring with profiles: [Profile().add('input_ids', min=[1, 128], opt=[1, 128], max=[1, 128]).add('attention_mask', min=[1, 128], opt=[1, 128], max=[1, 128]).add('rel_mask', min=[1, 128], opt=[1, 128], max=[1, 128])]
[I] Building engine with configuration:
    Workspace            | 1073741824 bytes (1024.00 MiB)
    Precision            | TF32: False, FP16: False, INT8: False, Obey Precision Constraints: False, Strict Types: False
    Tactic Sources       | ['CUBLAS', 'CUBLAS_LT', 'CUDNN', 'EDGE_MASK_CONVOLUTIONS']
    Safety Restricted    | False
    Refittable           | False
    Profiles             | 1 profile(s)
[09/19/2022-05:25:13] [TRT] [W] The getMaxBatchSize() function should not be used with an engine built from a network created with NetworkDefinitionCreationFlag::kEXPLICIT_BATCH flag. This function will always return 1.
[09/19/2022-05:25:13] [TRT] [W] The getMaxBatchSize() function should not be used with an engine built from a network created with NetworkDefinitionCreationFlag::kEXPLICIT_BATCH flag. This function will always return 1.
[I] Finished engine building in 5.797 seconds
[09/19/2022-05:25:13] [TRT] [W] The getMaxBatchSize() function should not be used with an engine built from a network created with NetworkDefinitionCreationFlag::kEXPLICIT_BATCH flag. This function will always return 1.
[09/19/2022-05:25:13] [TRT] [W] The getMaxBatchSize() function should not be used with an engine built from a network created with NetworkDefinitionCreationFlag::kEXPLICIT_BATCH flag. This function will always return 1.
[I] Saving engine to model_32.plan
[I] trt-runner-N0-09/19/22-05:25:05
    ---- Inference Input(s) ----
    {input_ids [dtype=int32, shape=(1, 128)],
     attention_mask [dtype=int32, shape=(1, 128)],
     rel_mask [dtype=int32, shape=(1, 128)]}
[I] trt-runner-N0-09/19/22-05:25:05
    ---- Inference Output(s) ----
    {outputs [dtype=float32, shape=(1, 128, 29)]}
[I] trt-runner-N0-09/19/22-05:25:05     | Completed 1 iteration(s) in 2.929 ms | Average inference time: 2.929 ms.
[I] onnxrt-runner-N0-09/19/22-05:25:05  | Activating and starting inference
[I] Creating ONNX-Runtime Inference Session with providers: ['CPUExecutionProvider']
[I] onnxrt-runner-N0-09/19/22-05:25:05
    ---- Inference Input(s) ----
    {input_ids [dtype=int32, shape=(1, 128)],
     attention_mask [dtype=int32, shape=(1, 128)],
     rel_mask [dtype=int32, shape=(1, 128)]}
[I] onnxrt-runner-N0-09/19/22-05:25:05
    ---- Inference Output(s) ----
    {outputs [dtype=float32, shape=(1, 128, 29)]}
[I] onnxrt-runner-N0-09/19/22-05:25:05  | Completed 1 iteration(s) in 42.64 ms | Average inference time: 42.64 ms.
[I] Accuracy Comparison | trt-runner-N0-09/19/22-05:25:05 vs. onnxrt-runner-N0-09/19/22-05:25:05
[I]     Comparing Output: 'outputs' (dtype=float32, shape=(1, 128, 29)) with 'outputs' (dtype=float32, shape=(1, 128, 29))
[I]     Tolerance: [abs=1e-05, rel=1e-05] | Checking elemwise error
[I]         trt-runner-N0-09/19/22-05:25:05: outputs | Stats: mean=-11.739, std-dev=20.581, var=423.57, median=-12.878, min=-51.211 at (0, 54, 21), max=44.076 at (0, 54, 0), avg-magnitude=19.056
[I]             ---- Histogram ----
                Bin Range      |  Num Elems | Visualization
                (-51.2, -41.7) |        258 | ############
                (-41.7, -32.2) |        254 | ############
                (-32.2, -22.6) |        626 | ###############################
                (-22.6, -13.1) |        615 | ##############################
                (-13.1, -3.57) |        805 | ########################################
                (-3.57, 5.96 ) |        504 | #########################
                (5.96 , 15.5 ) |        394 | ###################
                (15.5 , 25   ) |        128 | ######
                (25   , 34.5 ) |          0 |
                (34.5 , 44.1 ) |        128 | ######
[I]         onnxrt-runner-N0-09/19/22-05:25:05: outputs | Stats: mean=-11.739, std-dev=20.581, var=423.57, median=-12.878, min=-51.211 at (0, 54, 21), max=44.076 at (0, 54, 0), avg-magnitude=19.056
[I]             ---- Histogram ----
                Bin Range      |  Num Elems | Visualization
                (-51.2, -41.7) |        258 | ############
                (-41.7, -32.2) |        254 | ############
                (-32.2, -22.6) |        626 | ###############################
                (-22.6, -13.1) |        615 | ##############################
                (-13.1, -3.57) |        805 | ########################################
                (-3.57, 5.96 ) |        504 | #########################
                (5.96 , 15.5 ) |        394 | ###################
                (15.5 , 25   ) |        128 | ######
                (25   , 34.5 ) |          0 |
                (34.5 , 44.1 ) |        128 | ######
[I]         Error Metrics: outputs
[I]             Minimum Required Tolerance: elemwise error | [abs=0.00041962] OR [rel=0.0037739] (requirements may be lower if both abs/rel tolerances are set)
[I]             Absolute Difference | Stats: mean=0.00011682, std-dev=9.2588e-05, var=8.5725e-09, median=0.000103, min=0 at (0, 0, 9), max=0.00041962 at (0, 5, 0), avg-magnitude=0.00011682
[I]                 ---- Histogram ----
                    Bin Range            |  Num Elems | Visualization
                    (0       , 4.2e-05 ) |       1018 | ########################################
                    (4.2e-05 , 8.39e-05) |        380 | ##############
                    (8.39e-05, 0.000126) |        848 | #################################
                    (0.000126, 0.000168) |        573 | ######################
                    (0.000168, 0.00021 ) |        408 | ################
                    (0.00021 , 0.000252) |        133 | #####
                    (0.000252, 0.000294) |        224 | ########
                    (0.000294, 0.000336) |          0 |
                    (0.000336, 0.000378) |          1 |
                    (0.000378, 0.00042 ) |        127 | ####
[I]             Relative Difference | Stats: mean=1.3195e-05, std-dev=6.5127e-05, var=4.2415e-09, median=5.3258e-06, min=0 at (0, 0, 9), max=0.0037739 at (0, 104, 23), avg-magnitude=1.3195e-05
[I]                 ---- Histogram ----
                    Bin Range            |  Num Elems | Visualization
                    (0       , 0.000377) |       3710 | ########################################
                    (0.000377, 0.000755) |          1 |
                    (0.000755, 0.00113 ) |          0 |
                    (0.00113 , 0.00151 ) |          0 |
                    (0.00151 , 0.00189 ) |          0 |
                    (0.00189 , 0.00226 ) |          0 |
                    (0.00226 , 0.00264 ) |          0 |
                    (0.00264 , 0.00302 ) |          0 |
                    (0.00302 , 0.0034  ) |          0 |
                    (0.0034  , 0.00377 ) |          1 |
[E]         FAILED | Difference exceeds tolerance (rel=1e-05, abs=1e-05)
[E]     FAILED | Mismatched outputs: ['outputs']
[!] FAILED | Command: /usr/local/bin/polygraphy run model_32.onnx --trt --onnxrt --save-engine=model_32.plan --pool-limit workspace:1G

fp16 onnx to fp16 tensorrt

polygraphy run model_16.onnx --trt --onnxrt --save-engine=model_16.plan --fp16 --pool-limit workspace:1G

[I] trt-runner-N0-09/19/22-06:26:52     | Activating and starting inference
[09/19/2022-06:26:54] [TRT] [W] parsers/onnx/onnx2trt_utils.cpp:367: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[I]     Configuring with profiles: [Profile().add('input_ids', min=[1, 128], opt=[1, 128], max=[1, 128]).add('attention_mask', min=[1, 128], opt=[1, 128], max=[1, 128]).add('rel_mask', min=[1, 128], opt=[1, 128], max=[1, 128])]
[I] Building engine with configuration:
    Workspace            | 1073741824 bytes (1024.00 MiB)
    Precision            | TF32: False, FP16: True, INT8: False, Obey Precision Constraints: False, Strict Types: False
    Tactic Sources       | ['CUBLAS', 'CUBLAS_LT', 'CUDNN', 'EDGE_MASK_CONVOLUTIONS']
    Safety Restricted    | False
    Refittable           | False
    Profiles             | 1 profile(s)
[09/19/2022-06:27:00] [TRT] [W] Weights [name=bert.embeddings.token_type_embeddings.weight] had the following issues when converted to FP16:
[09/19/2022-06:27:00] [TRT] [W]  - Subnormal FP16 values detected.
[09/19/2022-06:27:00] [TRT] [W] If this is not the desired behavior, please modify the weights or retrain with regularization to reduce the magnitude of the weights.
[09/19/2022-06:27:00] [TRT] [W] Weights [name=bert.embeddings.word_embeddings.weight] had the following issues when converted to FP16:
[09/19/2022-06:27:00] [TRT] [W]  - Subnormal FP16 values detected.
[09/19/2022-06:27:00] [TRT] [W] If this is not the desired behavior, please modify the weights or retrain with regularization to reduce the magnitude of the weights.
[09/19/2022-06:27:00] [TRT] [W] Weights [name=bert.embeddings.position_embeddings.weight] had the following issues when converted to FP16:
...
similar warning
...
[09/19/2022-06:27:00] [TRT] [W]  - Subnormal FP16 values detected.
[09/19/2022-06:27:00] [TRT] [W] If this is not the desired behavior, please modify the weights or retrain with regularization to reduce the magnitude of the weights.
[09/19/2022-06:27:00] [TRT] [W] Weights [name=classifier.bias + (Unnamed Layer* 1318) [Shuffle]] had the following issues when converted to FP16:
[09/19/2022-06:27:00] [TRT] [W]  - Subnormal FP16 values detected.
[09/19/2022-06:27:00] [TRT] [W] If this is not the desired behavior, please modify the weights or retrain with regularization to reduce the magnitude of the weights.
[09/19/2022-06:27:48] [TRT] [W] The getMaxBatchSize() function should not be used with an engine built from a network created with NetworkDefinitionCreationFlag::kEXPLICIT_BATCH flag. This function will always return 1.
[09/19/2022-06:27:48] [TRT] [W] The getMaxBatchSize() function should not be used with an engine built from a network created with NetworkDefinitionCreationFlag::kEXPLICIT_BATCH flag. This function will always return 1.
[I] Finished engine building in 54.244 seconds
[09/19/2022-06:27:49] [TRT] [W] The getMaxBatchSize() function should not be used with an engine built from a network created with NetworkDefinitionCreationFlag::kEXPLICIT_BATCH flag. This function will always return 1.
[09/19/2022-06:27:49] [TRT] [W] The getMaxBatchSize() function should not be used with an engine built from a network created with NetworkDefinitionCreationFlag::kEXPLICIT_BATCH flag. This function will always return 1.
[I] Saving engine to model_16.plan
[I] trt-runner-N0-09/19/22-06:26:52
    ---- Inference Input(s) ----
    {input_ids [dtype=int32, shape=(1, 128)],
     attention_mask [dtype=int32, shape=(1, 128)],
     rel_mask [dtype=int32, shape=(1, 128)]}
[I] trt-runner-N0-09/19/22-06:26:52
    ---- Inference Output(s) ----
    {outputs [dtype=float16, shape=(1, 128, 29)]}
[I] trt-runner-N0-09/19/22-06:26:52     | Completed 1 iteration(s) in 1.703 ms | Average inference time: 1.703 ms.
[I] onnxrt-runner-N0-09/19/22-06:26:52  | Activating and starting inference
[I] Creating ONNX-Runtime Inference Session with providers: ['CPUExecutionProvider']
[I] onnxrt-runner-N0-09/19/22-06:26:52
    ---- Inference Input(s) ----
    {input_ids [dtype=int32, shape=(1, 128)],
     attention_mask [dtype=int32, shape=(1, 128)],
     rel_mask [dtype=int32, shape=(1, 128)]}
[I] onnxrt-runner-N0-09/19/22-06:26:52
    ---- Inference Output(s) ----
    {outputs [dtype=float16, shape=(1, 128, 29)]}
[I] onnxrt-runner-N0-09/19/22-06:26:52  | Completed 1 iteration(s) in 197 ms | Average inference time: 197 ms.
[I] Accuracy Comparison | trt-runner-N0-09/19/22-06:26:52 vs. onnxrt-runner-N0-09/19/22-06:26:52
[I]     Comparing Output: 'outputs' (dtype=float16, shape=(1, 128, 29)) with 'outputs' (dtype=float16, shape=(1, 128, 29))
[I]     Tolerance: [abs=1e-05, rel=1e-05] | Checking elemwise error
[I]         trt-runner-N0-09/19/22-06:26:52: outputs | Stats: mean=-9.4609, std-dev=inf, var=inf, median=-6.9414, min=-42.312 at (0, 31, 21), max=33.781 at (0, 0, 0), avg-magnitude=13.914
[I]             ---- Histogram ----
                Bin Range      |  Num Elems | Visualization
                (-51.2, -41.7) |        119 | ####
                (-41.7, -32.2) |        265 | ##########
                (-32.2, -22.6) |        256 | ##########
                (-22.6, -13.1) |        768 | ##############################
                (-13.1, -3.58) |        899 | ###################################
                (-3.58, 5.95 ) |       1021 | ########################################
                (5.95 , 15.5 ) |        128 | #####
                (15.5 , 25   ) |        128 | #####
                (25   , 34.5 ) |        128 | #####
                (34.5 , 44   ) |          0 |
[I]         onnxrt-runner-N0-09/19/22-06:26:52: outputs | Stats: mean=-11.734, std-dev=inf, var=inf, median=-12.859, min=-51.188 at (0, 1, 21), max=44.031 at (0, 1, 0), avg-magnitude=19.047
[I]             ---- Histogram ----
                Bin Range      |  Num Elems | Visualization
                (-51.2, -41.7) |        258 | ############
                (-41.7, -32.2) |        254 | ############
                (-32.2, -22.6) |        626 | ###############################
                (-22.6, -13.1) |        614 | ##############################
                (-13.1, -3.58) |        806 | ########################################
                (-3.58, 5.95 ) |        504 | #########################
                (5.95 , 15.5 ) |        394 | ###################
                (15.5 , 25   ) |        128 | ######
                (25   , 34.5 ) |          0 |
                (34.5 , 44   ) |        128 | ######
[I]         Error Metrics: outputs
[I]             Minimum Required Tolerance: elemwise error | [abs=16.188] OR [rel=15.445] (requirements may be lower if both abs/rel tolerances are set)
[I]             Absolute Difference | Stats: mean=6.1875, std-dev=4.0547, var=16.438, median=5.6992, min=0 at (0, 26, 22), max=16.188 at (0, 91, 19), avg-magnitude=6.1875
[I]                 ---- Histogram ----
                    Bin Range    |  Num Elems | Visualization
                    (0   , 1.62) |        627 | ######################################
                    (1.62, 3.24) |        400 | ########################
                    (3.24, 4.86) |        656 | ########################################
                    (4.86, 6.48) |        325 | ###################
                    (6.48, 8.09) |        463 | ############################
                    (8.09, 9.71) |        508 | ##############################
                    (9.71, 11.3) |        404 | ########################
                    (11.3, 13  ) |         74 | ####
                    (13  , 14.6) |        127 | #######
                    (14.6, 16.2) |        128 | #######
[I]             Relative Difference | Stats: mean=0.52832, std-dev=0.65332, var=0.427, median=0.32251, min=0 at (0, 26, 22), max=15.445 at (0, 104, 23), avg-magnitude=0.52832
[I]                 ---- Histogram ----
                    Bin Range    |  Num Elems | Visualization
                    (0   , 1.54) |       3355 | ########################################
                    (1.54, 3.09) |        349 | ####
                    (3.09, 4.63) |          2 |
                    (4.63, 6.18) |          2 |
                    (6.18, 7.72) |          1 |
                    (7.72, 9.27) |          1 |
                    (9.27, 10.8) |          0 |
                    (10.8, 12.4) |          0 |
                    (12.4, 13.9) |          0 |
                    (13.9, 15.4) |          2 |
[E]         FAILED | Difference exceeds tolerance (rel=1e-05, abs=1e-05)
[E]     FAILED | Mismatched outputs: ['outputs']
[!] FAILED | Command: /usr/local/bin/polygraphy run model_16.onnx --trt --onnxrt --save-engine=model_16.plan --fp16 --pool-limit workspace:1G

fp32 onnx to fp16 tensorrt

polygraphy run model_32.onnx --trt --onnxrt --save-engine=model_32_to_16.plan --fp16 --pool-limit workspace:1G

[I] trt-runner-N0-09/19/22-05:12:58     | Activating and starting inference
[09/19/2022-05:13:00] [TRT] [W] parsers/onnx/onnx2trt_utils.cpp:367: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[I]     Configuring with profiles: [Profile().add('input_ids', min=[1, 128], opt=[1, 128], max=[1, 128]).add('attention_mask', min=[1, 128], opt=[1, 128], max=[1, 128]).add('rel_mask', min=[1, 128], opt=[1, 128], max=[1, 128])]
[I] Building engine with configuration:
    Workspace            | 1073741824 bytes (1024.00 MiB)
    Precision            | TF32: False, FP16: True, INT8: False, Obey Precision Constraints: False, Strict Types: False
    Tactic Sources       | ['CUBLAS', 'CUBLAS_LT', 'CUDNN', 'EDGE_MASK_CONVOLUTIONS']
    Safety Restricted    | False
    Refittable           | False
    Profiles             | 1 profile(s)
[09/19/2022-05:13:05] [TRT] [W] Weights [name=bert.embeddings.token_type_embeddings.weight] had the following issues when converted to FP16:
[09/19/2022-05:13:05] [TRT] [W]  - Subnormal FP16 values detected.
[09/19/2022-05:13:05] [TRT] [W] If this is not the desired behavior, please modify the weights or retrain with regularization to reduce the magnitude of the weights.
...
same as fp16
...
[I] Saving engine to model_32_to_16.plan
[I] trt-runner-N0-09/19/22-05:12:58
    ---- Inference Input(s) ----
    {input_ids [dtype=int32, shape=(1, 128)],
     attention_mask [dtype=int32, shape=(1, 128)],
     rel_mask [dtype=int32, shape=(1, 128)]}
[I] trt-runner-N0-09/19/22-05:12:58
    ---- Inference Output(s) ----
    {outputs [dtype=float32, shape=(1, 128, 29)]}
[I] trt-runner-N0-09/19/22-05:12:58     | Completed 1 iteration(s) in 1.753 ms | Average inference time: 1.753 ms.
[I] onnxrt-runner-N0-09/19/22-05:12:58  | Activating and starting inference
[I] Creating ONNX-Runtime Inference Session with providers: ['CPUExecutionProvider']
[I] onnxrt-runner-N0-09/19/22-05:12:58
    ---- Inference Input(s) ----
    {input_ids [dtype=int32, shape=(1, 128)],
     attention_mask [dtype=int32, shape=(1, 128)],
     rel_mask [dtype=int32, shape=(1, 128)]}
[I] onnxrt-runner-N0-09/19/22-05:12:58
    ---- Inference Output(s) ----
    {outputs [dtype=float32, shape=(1, 128, 29)]}
[I] onnxrt-runner-N0-09/19/22-05:12:58  | Completed 1 iteration(s) in 29.69 ms | Average inference time: 29.69 ms.
[I] Accuracy Comparison | trt-runner-N0-09/19/22-05:12:58 vs. onnxrt-runner-N0-09/19/22-05:12:58
[I]     Comparing Output: 'outputs' (dtype=float32, shape=(1, 128, 29)) with 'outputs' (dtype=float32, shape=(1, 128, 29))
[I]     Tolerance: [abs=1e-05, rel=1e-05] | Checking elemwise error
[I]         trt-runner-N0-09/19/22-05:12:58: outputs | Stats: mean=-9.4589, std-dev=15.899, var=252.79, median=-6.9727, min=-42.312 at (0, 31, 21), max=33.781 at (0, 0, 0), avg-magnitude=13.914
[I]             ---- Histogram ----
                Bin Range      |  Num Elems | Visualization
                (-51.2, -41.7) |        119 | ####
                (-41.7, -32.2) |        265 | ##########
                (-32.2, -22.6) |        256 | ##########
                (-22.6, -13.1) |        768 | ##############################
                (-13.1, -3.57) |        899 | ###################################
                (-3.57, 5.96 ) |       1021 | ########################################
                (5.96 , 15.5 ) |        128 | #####
                (15.5 , 25   ) |        128 | #####
                (25   , 34.5 ) |        128 | #####
                (34.5 , 44.1 ) |          0 |
[I]         onnxrt-runner-N0-09/19/22-05:12:58: outputs | Stats: mean=-11.739, std-dev=20.581, var=423.57, median=-12.878, min=-51.211 at (0, 54, 21), max=44.076 at (0, 54, 0), avg-magnitude=19.056
[I]             ---- Histogram ----
                Bin Range      |  Num Elems | Visualization
                (-51.2, -41.7) |        258 | ############
                (-41.7, -32.2) |        254 | ############
                (-32.2, -22.6) |        626 | ###############################
                (-22.6, -13.1) |        615 | ##############################
                (-13.1, -3.57) |        805 | ########################################
                (-3.57, 5.96 ) |        504 | #########################
                (5.96 , 15.5 ) |        394 | ###################
                (15.5 , 25   ) |        128 | ######
                (25   , 34.5 ) |          0 |
                (34.5 , 44.1 ) |        128 | ######
[I]         Error Metrics: outputs
[I]             Minimum Required Tolerance: elemwise error | [abs=16.238] OR [rel=146.1] (requirements may be lower if both abs/rel tolerances are set)
[I]             Absolute Difference | Stats: mean=6.1949, std-dev=4.0543, var=16.437, median=5.6168, min=0.0017872 at (0, 82, 22), max=16.238 at (0, 91, 19), avg-magnitude=6.1949
[I]                 ---- Histogram ----
                    Bin Range       |  Num Elems | Visualization
                    (0.00179, 1.63) |        631 | ######################################
                    (1.63   , 3.25) |        397 | ########################
                    (3.25   , 4.87) |        656 | ########################################
                    (4.87   , 6.5 ) |        325 | ###################
                    (6.5    , 8.12) |        461 | ############################
                    (8.12   , 9.74) |        506 | ##############################
                    (9.74   , 11.4) |        407 | ########################
                    (11.4   , 13  ) |         74 | ####
                    (13     , 14.6) |        127 | #######
                    (14.6   , 16.2) |        128 | #######
[I]             Relative Difference | Stats: mean=0.56562, std-dev=2.4804, var=6.1523, median=0.32326, min=9.2176e-05 at (0, 82, 22), max=146.1 at (0, 104, 23), avg-magnitude=0.56562
[I]                 ---- Histogram ----
                    Bin Range        |  Num Elems | Visualization
                    (9.22e-05, 14.6) |       3710 | ########################################
                    (14.6    , 29.2) |          1 |
                    (29.2    , 43.8) |          0 |
                    (43.8    , 58.4) |          0 |
                    (58.4    , 73.1) |          0 |
                    (73.1    , 87.7) |          0 |
                    (87.7    , 102 ) |          0 |
                    (102     , 117 ) |          0 |
                    (117     , 131 ) |          0 |
                    (131     , 146 ) |          1 |
[E]         FAILED | Difference exceeds tolerance (rel=1e-05, abs=1e-05)
[E]     FAILED | Mismatched outputs: ['outputs']
[!] FAILED | Command: /usr/local/bin/polygraphy run model_32.onnx --trt --onnxrt --save-engine=model_32_to_16.plan --fp16 --pool-limit workspace:1G

real input and output

import torch
import tensorrt as trt
import onnxruntime


fp = 16

device = torch.device('cuda:0')
with open(f'model_{fp}.plan', 'rb') as f:
    serialized_engine = f.read()

logger = trt.Logger(trt.Logger.WARNING)
runtime = trt.Runtime(logger)
engine = runtime.deserialize_cuda_engine(serialized_engine)
context = engine.create_execution_context()

input_ids = [[2, 4665, 6964, 6091, 7119, 5837, 6691, 6124, 2155, 5575, 7020, 5468, 7095, 1311, 6267, 7828, 5073, 7088, 4756, 517, 6278, 7897, 6679, 6855, 2479, 7095, 2557, 2792, 6896, 2235, 6166, 7473, 46, 1255, 7794, 3574, 7214, 7486, 7234, 2892, 6896, 1659, 7836, 2872, 3860, 4159, 2238, 2184, 839, 7088, 517, 5330, 6615, 7941, 517, 7787, 4025, 7100, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
attention_mask = [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
rel_mask = [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]

batch_size, seq_length = 1, 128  # must match the shapes used at export time

input_ids = torch.tensor(input_ids, dtype=torch.int32, device=device)
attention_mask = torch.tensor(attention_mask, dtype=torch.int32, device=device)
rel_mask = torch.tensor(rel_mask, dtype=torch.int32, device=device)
outputs = (torch.empty(batch_size, seq_length, 29, dtype=torch.float16 if fp == 16 else torch.float32, device=device),)

buffers = [input_ids.data_ptr(), attention_mask.data_ptr(), rel_mask.data_ptr(), outputs[0].data_ptr()]
stream = torch.cuda.Stream()

context.execute_async_v2(buffers, stream.cuda_stream)
stream.synchronize()

ort_session = onnxruntime.InferenceSession(f'model_{fp}.onnx')
onnx_output = torch.tensor(ort_session.run(None, {'input_ids': input_ids.cpu().numpy(),
                                                'attention_mask': attention_mask.cpu().numpy(),
                                                'rel_mask': rel_mask.cpu().numpy()})[0]).to(device)

print(outputs[0].argmax(2).tolist()[0][:59])
print(onnx_output.argmax(2).tolist()[0][:59])

fp32 output

onnx

[0, 1, 2, 2, 2, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 23, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

tensorrt

[0, 1, 2, 2, 2, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 23, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

fp16 output

onnx

[0, 1, 2, 2, 2, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 23, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

tensorrt

[0, 1, 2, 2, 2, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 24, 24, 24, 0, 24, 0, 24, 24, 0, 24, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

fp32 to fp16 output

onnx(fp32)

[0, 1, 2, 2, 2, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 23, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

tensorrt

[0, 1, 2, 2, 2, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 24, 24, 24, 0, 24, 0, 24, 24, 0, 24, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
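The lists above are easier to compare programmatically than by eye. The following is a minimal sketch, not part of the original report: `mismatched_positions` is a hypothetical helper, and the two short lists are only the 11-label span copied from the fp16 ONNX and fp16 TensorRT outputs above where the predictions diverge (starting at the label 23), so the reported offsets are relative to that span.

```python
def mismatched_positions(reference, candidate):
    """Return the offsets at which two equally long label lists disagree."""
    if len(reference) != len(candidate):
        raise ValueError("label lists must have the same length")
    return [i for i, (r, c) in enumerate(zip(reference, candidate)) if r != c]


# Span copied from the fp16 outputs above, starting at the label 23.
onnx_span = [23, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24]
trt_span = [0, 24, 24, 24, 0, 24, 0, 24, 24, 0, 24]

diff = mismatched_positions(onnx_span, trt_span)
agreement = 1.0 - len(diff) / len(onnx_span)
print(diff, f"agreement={agreement:.2f}")
```

In the real script the same helper could be applied to the full `argmax(2).tolist()[0]` lists.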

With TensorRT 8.2.5 there is no 'Subnormal FP16 values detected.' warning, but the outputs are the same, so I don't think that warning is the cause here.

The difference between the fp32 ONNX and fp32 TensorRT outputs exceeds Polygraphy's default tolerance, but it is acceptable by my standards. The outputs for the real input are also identical.
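To make "exceeds the tolerance" concrete, here is a minimal sketch of how the smallest passing absolute and relative tolerances can be computed for two flat output lists. It assumes the relative error is measured against the reference (ONNX Runtime) values; Polygraphy's exact definition may differ, and `required_tolerances` and the toy values are hypothetical, not taken from the logs.

```python
def required_tolerances(candidate, reference):
    """Return (max absolute diff, max relative diff) over paired elements.

    The relative diff is taken against the reference value; elements whose
    reference is exactly zero are skipped for the relative metric.
    """
    max_abs = 0.0
    max_rel = 0.0
    for c, r in zip(candidate, reference):
        diff = abs(c - r)
        max_abs = max(max_abs, diff)
        if r != 0:
            max_rel = max(max_rel, diff / abs(r))
    return max_abs, max_rel


# Toy values for illustration only.
trt_out = [1.0, 2.0, 4.0]
ort_out = [1.0, 2.5, 4.0]
abs_tol, rel_tol = required_tolerances(trt_out, ort_out)
print(abs_tol, rel_tol)
```

A comparison passes an absolute-only check when `abs_tol` is at or below the chosen threshold, which is why an fp32 run with a maximum difference around 4e-4 fails a 1e-5 default while still being numerically close.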

The outputs of the fp32 ONNX and fp16 ONNX models are almost the same. Even so, I can accept that converting the fp32 ONNX model to an fp16 TensorRT engine changes the outputs, since that could happen during the optimization process.

However, I can't understand why there is such a large difference between the fp16 ONNX model and the fp16 TensorRT engine built from it.

Sep 19 '22 22:09 Ri0S

I can reproduce this with TRT 8.5.0.9, but I cannot confirm that it is an accuracy bug, since I see some pow layers that may amplify the diff. @pranavm-nvidia @ttyio @nzmora what do you think? Can we file an internal bug for this?

[I]         trt-runner-N0-09/21/22-07:30:21: outputs | Stats: mean=-9.4583, std-dev=15.905, var=252.96, median=-6.9375, min=-42.312 at (0, 31, 21), max=33.781 at (0, 0, 0), avg-magnitude=13.911
[I]             ---- Histogram ----
                Bin Range      |  Num Elems | Visualization
                (-51.2, -41.7) |        119 | ####
                (-41.7, -32.2) |        265 | ##########
                (-32.2, -22.6) |        256 | ##########
                (-22.6, -13.1) |        768 | ##############################
                (-13.1, -3.57) |        898 | ###################################
                (-3.57, 5.96 ) |       1022 | ########################################
                (5.96 , 15.5 ) |        128 | #####
                (15.5 , 25   ) |        128 | #####
                (25   , 34.5 ) |        128 | #####
                (34.5 , 44.1 ) |          0 |
[I]         onnxrt-runner-N0-09/21/22-07:30:21: outputs | Stats: mean=-11.739, std-dev=20.581, var=423.57, median=-12.878, min=-51.211 at (0, 54, 21), max=44.076 at (0, 54, 0), avg-magnitude=19.056
[I]             ---- Histogram ----
                Bin Range      |  Num Elems | Visualization
                (-51.2, -41.7) |        258 | ############
                (-41.7, -32.2) |        254 | ############
                (-32.2, -22.6) |        626 | ###############################
                (-22.6, -13.1) |        615 | ##############################
                (-13.1, -3.57) |        805 | ########################################
                (-3.57, 5.96 ) |        504 | #########################
                (5.96 , 15.5 ) |        394 | ###################
                (15.5 , 25   ) |        128 | ######
                (25   , 34.5 ) |          0 |
                (34.5 , 44.1 ) |        128 | ######
[I]         Error Metrics: outputs
[I]             Minimum Required Tolerance: elemwise error | [abs=16.245] OR [rel=138.62] (requirements may be lower if both abs/rel tolerances are set)
[I]             Absolute Difference | Stats: mean=6.1925, std-dev=4.0579, var=16.466, median=5.6266, min=0.0012302 at (0, 31, 1), max=16.245 at (0, 91, 19), avg-magnitude=6.1925
[I]                 ---- Histogram ----
                    Bin Range       |  Num Elems | Visualization
                    (0.00123, 1.63) |        627 | ######################################
                    (1.63   , 3.25) |        403 | ########################
                    (3.25   , 4.87) |        654 | ########################################
                    (4.87   , 6.5 ) |        325 | ###################
                    (6.5    , 8.12) |        463 | ############################
                    (8.12   , 9.75) |        511 | ###############################
                    (9.75   , 11.4) |        400 | ########################
                    (11.4   , 13  ) |         74 | ####
                    (13     , 14.6) |        127 | #######
                    (14.6   , 16.2) |        128 | #######
[I]             Relative Difference | Stats: mean=0.56176, std-dev=2.3574, var=5.5571, median=0.32326, min=0.00014756 at (0, 31, 1), max=138.62 at (0, 104, 23), avg-magnitude=0.56176
[I]                 ---- Histogram ----
                    Bin Range        |  Num Elems | Visualization
                    (0.000148, 13.9) |       3710 | ########################################
                    (13.9    , 27.7) |          1 |
                    (27.7    , 41.6) |          0 |
                    (41.6    , 55.4) |          0 |
                    (55.4    , 69.3) |          0 |
                    (69.3    , 83.2) |          0 |
                    (83.2    , 97  ) |          0 |
                    (97      , 111 ) |          0 |
                    (111     , 125 ) |          0 |
                    (125     , 139 ) |          1 |
[E]         FAILED | Difference exceeds tolerance (rel=1e-05, abs=1e-05)
[E]     FAILED | Mismatched outputs: ['outputs']
[!] FAILED | Command: /home/zeroz/.local/bin/polygraphy run model_32.onnx --trt --onnxrt --fp16

Sep 21 '22 14:09 zerollzeng

The FP32 TRT and FP32 ONNXRT outputs are very close.

Sep 21 '22 14:09 zerollzeng

I've filed internal bug 3813586 to track this.

Sep 29 '22 14:09 zerollzeng

This is not a bug.

The model has a LayerNorm subgraph in it, and when it runs in fp16 the results differ between ORT and TRT. This happens because the subgraph's intermediate values exceed the fp16 range pretty quickly: ORT uses the saturated value, while TRT treats it as inf (which is correct).
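The effect of the two overflow policies can be illustrated with a small, self-contained sketch. It emulates fp16 rounding with the standard `struct` half-precision format and computes the variance term of a LayerNorm (mean of squared deviations) under each policy. This is only an illustration of the behaviour described above, not the actual ORT or TRT kernels; `to_fp16`, `fp16_variance`, and the input values are hypothetical.

```python
import math
import struct

FP16_MAX = 65504.0  # largest finite half-precision value


def to_fp16(x, saturate):
    """Round x to fp16; on overflow either clamp to FP16_MAX or return inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        limit = FP16_MAX if saturate else math.inf
        return math.copysign(limit, x)


def fp16_variance(values, saturate):
    """Mean of squared deviations with every intermediate rounded to fp16."""
    n = len(values)
    mean = to_fp16(sum(values) / n, saturate)
    total = 0.0
    for v in values:
        squared = to_fp16(to_fp16(v - mean, saturate) ** 2, saturate)
        total = to_fp16(total + squared, saturate)
    return to_fp16(total / n, saturate)


# Activations of a few hundred are enough: 300**2 = 90000 > 65504.
acts = [300.0, -300.0, 0.0, 0.0]
saturated = fp16_variance(acts, saturate=True)   # clamping policy
ieee = fp16_variance(acts, saturate=False)       # overflow-to-inf policy
exact = sum(a * a for a in acts) / len(acts)     # reference value, 45000.0
print(saturated, ieee, exact)
```

Neither fp16 result matches the exact value, and the two policies disagree with each other, which is consistent with a large ORT vs TRT diff that disappears once the pow and reduce-mean steps run in fp32.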

Forcing the pow and reducemean nodes to run in fp32 makes the diff much smaller:
Minimum Required Tolerance: elemwise error | [abs=0.09375] OR [rel=0.46764] (requirements may be lower if both abs/rel tolerances are set)

For comparison, fp32 generates this diff:
Minimum Required Tolerance: elemwise error | [abs=2.6703e-05] OR [rel=0.0033357] (requirements may be lower if both abs/rel tolerances are set)

and the original fp16 diff is:
Minimum Required Tolerance: elemwise error | [abs=16.188] OR [rel=15.535] (requirements may be lower if both abs/rel tolerances are set)
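One way to force specific layers to fp32 is through the TensorRT Python API. The sketch below is an assumption about how this could be done, not the exact procedure used for the numbers above: it assumes the ONNX parser names layers after the ONNX nodes (for example `Pow_123` or `ReduceMean_45`), and `needs_fp32` and `build_mixed_precision_engine` are hypothetical helper names. TensorRT is imported lazily so the name filter can be checked on a machine without it.

```python
def needs_fp32(layer_name, keywords=("Pow", "ReduceMean")):
    """Return True when the layer name suggests a LayerNorm pow/reduce-mean node."""
    return any(keyword in layer_name for keyword in keywords)


def build_mixed_precision_engine(onnx_path, plan_path):
    """Build an FP16 engine while pinning the matching layers to FP32."""
    import tensorrt as trt  # lazy import: only needed when actually building

    logger = trt.Logger(trt.Logger.WARNING)
    builder = trt.Builder(logger)
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
    parser = trt.OnnxParser(network, logger)
    with open(onnx_path, "rb") as f:
        if not parser.parse(f.read()):
            raise RuntimeError(parser.get_error(0).desc())

    config = builder.create_builder_config()
    config.set_flag(trt.BuilderFlag.FP16)
    # Ask the builder to honor the per-layer precisions set below.
    config.set_flag(trt.BuilderFlag.OBEY_PRECISION_CONSTRAINTS)

    for i in range(network.num_layers):
        layer = network.get_layer(i)
        if needs_fp32(layer.name):
            layer.precision = trt.float32
            layer.set_output_type(0, trt.float32)

    engine_bytes = builder.build_serialized_network(network, config)
    if engine_bytes is None:
        raise RuntimeError("engine build failed")
    with open(plan_path, "wb") as f:
        f.write(engine_bytes)


# Name filter only; building an engine requires TensorRT and a GPU.
print([n for n in ("Pow_123", "ReduceMean_45", "MatMul_7") if needs_fp32(n)])
```

A call such as `build_mixed_precision_engine('model_16.onnx', 'model_16_mixed.plan')` would then produce an engine that can be compared against ONNX Runtime again; the output file name here is only an example.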

@Ri0S can we close this?

Oct 04 '22 10:10 zerollzeng

@zerollzeng Thank you for the answer! I understand.

Oct 04 '22 10:10 Ri0S
