TensorRT8 int8 model convtranspose layer output error

Open zhangjoey115 opened this issue 3 years ago • 9 comments

Description

Hello,

I'm debugging an INT8 model and want to mark some model layers as outputs, but I got an error when I tried the "mark all" command. image

Because "mark all" works fine with normal float ONNX models, I'm wondering whether TensorRT does not support the "mark all" function for INT8 models. Is there some other way to get every layer's output for debugging an INT8 model? Could you help me with this issue? Thanks!

Environment

TensorRT Version: 8.4
NVIDIA GPU: 2080Ti
NVIDIA Driver Version:
CUDA Version: 11.6
CUDNN Version: 8.3
Operating System: Ubuntu18.04
Python Version (if applicable): 3.7
Tensorflow Version (if applicable):
PyTorch Version (if applicable):
Baremetal or Container (if so, version):

Relevant Files

Steps To Reproduce

zhangjoey115 avatar Apr 30 '22 05:04 zhangjoey115

Hello @zhangjoey115, I assume you are using the mark all function in Polygraphy. First, mark all can hide some issues when debugging accuracy. Without marking all outputs, some layers are fused together, but marking all outputs breaks those fusions. The fused and unfused kernels do not share the same implementation, so even if you have tuned accuracy for the unfused version, you might still have an accuracy issue in the fused version.

Second, mark all does not work for networks that contain boolean tensors or loops. This is a TRT internal implementation issue, and I am not sure whether you hit it.

So we usually recommend checking the fusion log by enabling the verbose log level, checking the layer information section in the log, and marking only the fused layers' outputs as network outputs.

ttyio avatar May 06 '22 14:05 ttyio
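
The recommendation above (verbose logging plus marking only selected outputs) can be sketched as a small Python helper that assembles a Polygraphy command line. This is only an illustration: the helper name `build_polygraphy_cmd` and the tensor names `relu_out` and `convtranspose_out` are hypothetical, and it assumes the chosen tensor names exist under the same name in both the ONNX model and the TensorRT network. Depending on the Polygraphy version, a higher verbosity level may be needed before the layer information appears in the log.

```python
import shlex


def build_polygraphy_cmd(model_path, output_names):
    """Assemble a `polygraphy run` command that compares ONNX Runtime and
    TensorRT INT8 results for a few explicitly marked tensors."""
    cmd = ["polygraphy", "run", model_path, "--onnxrt", "--trt", "--int8"]
    # Mark the same tensors as outputs for both runners so they can be compared.
    cmd += ["--onnx-outputs", *output_names]
    cmd += ["--trt-outputs", *output_names]
    # Verbose logging, so the build log can be inspected for fusion details.
    cmd += ["--verbose"]
    return cmd


# Hypothetical tensor names; replace them with names from your own model.
cmd = build_polygraphy_cmd("convtrans.onnx", ["relu_out", "convtranspose_out"])
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd)` or pasted into a shell after printing.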

@ttyio Thanks a lot for the detailed reply! Yes, I'm using Polygraphy to debug the accuracy issue. I tried marking only several layers and it works!

But when I compared the layer outputs between the float model and the INT8 TRT model, I found that one ConvTranspose2d layer gives quite different output from the float model. The inputs to ConvTranspose2d (= the Relu output in the following picture) in the 2 models are almost the same (diff < 5%), but the outputs are quite different (diff = 20% to > 100%). image

Do you have any clue about this difference, and is there any way to debug into the ConvTranspose2d calculation in the TRT model?

zhangjoey115 avatar May 07 '22 13:05 zhangjoey115

@ttyio By the way, I've already fixed the "axis=(0)" in the pytorch_quantization code for convtranspose2d, but the results still differ a lot. Is there any other way to fix this problem? Or should I not quantize all convtranspose layers? Thanks!

zhangjoey115 avatar May 09 '22 12:05 zhangjoey115
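
For readers unfamiliar with the "axis=(0)" remark in the comment above: PyTorch stores ConvTranspose2d weights as (in_channels, out_channels / groups, kH, kW), so per-output-channel scales are taken along axis 1 rather than axis 0. The following is a minimal pure-Python sketch of that difference. The function `per_channel_amax` and the toy weight values are made up for illustration and are not the pytorch_quantization implementation.

```python
def _flatten(x):
    """Yield every scalar in an arbitrarily nested list."""
    if isinstance(x, (int, float)):
        yield x
    else:
        for item in x:
            yield from _flatten(item)


def per_channel_amax(weight, axis):
    """Max absolute value per index along `axis` (0 or 1) of a 4-D nested list."""
    if axis == 0:
        return [max(abs(v) for v in _flatten(w)) for w in weight]
    if axis == 1:
        n = len(weight[0])
        return [max(abs(v) for w in weight for v in _flatten(w[o])) for o in range(n)]
    raise ValueError("only axis 0 or 1 is supported in this sketch")


# Toy ConvTranspose2d weight: in_channels=2, out_channels=3, 1x1 kernel.
weight = [
    [[[0.5]], [[-2.0]], [[0.1]]],
    [[[-1.0]], [[0.25]], [[4.0]]],
]

amax_axis0 = per_channel_amax(weight, axis=0)  # one value per input channel
amax_axis1 = per_channel_amax(weight, axis=1)  # one value per output channel
# Symmetric INT8 scale per output channel: amax / 127.
scales = [a / 127.0 for a in amax_axis1]
print(amax_axis0, amax_axis1)
```

With axis 0 the sketch yields two values (one per input channel), while axis 1 yields three (one per output channel), which is the granularity per-channel weight quantization normally expects.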

@zhangjoey115, sorry for the delayed response. Could you share a simple repro ONNX model? Thanks!

ttyio avatar May 17 '22 14:05 ttyio

@ttyio Thanks for the response! Please find the attached model for reference. The output from the final convtranspose differs between ONNX and TRT. Looking forward to your reply! convtrans.zip

zhangjoey115 avatar May 19 '22 04:05 zhangjoey115

Hello @ttyio, do you have any suggestions about the convtranspose processing in the model I attached? Or did I make a mistake in comparing the ONNX and TRT outputs? Looking forward to your reply. Thanks!

zhangjoey115 avatar Jun 10 '22 03:06 zhangjoey115

I can repro this accuracy issue using Polygraphy.

[I] Accuracy Comparison | onnxrt-runner-N0-06/14/22-23:03:14 vs. trt-runner-N0-06/14/22-23:03:14
[I]     Comparing Output: 'output' (dtype=float32, shape=(1, 128, 128, 240)) with 'output' (dtype=float32, shape=(1, 128, 128, 240))
[I]     Tolerance: [abs=1e-05, rel=1e-05] | Checking elemwise error
[I]         onnxrt-runner-N0-06/14/22-23:03:14: output | Stats: mean=-0.008292, std-dev=0.049881, var=0.0024881, median=0.0014456, min=-0.30628 at (0, 109, 6, 66), max=0.14754 at (0, 37, 4, 148), avg-magnitude=0.032867
[I]             ---- Histogram ----
                Bin Range          |  Num Elems | Visualization
                (-0.306 , -0.261 ) |       1091 |
                (-0.261 , -0.216 ) |      27378 |
                (-0.216 , -0.17  ) |      12329 |
                (-0.17  , -0.125 ) |     170305 | ###
                (-0.125 , -0.0794) |     117975 | ##
                (-0.0794, -0.034 ) |     366587 | ########
                (-0.034 , 0.0114 ) |    1827683 | ########################################
                (0.0114 , 0.0568 ) |    1297045 | ############################
                (0.0568 , 0.102  ) |     109710 | ##
                (0.102  , 0.148  ) |       2057 |
[I]         trt-runner-N0-06/14/22-23:03:14: output | Stats: mean=-0.007906, std-dev=0.051306, var=0.0026323, median=0.0011116, min=-0.29945 at (0, 109, 3, 7), max=0.14445 at (0, 118, 1, 30), avg-magnitude=0.034616
[I]             ---- Histogram ----
                Bin Range          |  Num Elems | Visualization
                (-0.306 , -0.261 ) |        295 |
                (-0.261 , -0.216 ) |      16885 |
                (-0.216 , -0.17  ) |      48533 | #
                (-0.17  , -0.125 ) |     131221 | ##
                (-0.125 , -0.0794) |     143633 | ###
                (-0.0794, -0.034 ) |     395246 | ########
                (-0.034 , 0.0114 ) |    1758043 | ########################################
                (0.0114 , 0.0568 ) |    1253736 | ############################
                (0.0568 , 0.102  ) |     179307 | ####
                (0.102  , 0.148  ) |       5261 |
[I]         Error Metrics: output
[I]             Minimum Required Tolerance: elemwise error | [abs=0.12891] OR [rel=1.1654e+06] (requirements may be lower if both abs/rel tolerances are set)
[I]             Absolute Difference | Stats: mean=0.011486, std-dev=0.0092949, var=8.6396e-05, median=0.009384, min=3.7253e-09 at (0, 117, 55, 151), max=0.12891 at (0, 37, 4, 148), avg-magnitude=0.011486
[I]                 ---- Histogram ----
                    Bin Range          |  Num Elems | Visualization
                    (3.73e-09, 0.0129) |    2524715 | ########################################
                    (0.0129  , 0.0258) |    1104125 | #################
                    (0.0258  , 0.0387) |     248857 | ###
                    (0.0387  , 0.0516) |      40561 |
                    (0.0516  , 0.0645) |      10145 |
                    (0.0645  , 0.0773) |       3383 |
                    (0.0773  , 0.0902) |        295 |
                    (0.0902  , 0.103 ) |         56 |
                    (0.103   , 0.116 ) |         19 |
                    (0.116   , 0.129 ) |          4 |
[I]             Relative Difference | Stats: mean=3.8879, std-dev=673.27, var=4.533e+05, median=0.4068, min=1.1756e-07 at (0, 34, 125, 6), max=1.1654e+06 at (0, 50, 65, 61), avg-magnitude=3.8879
[I]                 ---- Histogram ----
                    Bin Range            |  Num Elems | Visualization
                    (1.18e-07, 1.17e+05) |    3932153 | ########################################
                    (1.17e+05, 2.33e+05) |          4 |
                    (2.33e+05, 3.5e+05 ) |          2 |
                    (3.5e+05 , 4.66e+05) |          0 |
                    (4.66e+05, 5.83e+05) |          0 |
                    (5.83e+05, 6.99e+05) |          0 |
                    (6.99e+05, 8.16e+05) |          0 |
                    (8.16e+05, 9.32e+05) |          0 |
                    (9.32e+05, 1.05e+06) |          0 |
                    (1.05e+06, 1.17e+06) |          1 |
[E]         FAILED | Difference exceeds tolerance (rel=1e-05, abs=1e-05)
[E]     FAILED | Mismatched outputs: ['output']
[!] FAILED | Command: /home/pohanh/.local/bin/polygraphy run --onnxrt --trt --int8 build/convtrans.onnx

I have filed an internal tracker (id: 3681310) and will let you know if we have any update.

nvpohanh avatar Jun 15 '22 06:06 nvpohanh
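
A note on reading the log above: the maximum relative difference (1.1654e+06) is far larger than the maximum absolute difference (0.12891), while the median relative difference is 0.4068. One likely reason is that relative error blows up wherever the reference value is close to zero. The sketch below illustrates this with made-up numbers, assuming relative error is defined as |test - reference| / |reference|; it is not Polygraphy's own code.

```python
def elementwise_errors(test, reference):
    """Return (absolute, relative) elementwise errors of `test` vs `reference`."""
    abs_err = [abs(t - r) for t, r in zip(test, reference)]
    # Relative error is undefined for a zero reference; report infinity there.
    rel_err = [
        a / abs(r) if r != 0 else float("inf")
        for a, r in zip(abs_err, reference)
    ]
    return abs_err, rel_err


# Made-up values: the last reference element is almost zero.
reference = [0.1, -0.25, 1e-8]
test = [0.11, -0.24, 0.01]

abs_err, rel_err = elementwise_errors(test, reference)
for a, r in zip(abs_err, rel_err):
    print(f"abs={a:.3g}  rel={r:.3g}")
```

All three elements have nearly the same absolute error, yet the near-zero reference element dominates the maximum relative error. For outputs centered around zero, as in this log, the absolute-difference statistics and the median relative difference are usually more informative than the maximum relative difference.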

This has been fixed internally and will be available in TRT 8.5 release. We are also adding this to our test suite so that it doesn't happen again in the future.

nvpohanh avatar Jul 25 '22 10:07 nvpohanh

Good news! Looking forward to the new TRT version!

zhangjoey115 avatar Jul 25 '22 15:07 zhangjoey115

Closing since there has been no activity for more than 14 days. Please reopen if you still have questions, thanks!

ttyio avatar Dec 12 '22 07:12 ttyio