
FP16 engine works fine with trtexec but produces NaN with the Python API

Open sandeepgadhwal opened this issue 1 year ago • 7 comments

Description

I am trying to create a TensorRT engine from an ONNX model:

trtexec --onnx=model.onnx --saveEngine=engine.trt --fp16

When I use trtexec for inference, it works fine:

trtexec --loadEngine=engine.trt --fp16 --exportOutput=fp16e.json

I can see that the output in fp16e.json is fine.

But when I try to run this engine file with the Python API, the output is NaN.

Environment

TensorRT Version: 8.6

NVIDIA GPU: RTX 3070

NVIDIA Driver Version: 535.154.05

CUDA Version: 12.2

CUDNN Version: 8.9

Operating System:

Python Version (if applicable): 3.10

Tensorflow Version (if applicable):

PyTorch Version (if applicable):

Baremetal or Container (if so, version):

Relevant Files

Model link:

Steps To Reproduce

Commands or scripts:

Here is the Python script I am using: tensorrt_test_fps.py.txt
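
In case it helps, the script boils down to roughly the following (a trimmed-down sketch, not the exact attached code; binding handling is simplified and static shapes are assumed):

import numpy as np
import pycuda.autoinit  # noqa: F401  (creates a CUDA context)
import pycuda.driver as cuda
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
with open("engine.trt", "rb") as f, trt.Runtime(logger) as runtime:
    engine = runtime.deserialize_cuda_engine(f.read())
context = engine.create_execution_context()

# Allocate one host/device buffer per binding, feeding inputs in the dtype the
# engine expects (feeding an FP16 input binding with FP32 data is a classic NaN source).
host_bufs, dev_bufs = [], []
for i in range(engine.num_bindings):
    shape = tuple(engine.get_binding_shape(i))
    dtype = trt.nptype(engine.get_binding_dtype(i))
    if engine.binding_is_input(i):
        host = np.random.rand(*shape).astype(dtype)
    else:
        host = np.empty(shape, dtype=dtype)
    dev = cuda.mem_alloc(host.nbytes)
    if engine.binding_is_input(i):
        cuda.memcpy_htod(dev, host)
    host_bufs.append(host)
    dev_bufs.append(dev)

context.execute_v2([int(d) for d in dev_bufs])

for i in range(engine.num_bindings):
    if not engine.binding_is_input(i):
        cuda.memcpy_dtoh(host_bufs[i], dev_bufs[i])
        out = host_bufs[i]
        has_nan = np.isnan(out).any() if np.issubdtype(out.dtype, np.floating) else False
        print(engine.get_binding_name(i), "has NaN:", has_nan)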

Have you tried the latest release?: yes

Can this model run on other frameworks? For example run ONNX model with ONNXRuntime (polygraphy run <model.onnx> --onnxrt): Yes, it runs fine when using the CUDAExecutionProvider with ONNX Runtime, but with the TensorrtExecutionProvider I get the same issue.
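
For reference, the ONNX Runtime check looks roughly like this (the input name and shape are placeholders, and the TensorRT EP option name comes from the ORT docs):

import numpy as np
import onnxruntime as ort

# Placeholder input name/shape; replace with the model's actual input.
feed = {"input": np.zeros((1, 3, 224, 224), dtype=np.float32)}

sess_cuda = ort.InferenceSession("model.onnx", providers=["CUDAExecutionProvider"])
print("CUDA EP NaN:", np.isnan(sess_cuda.run(None, feed)[0]).any())

sess_trt = ort.InferenceSession(
    "model.onnx",
    providers=[("TensorrtExecutionProvider", {"trt_fp16_enable": True}),
               "CUDAExecutionProvider"],
)
print("TRT EP NaN:", np.isnan(sess_trt.run(None, feed)[0]).any())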

sandeepgadhwal · Feb 09 '24

I am not sure if the values are overflowing.

  • How can I check whether any activation is overflowing?
  • Why does trtexec work fine while only the Python API returns NaN values?

sandeepgadhwal · Feb 09 '24

One thing I found while chasing the NaNs: with random inputs the model does not output NaN, but with all-zero inputs it does. What does trtexec use as input for the model?

sandeepgadhwal · Feb 09 '24

Could you please try validating the output with polygraphy run model.onnx --trt --fp16 --onnxrt?

See https://github.com/NVIDIA/TensorRT/tree/main/tools/Polygraphy
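
If you'd rather stay in Python, the same comparison can be scripted with the Polygraphy API, roughly like this (a sketch; see the API examples in the repo for the full pattern):

from polygraphy.backend.onnxrt import OnnxrtRunner, SessionFromOnnx
from polygraphy.backend.trt import (CreateConfig, EngineFromNetwork,
                                    NetworkFromOnnxPath, TrtRunner)
from polygraphy.comparator import Comparator, CompareFunc

# Build a TensorRT FP16 engine from the ONNX model and compare it against ONNX Runtime.
build_engine = EngineFromNetwork(NetworkFromOnnxPath("model.onnx"),
                                 config=CreateConfig(fp16=True))
runners = [TrtRunner(build_engine), OnnxrtRunner(SessionFromOnnx("model.onnx"))]

results = Comparator.run(runners)
success = bool(Comparator.compare_accuracy(
    results, compare_func=CompareFunc.simple(atol=1e-5, rtol=1e-5)))
print("PASSED" if success else "FAILED")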

zerollzeng · Feb 13 '24

The output of this command is:

[E]         FAILED | Output: '37293' | Difference exceeds tolerance (rel=1e-05, abs=1e-05)
[E]     FAILED | Mismatched outputs: ['output', '37293']
[E] Accuracy Summary | trt-runner-N0-02/13/24-17:35:13 vs. onnxrt-runner-N0-02/13/24-17:35:13 | Passed: 0/1 iterations | Pass Rate: 0.0%
[E] FAILED | Runtime: 461.335s |

sandeepgadhwal · Feb 13 '24

On increasing the tolerance, it passes:

[I]         PASSED | Output: '37293' | Difference is within tolerance (rel=0.1, abs=0.1)
[I]     PASSED | All outputs matched | Outputs: ['output', '37293']
[I] Accuracy Summary | trt-runner-N0-02/13/24-17:55:15 vs. onnxrt-runner-N0-02/13/24-17:55:15 | Passed: 1/1 iterations | Pass Rate: 100.0%

sandeepgadhwal · Feb 13 '24

This happened to me because of the model weights: some were smaller than the FP16 minimum value. Those weights get flushed to zero, and then some divisions by zero produce NaNs. To confirm whether the problem stems from your weights, try building the engine in FP32 (i.e. without --fp16). If that fixes it, retrain your model with FP16 precision before exporting to ONNX.
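
A quick way to check this is to scan the ONNX initializers for values outside FP16's representable range, for example (a sketch, assuming the file is model.onnx):

import numpy as np
import onnx
from onnx import numpy_helper

FP16_MAX = float(np.finfo(np.float16).max)    # 65504: larger magnitudes overflow to inf
FP16_TINY = float(np.finfo(np.float16).tiny)  # ~6.1e-05: smaller nonzero values go subnormal or flush to zero

model = onnx.load("model.onnx")
for init in model.graph.initializer:
    w = numpy_helper.to_array(init)
    if w.size == 0 or not np.issubdtype(w.dtype, np.floating):
        continue
    mags = np.abs(w.astype(np.float64))
    nonzero = mags[mags > 0]
    too_big = mags.max() > FP16_MAX
    too_small = nonzero.size > 0 and nonzero.min() < FP16_TINY
    if too_big or too_small:
        print(f"{init.name}: max |w|={mags.max():.3g}, min nonzero |w|={nonzero.min():.3g}")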

Data-Iab · Feb 14 '24

I previously had little knowledge of how to debug the internals of a TensorRT engine, but now I am able to debug by adding extra output bindings to find the source of the overflow/underflow. My original question was how to do this kind of debugging.
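
For anyone hitting the same thing, the idea boils down to marking intermediate tensors as extra network outputs when building the engine, so they show up as additional bindings you can dump and inspect. Roughly like this (a sketch, not my exact script; marking outputs can change layer fusions, and shape/bool tensors may need to be skipped):

import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)
with open("model.onnx", "rb") as f:
    if not parser.parse(f.read()):
        raise RuntimeError(parser.get_error(0))

# Mark every layer's outputs as network outputs so they become engine bindings.
for i in range(network.num_layers):
    layer = network.get_layer(i)
    for j in range(layer.num_outputs):
        t = layer.get_output(j)
        if t is not None and not t.is_network_output:
            network.mark_output(t)

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)
plan = builder.build_serialized_network(network, config)
with open("engine_debug.trt", "wb") as f:
    f.write(plan)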

Meanwhile, I found that if we use the --best flag instead of --fp16, it works fine and I can see meaningful output from the model. It seems to automatically leave the layers that need FP32 in FP32, whereas the --fp16 flag seems to be more restrictive.
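
If someone wants the manual equivalent of what --best seems to do, I think you can pin the offending layers to FP32 while keeping the rest in FP16, something like this (an untested sketch; the layer names are hypothetical):

import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)
with open("model.onnx", "rb") as f:
    parser.parse(f.read())

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)
# Make TensorRT honor the per-layer precisions requested below.
config.set_flag(trt.BuilderFlag.OBEY_PRECISION_CONSTRAINTS)

keep_fp32 = {"/model/div_0", "/model/softmax_0"}  # hypothetical layer names
for i in range(network.num_layers):
    layer = network.get_layer(i)
    if layer.name in keep_fp32:
        layer.precision = trt.float32
        for j in range(layer.num_outputs):
            layer.set_output_type(j, trt.float32)

plan = builder.build_serialized_network(network, config)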

But I agree with you @Data-Iab that to get a good model we need to retrain the model.

sandeepgadhwal · Feb 15 '24