
Significant output discrepancy between TensorRT engine and ONNX Runtime inference outputs

Open WoodieDudy opened this issue 7 months ago • 4 comments

Description

When running inference on a TensorRT engine built from an ONNX model, I observe significant discrepancies between TensorRT and ONNX Runtime outputs.
The difference is not minor: both the mean and the maximum absolute deviations are large across outputs.

Observed output deviations:

Output 0: Mean deviation = 3.970133066177368, Max deviation = 16.034887313842773

Environment

TensorRT Version: 10.9.0.34

NVIDIA GPU: A100 40GB

NVIDIA Driver Version: 550.127.05

CUDA Version: 12.8

CUDNN Version: 9.8.0

Operating System: Ubuntu 24.04

Python Version: 3.12.3

PyTorch Version: 2.6.0+cu124

Container: nvcr.io/nvidia/tensorrt:25.03-py3

Steps To Reproduce

Commands or scripts:

  1. Export the ONNX model from PyTorch using the provided script (https://gist.github.com/WoodieDudy/f91209ff64d3d84e1fab7d8860f18d42) or download the ONNX file directly (https://drive.google.com/file/d/1ItOgKQtcg47lqooq9G1Qi7pLz6qv1fYG/view?usp=sharing).
  2. Build TensorRT engine:
    trtexec --onnx=model_static.onnx --saveEngine=model_static.engine
    
  3. Run the following Python script for inference and comparison:
import numpy as np
import tensorrt as trt
import pycuda.driver as cuda
import pycuda.autoinit
import onnxruntime

class TRTInference:
    """Minimal wrapper around a deserialized TensorRT engine using the name-based tensor API."""

    def __init__(self, engine_path: str):
        with open(engine_path, 'rb') as f:
            engine_data = f.read()
        runtime = trt.Runtime(trt.Logger(trt.Logger.WARNING))
        self.engine = runtime.deserialize_cuda_engine(engine_data)
        self.context = self.engine.create_execution_context()
        self.stream = cuda.Stream()
        self.input_tensor_indices = []
        self.output_tensor_indices = []
        for i in range(self.engine.num_io_tensors):
            tensor_name = self.engine.get_tensor_name(i)
            mode = self.engine.get_tensor_mode(tensor_name)
            if mode == trt.TensorIOMode.INPUT:
                self.input_tensor_indices.append(i)
            else:
                self.output_tensor_indices.append(i)

    def infer(self, input_tensors: list):
        # Copy inputs to the device, allocate output buffers, run the engine, then copy results back to host.
        num_tensors = self.engine.num_io_tensors
        bindings = [None] * num_tensors
        for idx, tensor_index in enumerate(self.input_tensor_indices):
            input_array = input_tensors[idx]
            tensor_name = self.engine.get_tensor_name(tensor_index)
            self.context.set_input_shape(tensor_name, input_array.shape)
            input_mem = cuda.mem_alloc(input_array.nbytes)
            cuda.memcpy_htod_async(input_mem, input_array, self.stream)
            bindings[tensor_index] = int(input_mem)
        output_buffers = {}
        for tensor_index in self.output_tensor_indices:
            tensor_name = self.engine.get_tensor_name(tensor_index)
            out_shape = self.context.get_tensor_shape(tensor_name)
            dtype = trt.nptype(self.engine.get_tensor_dtype(tensor_name))
            nbytes = np.prod(out_shape) * np.dtype(dtype).itemsize
            output_mem = cuda.mem_alloc(int(nbytes))
            bindings[tensor_index] = int(output_mem)
            output_buffers[tensor_index] = (output_mem, out_shape, dtype)
        for i in range(num_tensors):
            tensor_name = self.engine.get_tensor_name(i)
            self.context.set_tensor_address(tensor_name, bindings[i])
        self.context.execute_async_v3(stream_handle=self.stream.handle)
        self.stream.synchronize()
        outputs = []
        for tensor_index in self.output_tensor_indices:
            output_mem, out_shape, dtype = output_buffers[tensor_index]
            host_output = np.empty(out_shape, dtype=dtype)
            cuda.memcpy_dtoh(host_output, output_mem)
            outputs.append(host_output)
        return outputs

batch_size = 16
input_x_np = np.random.rand(batch_size, 240_000).astype(np.float32)
input_xlen_np = np.ones((batch_size,), dtype=np.float32)
engine_path = 'model_static.engine'
inference_engine = TRTInference(engine_path)
trt_outputs = inference_engine.infer([input_x_np, input_xlen_np])
print("TensorRT outputs:")
for idx, output in enumerate(trt_outputs):
    print(f"Output {idx} shape: {output.shape}")

ort_session = onnxruntime.InferenceSession(
    'model_static.onnx',
    providers=['CUDAExecutionProvider'],
    disabled_optimizers=["SkipLayerNormFusion"]
)
input_names = [inp.name for inp in ort_session.get_inputs()]
ort_inputs = {input_names[0]: input_x_np, input_names[1]: input_xlen_np}
ort_outputs = ort_session.run(None, ort_inputs)
print("ONNX Runtime outputs:")
for idx, output in enumerate(ort_outputs):
    print(f"Output {idx} shape: {output.shape}")

for idx, (trt_output, ort_output) in enumerate(zip(trt_outputs, ort_outputs)):
    diff = np.abs(trt_output - ort_output)
    print(f"Output {idx}: Mean deviation = {np.mean(diff)}, Max deviation = {np.max(diff)}")

Have you tried the latest release?: Yes, using container 25.03.

WoodieDudy avatar Apr 10 '25 12:04 WoodieDudy

I would suggest testing with smaller subgraphs of the model to narrow down where the difference starts to occur. I see the model contains ops with accumulation, such as LayerNorm; setting the accumulation precision to fp32 can improve the accuracy.

The following Polygraphy command makes it easier to compare ONNX Runtime and TensorRT outputs:

polygraphy run --trt --onnxrt model_static.onnx

For verbose mode add -vv.
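
One concrete way to apply the fp32-accumulation suggestion is to set per-layer precision constraints when building the engine with the TensorRT Python builder API. This is only a rough sketch, under the assumption that the exported LayerNorms show up as NORMALIZATION layers; if the export decomposes them into primitive ops, filter by layer name instead:

import tensorrt as trt

# Hedged sketch, not a verified fix: parse the ONNX model and pin any
# normalization layers to fp32 before building the engine.
logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network()
parser = trt.OnnxParser(network, logger)
with open("model_static.onnx", "rb") as f:
    if not parser.parse(f.read()):
        raise RuntimeError(parser.get_error(0))

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.OBEY_PRECISION_CONSTRAINTS)
for i in range(network.num_layers):
    layer = network.get_layer(i)
    if layer.type == trt.LayerType.NORMALIZATION:
        layer.precision = trt.float32
        layer.set_output_type(0, trt.float32)

serialized = builder.build_serialized_network(network, config)
with open("model_static.engine", "wb") as f:
    f.write(serialized)

The trtexec equivalents are --precisionConstraints=obey together with --layerPrecisions. Since the repro builds a plain fp32 engine, TF32 (used by default for fp32 on Ampere GPUs, disabled with --noTF32) is another precision knob worth ruling out.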

yuanyao-nv avatar Apr 22 '25 21:04 yuanyao-nv

@yuanyao-nv It appears that @WoodieDudy does convert LayerNorms to fp32 before the export: https://gist.github.com/WoodieDudy/f91209ff64d3d84e1fab7d8860f18d42#file-export_to_onnx-py-L407-L410 ...

But maybe that conversion happens only after the LayerNorm weights have already been cast to fp16 in https://gist.github.com/WoodieDudy/f91209ff64d3d84e1fab7d8860f18d42#file-export_to_onnx-py-L386, while the LayerNorm computation itself still seems to run in fp32, at least in the PyTorch code...

The accuracy drop is a bit strange given that an NVIDIA NeMo model is being used in the first place. I wonder if such an ONNX export / ORT+TRT accuracy test should be upstreamed to NVIDIA NeMo?
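
For completeness, a minimal sketch of the pattern being discussed, where model is just a placeholder for the NeMo module exported in the gist (the order matters: cast to half first, then restore LayerNorm to fp32):

import torch

# Hypothetical illustration of the gist's apparent intent: fp16 everywhere except LayerNorm.
model = model.half()
for module in model.modules():
    if isinstance(module, torch.nn.LayerNorm):
        # Keep LayerNorm parameters in fp32; depending on the model, the input may
        # also need to be cast to fp32 inside LayerNorm's forward.
        module.float()
# ... then torch.onnx.export(model, ...) as in the gist.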

vadimkantorov avatar Apr 23 '25 22:04 vadimkantorov

Issue has not received an update in over 14 days. Adding stale label. Please note the issue will be closed in 14 days after being marked stale if there is no update.

github-actions[bot] avatar Jun 18 '25 23:06 github-actions[bot]

Stale bump...

vadimkantorov avatar Jun 18 '25 23:06 vadimkantorov