TensorRT Provider Vs TensorRT Native
Hello, I want to ask some questions about benchmark performance and inference time for ONNX Runtime with the TensorRT provider versus native TensorRT. Will ONNX Runtime be slower or perform worse than native TensorRT? This is the first time I am using ONNX Runtime with TensorRT; I usually use the CUDA provider. Thanks
The TensorRT EP can achieve performance parity with native TensorRT. One of the benefits of using the TensorRT EP is that it can run models that can't run in native TensorRT because they contain ops TensorRT doesn't support. ONNX Runtime automatically falls back to other EPs, such as CUDA or CPU, for those ops.
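A minimal sketch of what that fallback setup looks like from the Python API: listing the TensorRT EP first and CUDA/CPU after it lets ONNX Runtime assign TensorRT-supported subgraphs to TensorRT and the rest to the later providers. The model path below is a placeholder.

```python
import onnxruntime as ort

# Provider order is priority order: nodes TensorRT can't handle
# fall back to CUDA, and then to CPU.
providers = [
    "TensorrtExecutionProvider",
    "CUDAExecutionProvider",
    "CPUExecutionProvider",
]

# "model.onnx" is a placeholder path.
sess = ort.InferenceSession("model.onnx", providers=providers)

# Show which providers were actually registered for this session.
print(sess.get_providers())
```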
Is there any benchmark for it?
Any reproducible benchmark on this?
@pommedeterresautee Maybe you have some insights on this? I read https://github.com/ELS-RD/transformer-deploy/blob/d397869e95ee07570c47edefec01bdc673391b65/docs/faq.md#why-dont-you-support-gpu-quantization-on-onnx-runtime-instead-of-tensorrt , but it's not clear to me why ONNX Runtime + TensorrtExecutionProvider would be worse than native TensorRT, given that you start from an ONNX QDQ model. I had no issues with fp32 and int8 (static) quantized models using the TensorRT execution provider.
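A minimal sketch of what that int8 QDQ setup can look like, assuming the documented TensorRT EP options `trt_int8_enable` and `trt_fp16_enable`; the model path and cache directory are placeholders:

```python
import onnxruntime as ort

# TensorRT EP options; for QDQ models the quantization parameters come
# from the Q/DQ nodes, so no calibration table is needed.
trt_options = {
    "trt_fp16_enable": True,                  # allow fp16 kernels
    "trt_int8_enable": True,                  # honor int8 Q/DQ quantization
    "trt_engine_cache_enable": True,          # cache built engines between runs
    "trt_engine_cache_path": "./trt_cache",   # placeholder cache directory
}

providers = [
    ("TensorrtExecutionProvider", trt_options),
    "CUDAExecutionProvider",
    "CPUExecutionProvider",
]

# "model_qdq.onnx" is a placeholder for a statically quantized QDQ model.
sess = ort.InferenceSession("model_qdq.onnx", providers=providers)
```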