
Triton-OnnxRT-TRT performance issue

mayani-nv opened this issue 4 years ago

Description

I downloaded the yolov3 model weights from here. Then, using the TensorRT sample scripts, I was able to get the corresponding ONNX model file. The obtained ONNX model file is similar to the one downloaded from the ONNX model zoo (which uses the same weights but is converted using keras2onnx).

Next, I ran perf_analyzer on this ONNX model using different backends; a sketch of the command is shown below, and the results for each configuration follow.
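A minimal perf_analyzer invocation for this kind of sweep might look like the following; the model name yolov3_onnx and the gRPC endpoint are assumptions, not taken from this issue:

```
# Sweep client concurrency from 1 to 4 against the deployed model
# (model name and endpoint are placeholders)
perf_analyzer -m yolov3_onnx -u localhost:8001 -i grpc --concurrency-range 1:4
```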

  1. Triton-ONNXRT-CUDA: used the .onnx model file, ran it with the onnxruntime backend, and got the following output:
Inferences/Second vs. Client Average Batch Latency
Concurrency: 1, throughput: 0.6 infer/sec, latency 1498616 usec
Concurrency: 2, throughput: 0.8 infer/sec, latency 2237485 usec
Concurrency: 3, throughput: 0.6 infer/sec, latency 3406846 usec
Concurrency: 4, throughput: 0.6 infer/sec, latency 4570913 usec
  2. Triton-ONNXRT-TRT: used the same .onnx model file but added the GPU execution accelerator as tensorrt (still ran with the onnxruntime backend; see the config sketch after this list) and got the following output:
Inferences/Second vs. Client Average Batch Latency
Concurrency: 1, throughput: 1.2 infer/sec, latency 854637 usec
Concurrency: 2, throughput: 2 infer/sec, latency 1011748 usec
Concurrency: 3, throughput: 1.8 infer/sec, latency 1516845 usec
Concurrency: 4, throughput: 1.8 infer/sec, latency 2023850 usec
  3. Triton-TRT: converted the .onnx file to a TensorRT engine and ran it with the tensorrt backend (see the note after this list), getting the following:
Inferences/Second vs. Client Average Batch Latency
Concurrency: 1, throughput: 34.4 infer/sec, latency 29134 usec
Concurrency: 2, throughput: 66 infer/sec, latency 30218 usec
Concurrency: 3, throughput: 64.6 infer/sec, latency 46344 usec
Concurrency: 4, throughput: 70.8 infer/sec, latency 56346 usec
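For context on how scenario 2 differs from scenario 1, the TensorRT execution accelerator for the ONNX Runtime backend is requested in the model's config.pbtxt. A minimal sketch, assuming a model named yolov3 served from an ONNX file (the name and max_batch_size are illustrative only):

```
name: "yolov3"
platform: "onnxruntime_onnx"
max_batch_size: 1
optimization {
  execution_accelerators {
    # Run the ONNX model through ONNX Runtime's TensorRT execution provider
    gpu_execution_accelerator : [ { name : "tensorrt" } ]
  }
}
```

For scenario 3, the ONNX model first has to be converted to a serialized TensorRT engine; one common way is trtexec (for example, `trtexec --onnx=yolov3.onnx --saveEngine=model.plan`), after which the resulting plan file is served by Triton's tensorrt backend.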

Why is the performance of the Triton-OnnxRT-TRT backend so slow compared to the Triton-TRT backend? I used a Quadro RTX 8000 (same Turing architecture as the T4) for this experiment.

Triton Information

NGC container v20.12

mayani-nv avatar Feb 19 '21 19:02 mayani-nv

By the way, @mayani-nv communicated to me that all experimental results are in FP32.

ppyun avatar Feb 26 '21 00:02 ppyun

@mayani-nv: This PR https://github.com/triton-inference-server/onnxruntime_backend/pull/42, which enables I/O binding, should help with perf. Can you run your tests again once it is checked in?

I have not done any perf profiling for this model, so I can't say for sure whether this PR will bring perf on par, but it should definitely help.

Did you mean container version 21.02?

askhade avatar May 28 '21 16:05 askhade

I tried running the above tests with the Triton v21.09 container, using ORT-TRT in Triton with FP32 enabled, and got the following:

Concurrency: 1, throughput: 0.8 infer/sec, latency 1252700 usec
Concurrency: 2, throughput: 1.1 infer/sec, latency 1842821 usec
Concurrency: 3, throughput: 1 infer/sec, latency 2780213 usec
Concurrency: 4, throughput: 1 infer/sec, latency 3710178 usec

I also tried ORT-TRT in Triton with FP16 enabled (see the config sketch after these numbers) and got the following:

Concurrency: 1, throughput: 1.42857 infer/sec, latency 718673 usec
Concurrency: 2, throughput: 2.42857 infer/sec, latency 817400 usec
Concurrency: 3, throughput: 2.42857 infer/sec, latency 1229651 usec
Concurrency: 4, throughput: 2.42857 infer/sec, latency 1644229 usec
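For reference, FP16 on the ORT-TRT path is selected through the TensorRT accelerator parameters in config.pbtxt. A minimal sketch along the lines of the onnxruntime_backend README; the workspace size is only an example value:

```
optimization {
  execution_accelerators {
    gpu_execution_accelerator : [ {
      name : "tensorrt"
      # Allow TensorRT to build FP16 engines
      parameters { key: "precision_mode" value: "FP16" }
      # Example workspace limit (1 GiB); tune for the model
      parameters { key: "max_workspace_size_bytes" value: "1073741824" }
    } ]
  }
}
```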

I am not sure why the speed with ORT-TRT in Triton (whether FP16 or FP32) is still considerably slower than inferencing with the pure TRT model. Is this expected behavior?

mayani-nv avatar Nov 05 '21 20:11 mayani-nv

Noting that I also see much lower performance from ORT-TRT than TRT outside of Triton.

rgov avatar Apr 29 '22 15:04 rgov