TensorRT slower than ONNX
Description
I benchmarked my model on Triton with perf_analyzer, once as ONNX and once as a TensorRT engine (I didn't test them any other way because I don't know how). The TensorRT version came out slower, which is strange because it should be an optimized version of the same ONNX model. One oddity I noticed (which may or may not be the cause) is that the "compute input" time is much worse for TensorRT than for ONNX: for ONNX it was usually between 150-200 usec, while for TensorRT it was between 2000-5500 usec (usually around 4000). Another oddity: when I test with higher concurrency, the ONNX model's "compute input", "compute output" and "compute infer" times increase a bit, but for TensorRT it is the opposite (maybe because at higher concurrency the dynamic batcher is more consistently filled to the maximum batch size, which is the shape the TensorRT engine was optimized for?).
Triton Information
Triton version: 22.01, using the Triton container (not built from source).
Expected behavior
I expected TensorRT to be at least somewhat faster than ONNX, but it is actually slightly slower.
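Side note for anyone hitting a similar gap: one way to separate the engine itself from Triton's input handling is to benchmark the .plan directly with trtexec. A minimal sketch follows; the engine path, the input name and the 16x80x200 shape are assumptions based on the shapes reported in this thread (16 stands in for whatever maximum batch size the engine was built with), and the Conformer export may have additional inputs that would also need entries in --shapes.

    # Benchmark the serialized TensorRT engine outside of Triton (paths and shapes assumed).
    # If the per-batch latency reported here is far below Triton's "compute input" + "compute infer",
    # the extra time is being spent preparing inputs on the Triton side rather than in the engine.
    trtexec --loadEngine=model.plan --shapes=audio_signal:16x80x200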
@naor2013 do you mean you used the TensorRT backend, or the TensorRT optimization option for the ONNX Runtime backend? Can you share the model and the perf_analyzer configuration you used?
@CoderHam I used the TensorRT backend. The model is NVIDIA's Conformer pre-trained model: https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/stt_en_conformer_ctc_small
The .onnx and .plan files and the .pbtxt files can be found here: https://ufile.io/f/fxyta
The perf_analyzer configuration I used is: perf_analyzer -m model --shape audio_signal:80,200 -v --concurrency-range 24:32:8 -i grpc. I tried it with and without gRPC, with async, and with the C API, all with similar results.
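For reference, a sketch of the variants described above (gRPC vs. HTTP, sync vs. async, and the in-process C API), reusing the model name and shape from the command in this thread. The C API paths are placeholders, and the exact flags should be checked against the perf_analyzer shipped in the 22.01 container:

    # gRPC, synchronous (as above)
    perf_analyzer -m model --shape audio_signal:80,200 -v --concurrency-range 24:32:8 -i grpc
    # gRPC, asynchronous
    perf_analyzer -m model --shape audio_signal:80,200 -v --concurrency-range 24:32:8 -i grpc -a
    # HTTP, synchronous
    perf_analyzer -m model --shape audio_signal:80,200 -v --concurrency-range 24:32:8 -i http
    # In-process C API mode (server directory and model repository paths are placeholders)
    perf_analyzer -m model --shape audio_signal:80,200 -v --concurrency-range 24:32:8 \
        --service-kind=triton_c_api --triton-server-directory=/opt/tritonserver \
        --model-repository=/path/to/model_repository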
@naor2013 thank you for sharing these. Can you also share the logs from perf_analyzer that show the throughput and the detailed latency breakdown?
@CoderHam Sure.
ONNX:

Request concurrency: 24
  Client:
    Request count: 7639
    Throughput: 763.9 infer/sec
    Avg latency: 31348 usec (standard deviation 4727 usec)
    p50 latency: 32541 usec
    p90 latency: 33637 usec
    p95 latency: 35001 usec
    p99 latency: 45777 usec
  Server:
    Inference count: 9137
    Execution count: 706
    Successful request count: 706
    Avg request latency: 7120124 usec (overhead 0 usec + queue 45005703 usec + compute input 176 usec + compute infer 21330 usec + compute output 51 usec)

Request concurrency: 32
  Client:
    Request count: 8409
    Throughput: 840.9 infer/sec
    Avg latency: 38077 usec (standard deviation 7038 usec)
    p50 latency: 38657 usec
    p90 latency: 44181 usec
    p95 latency: 53267 usec
    p99 latency: 54884 usec
  Server:
    Inference count: 10077
    Execution count: 576
    Successful request count: 576
    Avg request latency: 19389634 usec (overhead 0 usec + queue 53803964 usec + compute input 191 usec + compute infer 23385 usec + compute output 55 usec)

Inferences/Second vs. Client Average Batch Latency
  Concurrency: 24, throughput: 763.9 infer/sec, latency 31348 usec
  Concurrency: 32, throughput: 840.9 infer/sec, latency 38077 usec

TensorRT:

Request concurrency: 24
  Client:
    Request count: 6456
    Throughput: 645.6 infer/sec
    Avg latency: 37165 usec (standard deviation 311 usec)
    p50 latency: 37136 usec
    p90 latency: 37519 usec
    p95 latency: 37615 usec
    p99 latency: 37829 usec
  Server:
    Inference count: 7752
    Execution count: 646
    Successful request count: 646
    Avg request latency: 5556271 usec (overhead 0 usec + queue 11076610 usec + compute input 4505 usec + compute infer 24551 usec + compute output 63 usec)

Request concurrency: 32
  Client:
    Request count: 7168
    Throughput: 716.8 infer/sec
    Avg latency: 44555 usec (standard deviation 839 usec)
    p50 latency: 44506 usec
    p90 latency: 45349 usec
    p95 latency: 45648 usec
    p99 latency: 46579 usec
  Server:
    Inference count: 8608
    Execution count: 538
    Successful request count: 538
    Avg request latency: 2507445 usec (overhead 0 usec + queue 17499290 usec + compute input 4010 usec + compute infer 23843 usec + compute output 48 usec)

Inferences/Second vs. Client Average Batch Latency
  Concurrency: 24, throughput: 645.6 infer/sec, latency 37165 usec
  Concurrency: 32, throughput: 716.8 infer/sec, latency 44555 usec
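For anyone reproducing this: the server-side breakdown above can also be pulled directly from Triton's statistics extension, independently of perf_analyzer. A minimal sketch, assuming the default HTTP port 8000 and a model named "model":

    # Cumulative per-model statistics; the response breaks latency into
    # queue / compute_input / compute_infer / compute_output (in nanoseconds),
    # along with inference and execution counts for computing per-batch averages.
    curl -s localhost:8000/v2/models/model/stats | python3 -m json.tool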
@naor2013 thank you for sharing all the information. We will investigate this once we have the bandwidth to do so.
@CoderHam Any update on this? I found a similar issue.
Even when I don't enable dynamic batching, my queue time is very high when I use TensorRT.
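Queue time without dynamic batching usually just means requests are waiting for a free model instance. One quick check (a sketch, reusing the shape from earlier in this thread; substitute your TensorRT model's name) is to rerun perf_analyzer with a single outstanding request, where the queue should stay close to empty:

    # With concurrency 1 there is never more than one request waiting, so if the
    # reported queue time is still high, the queueing is not caused by load or batching.
    perf_analyzer -m model --shape audio_signal:80,200 -v --concurrency-range 1:1 -i grpc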
Hello, could you please share the configuration you used for the ONNX model? Unfortunately, the files at this link (https://ufile.io/f/fxyta) are no longer available.