onnxruntime_backend
Model with Triton Inference Server is 3x slower than the model in ORT directly (using GPU in both)
Description: I run the same model on Triton Inference Server and on ORT directly. The inference time on Triton Inference Server is 3 ms, but it is 1 ms on ORT. In addition, there isn't any communication overhead while running the model on Triton Inference Server.
Triton Information: The Triton version I used is 22.01 and the ORT-GPU version is 1.9.0.
I also used the Docker image.
Expected behavior: The inference time in both scenarios should be the same.
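For reference, a minimal sketch of how the "ORT directly" baseline can be timed with the ONNX Runtime Python API. The model path, input name, and input shape below are placeholders (not taken from the report), and the CUDA execution provider is assumed to be available:

```python
import time

import numpy as np
import onnxruntime as ort

# Placeholder model path and input spec; replace with your own.
session = ort.InferenceSession(
    "model.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
data = np.random.rand(1, 3, 224, 224).astype(np.float32)
inputs = {"input": data}

# Warm up so one-time CUDA setup is not counted in the measurement.
for _ in range(10):
    session.run(None, inputs)

runs = 100
start = time.perf_counter()
for _ in range(runs):
    session.run(None, inputs)
elapsed = (time.perf_counter() - start) / runs
print(f"mean latency: {elapsed * 1e3:.3f} ms")
```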
I have the same issue: Triton is 4x slower than ONNX Runtime with the same model on the same machine.
- Do you have a repro?
- Have you tried using a recent version of both Triton and the ORT-GPU package? 4x seems like a significant difference, and without a repro it will be difficult to investigate.
Hi, I just figured it out: triton_client actually needs 0.02-0.03 s to serialize the input data for my model, while the inference time of that model is only around 0.008 s. That is why it is 4x slower when serving with Triton.
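A sketch of how that client-side cost can be separated from the server round trip with the `tritonclient` HTTP API. The model name "my_model", the input name "input", the output name "output", and the input shape are assumptions for illustration, not values from this thread:

```python
import time

import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")
data = np.random.rand(1, 3, 224, 224).astype(np.float32)

# Time the client-side serialization (numpy array -> request payload) on its own.
start = time.perf_counter()
inp = httpclient.InferInput("input", list(data.shape), "FP32")
inp.set_data_from_numpy(data)
serialize_s = time.perf_counter() - start

# Time the actual request/response round trip to the server.
start = time.perf_counter()
result = client.infer(
    "my_model",
    inputs=[inp],
    outputs=[httpclient.InferRequestedOutput("output")],
)
infer_s = time.perf_counter() - start

print(f"serialization: {serialize_s * 1e3:.2f} ms, round trip: {infer_s * 1e3:.2f} ms")
```

Running perf_analyzer against the same model, or passing inputs via the shared-memory support in tritonclient, are other ways to isolate or reduce this client-side overhead.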