
Model with Triton Inference Server is 3x slower than the model in ORT directly (using GPU in both)

Open farzanehnakhaee70 opened this issue 3 years ago • 3 comments

Description

I ran the model on Triton Inference Server and also on ORT directly. Inference time on Triton Inference Server is 3 ms, but it is 1 ms on ORT. In addition, there isn't any communication overhead while running the model on Triton Inference Server.

Triton Information

The Triton version I used is 22.01 and the ORT-GPU version is 1.9.0.

I also used the Docker image.

Expected behavior

The inference time should be the same in both scenarios.
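For reference, a minimal sketch of how such a comparison might be set up. The model path, model name, input name, shape, and dtype below are placeholders, not the actual model from this issue:

```python
import time
import numpy as np
import onnxruntime as ort
import tritonclient.http as httpclient

# Placeholder input; shape and dtype depend on the actual model.
data = np.random.rand(1, 3, 224, 224).astype(np.float32)

# --- Direct ORT on GPU ---
sess = ort.InferenceSession("model.onnx", providers=["CUDAExecutionProvider"])
sess.run(None, {"input": data})  # warm-up
start = time.perf_counter()
for _ in range(100):
    sess.run(None, {"input": data})
print("ORT avg (ms):", (time.perf_counter() - start) / 100 * 1000)

# --- Triton via the HTTP client ---
client = httpclient.InferenceServerClient(url="localhost:8000")
inp = httpclient.InferInput("input", list(data.shape), "FP32")
inp.set_data_from_numpy(data, binary_data=True)
client.infer("my_model", inputs=[inp])  # warm-up
start = time.perf_counter()
for _ in range(100):
    client.infer("my_model", inputs=[inp])
print("Triton avg (ms):", (time.perf_counter() - start) / 100 * 1000)
```

Note that the Triton timing includes client-side serialization and HTTP round-trip time, not just model execution on the server.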

farzanehnakhaee70 avatar May 04 '22 14:05 farzanehnakhaee70

I have the same issue: Triton is 4x slower than ONNX Runtime with the same model on the same machine.

vulong3896 avatar Mar 05 '23 02:03 vulong3896

  1. Do you have a repro?
  2. Have you tried using a recent version of both Triton and the ORT-GPU package? 4x is a significant difference, and without a repro it'll be difficult to investigate.

pranavsharma avatar Mar 06 '23 22:03 pranavsharma

Hi, I just figured it out. The Triton client needs 0.02-0.03 s to serialize the input data for my model, while the inference time of the model itself is only around 0.008 s. That's why it's 4x slower when serving with Triton.
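A quick sketch of how that breakdown can be measured with the Triton HTTP client (model name, input name, and shape are hypothetical placeholders):

```python
import time
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")
data = np.random.rand(1, 3, 224, 224).astype(np.float32)  # placeholder input

t0 = time.perf_counter()
inp = httpclient.InferInput("input", list(data.shape), "FP32")
inp.set_data_from_numpy(data, binary_data=True)   # client-side serialization
t1 = time.perf_counter()
result = client.infer("my_model", inputs=[inp])   # request + server-side inference
t2 = time.perf_counter()

print(f"serialize: {t1 - t0:.4f} s, infer: {t2 - t1:.4f} s")
```

If serialization dominates, using the gRPC client or Triton's shared-memory extensions can reduce that client-side overhead.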

vulong3896 avatar Mar 08 '23 01:03 vulong3896