onnxruntime_backend
Model with Triton Inference Server is 3x slower than the model in ORT directly (using GPU in both)
Description: I run the same model on Triton Inference Server and on ORT directly. The inference time on Triton Inference Server is 3 ms, but it is 1 ms on ORT. In addition, there isn't any communication overhead while running the model on Triton Inference Server.
Triton Information: The Triton version I used is 22.01 and the ORT-GPU version is 1.9.0.
I also used the Docker image.
Expected behavior: The inference time in both scenarios should be the same.
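For reference, a minimal sketch of how the "ORT directly" baseline can be timed with the ONNX Runtime Python API. The model path, input name, and input shape below are placeholders (not taken from the report), and the CUDA execution provider is assumed to be available:

```python
import time

import numpy as np
import onnxruntime as ort

# Placeholder model path and input spec; replace with your own.
session = ort.InferenceSession(
    "model.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
data = np.random.rand(1, 3, 224, 224).astype(np.float32)
inputs = {"input": data}

# Warm up so one-time CUDA setup is not counted in the measurement.
for _ in range(10):
    session.run(None, inputs)

runs = 100
start = time.perf_counter()
for _ in range(runs):
    session.run(None, inputs)
elapsed = (time.perf_counter() - start) / runs
print(f"mean latency: {elapsed * 1e3:.3f} ms")
```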
I have the same issue: Triton is 4x slower than ONNX Runtime with the same model on the same machine.
- Do you have a repro?
- Have you tried using a recent version of both Triton and the ORT-GPU package? 4x seems like a significant difference, and without a repro it will be difficult to investigate.
Hi, I just figured it out: triton_client actually needs 0.02-0.03 s to serialize the input data for my model, while the inference time of that model is only around 0.008 s. That is why it is 4x slower when serving with Triton.
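A sketch of how that client-side cost can be separated from the server round trip with the `tritonclient` HTTP API. The model name "my_model", the input name "input", the output name "output", and the input shape are assumptions for illustration, not values from this thread:

```python
import time

import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")
data = np.random.rand(1, 3, 224, 224).astype(np.float32)

# Time the client-side serialization (numpy array -> request payload) on its own.
start = time.perf_counter()
inp = httpclient.InferInput("input", list(data.shape), "FP32")
inp.set_data_from_numpy(data)
serialize_s = time.perf_counter() - start

# Time the actual request/response round trip to the server.
start = time.perf_counter()
result = client.infer(
    "my_model",
    inputs=[inp],
    outputs=[httpclient.InferRequestedOutput("output")],
)
infer_s = time.perf_counter() - start

print(f"serialization: {serialize_s * 1e3:.2f} ms, round trip: {infer_s * 1e3:.2f} ms")
```

Running perf_analyzer against the same model, or passing inputs via the shared-memory support in tritonclient, are other ways to isolate or reduce this client-side overhead.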