
Inconsistent prediction results using ONNX backend with TensorRT enabled

fangpings opened this issue on Aug 16, 2024 · 0 comments

Description We observed that with TensorRT enabled, the ONNX backend gives inconsistent results when many concurrent requests come in: the same input can end up with completely different predictions. With TensorRT disabled, we did not see this issue. We also noticed that this behavior only happens when max_batch_size is larger than 1.

Triton Information What version of Triton are you using?

  • 24.05

Are you using the Triton container or did you build it yourself?

  • We are using the out-of-the-box Triton container

To Reproduce Steps to reproduce the behavior.

Describe the models (framework, inputs, outputs), ideally include the model configuration file (if using an ensemble include the model configuration file for that as well).

This is our inference config file. Our model was converted from PyTorch to ONNX.

name: "page_classification_inference"
backend: "onnxruntime"
max_batch_size: 64


optimization { execution_accelerators {
  gpu_execution_accelerator : [ {
    name : "tensorrt"
    parameters { key: "precision_mode" value: "FP16" }
    parameters { key: "max_workspace_size_bytes" value: "1073741824" }}
  ]
}}
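
For the TensorRT-disabled comparison described below, the only change is dropping the optimization block, so the ONNX Runtime backend falls back to its default GPU execution. A sketch, with everything else unchanged:

name: "page_classification_inference"
backend: "onnxruntime"
max_batch_size: 64
# no optimization / execution_accelerators block -> default ONNX Runtime execution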

We then ran a simple inference job on the same set of documents twice, collected the responses, and compared them.
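
A minimal sketch of such a comparison harness (not our exact script). It assumes the HTTP endpoint on localhost:8000 and a single float32 input/output pair named INPUT/OUTPUT; the real tensor names come from the ONNX model and are not shown here.

import numpy as np
import tritonclient.http as httpclient

MODEL = "page_classification_inference"

def run_once(docs, concurrency=32):
    # Send every document as its own request, with many requests in flight at once.
    client = httpclient.InferenceServerClient(url="localhost:8000",
                                              concurrency=concurrency)
    pending = []
    for doc in docs:  # each doc: np.float32 array of shape (1, feature_dim)
        inp = httpclient.InferInput("INPUT", list(doc.shape), "FP32")  # hypothetical tensor name
        inp.set_data_from_numpy(doc)
        pending.append(client.async_infer(MODEL, inputs=[inp]))
    # Collect raw outputs; the real comparison was done on the predicted labels.
    outputs = [p.get_result().as_numpy("OUTPUT") for p in pending]  # hypothetical tensor name
    client.close()
    return outputs

# Placeholder inputs; the real run used the same 137384 documents both times.
docs = [np.random.rand(1, 128).astype(np.float32) for _ in range(1000)]
run1 = run_once(docs)
run2 = run_once(docs)
same = sum(int(np.array_equal(a, b)) for a, b in zip(run1, run2))
print(f"{same} / {len(docs)} documents got identical results across the two runs")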

Among 137384 documents, only 69536 had the same predictions across the two runs.

For comparison, we also ran the experiment with TensorRT disabled, and 137345 documents had the same prediction results.

We also noticed that the issue goes away when max_batch_size is set to 1 (see the sketch below).
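
Concretely, the variant that behaved consistently only differs in the batch size (sketch):

name: "page_classification_inference"
backend: "onnxruntime"
max_batch_size: 1
# optimization block with the tensorrt accelerator kept exactly as above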

Expected behavior With TensorRT enabled, the server should give consistent results for the same input.
