
CPU inference is much slower than with ONNX Runtime directly

artmatsak opened this issue on Mar 19 '21 · 10 comments

Description

Our Electra-based model takes about 540 ms per inference on CPU with ONNX Runtime (via the mcr.microsoft.com/azureml/onnxruntime:v1.4.0 container). The same model run through Triton r21.02 takes 1000+ ms on average. We've also tried Triton r20.09, with the same result.
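A minimal sketch of how such a direct-ORT CPU baseline can be timed, assuming a hypothetical model.onnx, Electra-style int64 inputs with made-up shapes, and a recent onnxruntime Python API (the reporter's actual model and inputs are not shared):

```python
import time

import numpy as np
import onnxruntime as ort

# Hypothetical model path; the actual model is not shared.
sess = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])

# Dummy Electra-style inputs; batch size 1 and seq_len 128 are assumptions.
feed = {
    "input_ids": np.zeros((1, 128), dtype=np.int64),
    "attention_mask": np.ones((1, 128), dtype=np.int64),
    "token_type_ids": np.zeros((1, 128), dtype=np.int64),
}

sess.run(None, feed)  # warm-up run, excluded from timing

n = 50
start = time.perf_counter()
for _ in range(n):
    sess.run(None, feed)
print(f"avg latency: {(time.perf_counter() - start) / n * 1000:.1f} ms")
```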

Triton Information

21.02

Are you using the Triton container or did you build it yourself? Container, nvcr.io/nvidia/tritonserver:21.02-py3 and nvcr.io/nvidia/tritonserver:20.09-py3.

To Reproduce

I cannot share the full model, but it is a PyTorch Transformer-based model exported from HuggingFace to ONNX.

Expected behavior

The inference time on CPU in Triton should be about the same as in ONNX Runtime directly.

artmatsak · Mar 19 '21

Will investigate this and report back on my findings

askhade · Sep 13 '21

Hi @artmatsak, here are the results of my experiment on the CPU performance issue with a certain BERT model (bert-squad):

| Approach | Latency (ms) |
| --- | --- |
| Standalone ORT perf_test, GPU | 15.12 |
| Triton r21.08, GPU | 13.7 |
| Standalone ORT perf_test, CPU | 223.168 |
| Triton r21.08, CPU | 666 |
| Triton r21.08, CPU (remove thread=1) | 227 |

Previously, the Triton ORT backend always set the number of threads used to parallelize execution to 1. This has been fixed: https://github.com/triton-inference-server/onnxruntime_backend/pull/67. The fix is included in the recent 21.09 release. As the table shows, removing the thread=1 setting brings Triton's CPU performance at least close to standalone ORT on CPU.
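For context, the thread setting in question corresponds to ORT's intra-op thread pool. A minimal sketch of the ORT-side equivalent, assuming a recent onnxruntime Python API (a value of 0 lets ORT size the pool from the machine's core count):

```python
import onnxruntime as ort

so = ort.SessionOptions()
# 0 = let ORT pick the pool size (typically one thread per physical core).
# The pre-fix backend effectively pinned this to 1, serializing CPU ops.
so.intra_op_num_threads = 0
so.inter_op_num_threads = 0

sess = ort.InferenceSession("model.onnx", sess_options=so,
                            providers=["CPUExecutionProvider"])
```

In Triton itself, the corresponding knobs are set through config.pbtxt parameters; the backend README documents them.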

However, you hit this issue with Triton 21.02, when the Triton ORT backend still used OpenMP and the thread-number setting had no effect. Things may be quite different now (the backend no longer uses OpenMP). Could you please check whether the latest 21.09 release resolves the CPU performance issue with your model? Let's confirm whether this issue has been fixed in the latest build. Thank you!

jcwchen · Oct 04 '21

I'm using Triton ORT 21.09-py3 and I have the same problem. When I run the perf test, CPU usage is around 100%, but QPS does not improve as expected.

johnsGuo · Nov 09 '21

@GuoGuiRong Are you comparing ORT with Triton-ORT? Can you add more details regarding:

  1. The ORT and Triton-ORT configs used during testing
  2. What perf diff you are seeing

askhade · Dec 01 '21

Hi, I'm getting the same issue; running ORT directly is about 3x faster. I am using the HuggingFace transformers.onnx library to convert the model to ONNX, and I run it with the onnxruntime Python library.

The Triton model config is:

```
name: "paraphrase-MiniLM-L6-v2"
platform: "onnxruntime_onnx"
max_batch_size: 0

input [
  { name: "input_ids" data_type: TYPE_INT64 dims: [ -1, -1 ] },
  { name: "token_type_ids" data_type: TYPE_INT64 dims: [ -1, -1 ] },
  { name: "attention_mask" data_type: TYPE_INT64 dims: [ -1, -1 ] }
]

output {
  name: "last_hidden_state"
  data_type: TYPE_FP32
  dims: [ -1, -1, -1 ]
}
```
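For reference, a minimal client-side call against that config could look like the sketch below, using the tritonclient HTTP API; the server URL, batch size, and sequence length are assumptions:

```python
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

seq_len = 128  # assumed
inputs = []
for name in ("input_ids", "token_type_ids", "attention_mask"):
    arr = np.zeros((1, seq_len), dtype=np.int64)
    inp = httpclient.InferInput(name, [1, seq_len], "INT64")
    inp.set_data_from_numpy(arr)
    inputs.append(inp)

result = client.infer("paraphrase-MiniLM-L6-v2", inputs)
print(result.as_numpy("last_hidden_state").shape)  # (1, seq_len, hidden)
```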

I have tried the various optimisation parameters suggested in the backend repo, but these seem to make the performance worse.
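When comparing the two, it is also worth pinning the standalone ORT baseline to the same graph-optimization level as the Triton deployment, so the numbers are apples to apples. A minimal sketch of setting it explicitly on the ORT side (the model filename is an assumption; ORT_ENABLE_ALL is the default level):

```python
import onnxruntime as ort

so = ort.SessionOptions()
# Match whatever optimization level the Triton config uses.
so.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL

sess = ort.InferenceSession("paraphrase-MiniLM-L6-v2.onnx", sess_options=so,
                            providers=["CPUExecutionProvider"])
```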

bezdomniy · Feb 15 '22

Is there any update on this issue?

farzanehnakhaee70 · May 04 '22

> (quoting @jcwchen's benchmark comment from Oct 04 '21 above)

Thanks for sharing your findings. In your table, there is roughly a 1.4 ms difference (13.7 ms vs. 15.12 ms) between GPU inference time under Triton and under standalone ORT, with Triton being faster. Does anybody know why this difference exists?

farzanehnakhaee70 · May 04 '22