
[Performance] Dynamic Shape performance

Open SWHL opened this issue 3 years ago • 4 comments

Describe the issue

  • I am using onnxruntime to run inference on CPU and GPU. The model input has a dynamic shape.
  • On both CPU and GPU, onnxruntime inference time with a static-shape input is shorter than with a dynamic-shape input.
  • Is there a way to optimize the inference time of the model in the case of dynamic input?

To reproduce

import numpy as np
import onnxruntime as ort
from tqdm import tqdm
import time

class TestOrtInfer(object):
    def __init__(self, onnx_path, batch_size=1, total_samples=1000):
        self.onnx_path = onnx_path
        self.total_samples = total_samples
        self.batch_size = batch_size
        # Placeholder input; infer() regenerates self.x for each run.
        self.x = np.random.randn(batch_size, 3, 224, 224).astype(np.float32)

    def init_session(self, use_gpu=False):
        self.use_gpu = use_gpu
        if self.use_gpu:
            exproviders = ["CUDAExecutionProvider", "CPUExecutionProvider"]
        else:
            exproviders = ["CPUExecutionProvider"]

        self.ort_session = ort.InferenceSession(self.onnx_path,
                                                providers=exproviders)
        self.input_name = self.ort_session.get_inputs()[0].name
        self.output_name = self.ort_session.get_outputs()[0].name

    def infer(self, is_dynamic=False):
        latency = []
        print('Number of runs:', self.total_samples)
        for i in tqdm(range(self.total_samples)):
            if is_dynamic:
                # Random H and W in [128, 1024), snapped to the nearest multiple of 32.
                w = np.random.randint(128, 1024)
                w = int(round(w / 32) * 32)

                h = np.random.randint(128, 1024)
                h = int(round(h / 32) * 32)
            else:
                h, w = 576, 576

            self.x = np.random.randn(self.batch_size, 3, h, w).astype(np.float32)

            t0 = time.time()
            self.ort_session.run(None, {self.input_name: self.x})
            latency.append(time.time() - t0)

        avg_time = sum(latency) * 1000 / len(latency)
        device = 'GPU' if self.use_gpu else 'CPU'
        print(f"Average onnxruntime {device} " \
              f"Inference time = {avg_time:.2f} ms")


onnx_path = 'OCRv3_det_infer.onnx'
tester = TestOrtInfer(onnx_path, batch_size=1, total_samples=100)

# CPU Inference
tester.init_session(use_gpu=False)
tester.infer(is_dynamic=False)
tester.infer(is_dynamic=True)

# GPU Inference
tester.init_session(use_gpu=True)
tester.infer(is_dynamic=False)
tester.infer(is_dynamic=True)
The result:

    Device  Model                 Input shape        Loops  Average cost
    CPU     OCRv3_det_infer.onnx  1x3x576x576        100    283.94 ms
    CPU     OCRv3_det_infer.onnx  1x3xHxW (dynamic)  100    321.17 ms
    GPU     OCRv3_det_infer.onnx  1x3x576x576        100    11.71 ms
    GPU     OCRv3_det_infer.onnx  1x3xHxW (dynamic)  100    445.36 ms

Urgency

No response

Platform

Linux

OS Version

Ubuntu

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

1.12.1

ONNX Runtime API

Python

Architecture

X64

Execution Provider

CUDA

Execution Provider Library Version

CUDA 11.2

Model File

OCRv3_det_infer.zip

Is this a quantized model?

No

SWHL · Oct 02 '22 08:10
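
A known contributor to the GPU gap above: by default the CUDA execution provider runs cuDNN's exhaustive convolution algorithm search, and it repeats that search whenever a convolution sees a new input shape, which is exactly what happens when the shape changes every run. The documented cudnn_conv_algo_search provider option can switch this to a cheaper mode. A minimal sketch (the model path is the one from the repro; gains will vary by model and GPU):

import numpy as np
import onnxruntime as ort

# Use a heuristic cuDNN algorithm search instead of the default EXHAUSTIVE one,
# so a new convolution input shape does not trigger a full benchmark of every algorithm.
cuda_options = {"cudnn_conv_algo_search": "HEURISTIC"}  # or "DEFAULT"
session = ort.InferenceSession(
    "OCRv3_det_infer.onnx",
    providers=[("CUDAExecutionProvider", cuda_options), "CPUExecutionProvider"],
)

input_name = session.get_inputs()[0].name
x = np.random.randn(1, 3, 576, 576).astype(np.float32)
session.run(None, {input_name: x})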

You can try to run the same shape 10 times and discard the time from the first run. Your number should be comparable to the static ones. If you keep changing the shape for each run, a lot of cached data will be invalidated and rebuilt.

ytaous · Oct 10 '22 23:10
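
A minimal sketch of that measurement, assuming the tester object from the repro above has already called init_session(); time_shape is a hypothetical helper, not part of the repro:

import time
import numpy as np

def time_shape(session, input_name, h, w, runs=10):
    """Run one fixed shape `runs` times; discard the first run, average the rest."""
    x = np.random.randn(1, 3, h, w).astype(np.float32)
    times = []
    for _ in range(runs):
        t0 = time.time()
        session.run(None, {input_name: x})
        times.append(time.time() - t0)
    # The first run pays the shape-specific setup cost (memory planning, algorithm
    # selection); the remaining runs reflect steady-state latency for that shape.
    return sum(times[1:]) / (runs - 1) * 1000

avg_ms = time_shape(tester.ort_session, tester.input_name, 576, 576)
print(f"Warm average: {avg_ms:.2f} ms")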

I have the same problem with the recognition model of PaddleOCR, whose input has the dynamic shape [-1, 3, 48, -1]. My suggestion is to warm up the model before doing inference; here is a snippet of the model warmup:

    ...
    def model_warmup(self, batch_size: int = 1, min_size: int = 300, max_size: int = 1500):
        """
        The recognition model has input shape [-1, 3, 48, -1].
        ONNX Runtime with CUDA support does not perform well with arbitrary input sizes,
        so we warm up the model across the range of widths expected at inference time.
        """
        log.info("Warming up model...")
        for i in tqdm(range(min_size, max_size), desc="Warming up model"):
            dummy_input = np.random.randn(batch_size, 3, 48, i).astype(np.float32)
            self.recog_session.run([self.recog_output_name], {self.recog_input_name: dummy_input})
        log.info("Model warmup completed")
    ...

ruhyadi · May 19 '23 07:05
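
A complementary approach when the full warmup loop above is too slow: pad each input's width up to the nearest bucket, so the session only ever sees a handful of distinct shapes and only those buckets need warming up. A hedged sketch with hypothetical bucket boundaries (pad_to_bucket is illustrative, not from the thread):

import numpy as np

# Hypothetical width buckets; choose boundaries that cover your real data.
WIDTH_BUCKETS = (320, 640, 960, 1280)

def pad_to_bucket(x: np.ndarray) -> np.ndarray:
    """Zero-pad an NCHW batch on the right so its width lands on the next bucket."""
    w = x.shape[3]
    target = next((b for b in WIDTH_BUCKETS if b >= w), w)  # leave oversize inputs as-is
    return np.pad(x, ((0, 0), (0, 0), (0, 0), (0, target - w)))

With only a few shapes in play, the per-shape setup cost is paid once per bucket instead of once per image. Whether zero-padding is acceptable depends on the model; verify accuracy on your own data.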

You can try to run the same shape 10 times and discard the time from the first run. Your number should be comparable to the static ones. If you keep changing the shape for each run, a lot of cached data will be invalidated and rebuilt.

@ytaous How many shape caches will ORT preserve for each model? For example, if I have one model and run inference 10 times with different input shapes, will ORT keep the cache for only the last shape, for the last N shapes, or is this decided by some other policy?

CDboyOne · Jul 19 '24 03:07
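
One way to probe this empirically rather than guessing: cycle through several distinct shapes twice and compare first-pass and second-pass latency per shape. If the second pass is fast for every shape, the caches for all of them survived; if only some are fast, that suggests how many are retained. A hedged sketch, reusing the tester object from the repro (probe_cache is a hypothetical helper):

import time
import numpy as np

def probe_cache(session, input_name, widths=(320, 416, 512, 608, 704)):
    """Time each shape on a first and a second pass to see which caches survive."""
    for label in ("first pass", "second pass"):
        for w in widths:
            x = np.random.randn(1, 3, 576, w).astype(np.float32)
            t0 = time.time()
            session.run(None, {input_name: x})
            print(f"{label}: w={w} -> {(time.time() - t0) * 1000:.2f} ms")

probe_cache(tester.ort_session, tester.input_name)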

Any updates on this issue?

PhenomenaPh · Oct 16 '24 11:10

Any updates?

Januek · Jan 26 '25 05:01

keep active

ningpp · Jul 08 '25 00:07

keep active

PhenomenaPh · Sep 15 '25 08:09

Applying stale label due to no activity in 30 days

Any update?

ningpp · Nov 15 '25 03:11

Applying stale label due to no activity in 30 days