
[Bug] Why does ONNX inference with RTMO take so long?

Open · Daanfb opened this issue on Apr 04, 2024 · 0 comments

Prerequisite

  • [X] I have searched Issues and Discussions but cannot get the expected help.
  • [X] The bug has not been fixed in the latest version (https://github.com/open-mmlab/mmpose).

Environment

04/04 12:28:41 - mmengine - INFO -

04/04 12:28:41 - mmengine - INFO - Environmental information
04/04 12:28:45 - mmengine - INFO - sys.platform: win32
04/04 12:28:45 - mmengine - INFO - Python: 3.8.19 (default, Mar 20 2024, 19:55:45) [MSC v.1916 64 bit (AMD64)]
04/04 12:28:45 - mmengine - INFO - CUDA available: True
04/04 12:28:45 - mmengine - INFO - MUSA available: False
04/04 12:28:45 - mmengine - INFO - numpy_random_seed: 2147483648
04/04 12:28:45 - mmengine - INFO - GPU 0: NVIDIA GeForce RTX 2060
04/04 12:28:45 - mmengine - INFO - CUDA_HOME: C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.6
04/04 12:28:45 - mmengine - INFO - NVCC: Cuda compilation tools, release 11.6, V11.6.55
04/04 12:28:45 - mmengine - INFO - MSVC: Microsoft (R) C/C++ Optimizing Compiler Version 19.39.33523 for x64
04/04 12:28:45 - mmengine - INFO - GCC: n/a
04/04 12:28:45 - mmengine - INFO - PyTorch: 2.2.1+cu118
04/04 12:28:45 - mmengine - INFO - PyTorch compiling details: PyTorch built with:

  • C++ Version: 201703
  • MSVC 192930151
  • Intel(R) Math Kernel Library Version 2020.0.2 Product Build 20200624 for Intel(R) 64 architecture applications
  • Intel(R) MKL-DNN v3.3.2 (Git Hash 2dc95a2ad0841e29db8b22fbccaf3e5da7992b01)
  • OpenMP 2019
  • LAPACK is enabled (usually provided by MKL)
  • CPU capability usage: AVX2
  • CUDA Runtime 11.8
  • NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_90,code=sm_90;-gencode;arch=compute_37,code=compute_37
  • CuDNN 8.7
  • Magma 2.5.4
  • Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.8, CUDNN_VERSION=8.7.0, CXX_COMPILER=C:/actions-runner/_work/pytorch/pytorch/builder/windows/tmp_bin/sccache-cl.exe, CXX_FLAGS=/DWIN32 /D_WINDOWS /GR /EHsc /Zc:__cplusplus /bigobj /FS /utf-8 -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOCUPTI -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE /wd4624 /wd4068 /wd4067 /wd4267 /wd4661 /wd4717 /wd4244 /wd4804 /wd4273, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=2.2.1, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=OFF, USE_NNPACK=OFF, USE_OPENMP=ON, USE_ROCM=OFF, USE_ROCM_KERNEL_ASSERT=OFF,

04/04 12:28:45 - mmengine - INFO - TorchVision: 0.17.1+cu118
04/04 12:28:45 - mmengine - INFO - OpenCV: 4.8.0
04/04 12:28:45 - mmengine - INFO - MMEngine: 0.10.3
04/04 12:28:45 - mmengine - INFO - MMCV: 2.1.0
04/04 12:28:45 - mmengine - INFO - MMCV Compiler: MSVC 193933523
04/04 12:28:45 - mmengine - INFO - MMCV CUDA Compiler: 11.6
04/04 12:28:45 - mmengine - INFO - MMDeploy: 1.3.1+bc75c9d
04/04 12:28:45 - mmengine - INFO -

04/04 12:28:45 - mmengine - INFO - Backend information
04/04 12:28:46 - mmengine - INFO - tensorrt: 8.6.1
04/04 12:28:46 - mmengine - INFO - tensorrt custom ops: NotAvailable
04/04 12:28:47 - mmengine - INFO - ONNXRuntime: None
04/04 12:28:47 - mmengine - INFO - ONNXRuntime-gpu: 1.16.0
04/04 12:28:47 - mmengine - INFO - ONNXRuntime custom ops: NotAvailable
04/04 12:28:47 - mmengine - INFO - pplnn: None
04/04 12:28:47 - mmengine - INFO - ncnn: None
04/04 12:28:47 - mmengine - INFO - snpe: None
04/04 12:28:47 - mmengine - INFO - openvino: None
04/04 12:28:47 - mmengine - INFO - torchscript: 2.2.1+cu118
04/04 12:28:47 - mmengine - INFO - torchscript custom ops: NotAvailable
04/04 12:28:47 - mmengine - INFO - rknn-toolkit: None
04/04 12:28:47 - mmengine - INFO - rknn-toolkit2: None
04/04 12:28:47 - mmengine - INFO - ascend: None
04/04 12:28:47 - mmengine - INFO - coreml: None
04/04 12:28:47 - mmengine - INFO - tvm: None
04/04 12:28:47 - mmengine - INFO - vacc: None
04/04 12:28:47 - mmengine - INFO -
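Note: the backend log shows ONNXRuntime-gpu 1.16.0 with custom ops NotAvailable. A common cause of very slow ONNX inference is ONNX Runtime silently falling back to the CPU execution provider. A minimal sketch for checking which providers a session actually uses (the model path is the one from the script below; adjust as needed):

import onnxruntime as ort

# Providers compiled into this onnxruntime build
print(ort.get_available_providers())

# Providers the session actually ends up with; if CUDAExecutionProvider
# is missing from this list, inference silently runs on the CPU
sess = ort.InferenceSession(
    'rtmo-m_body7_onnx/end2end.onnx',
    providers=['CUDAExecutionProvider', 'CPUExecutionProvider'])
print(sess.get_providers())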

04/04 12:28:47 - mmengine - INFO - Codebase information
04/04 12:28:47 - mmengine - INFO - mmdet: 3.2.0
04/04 12:28:47 - mmengine - INFO - mmseg: None
04/04 12:28:47 - mmengine - INFO - mmpretrain: 1.2.0
04/04 12:28:47 - mmengine - INFO - mmocr: None
04/04 12:28:47 - mmengine - INFO - mmagic: None
04/04 12:28:47 - mmengine - INFO - mmdet3d: None
04/04 12:28:47 - mmengine - INFO - mmpose: 1.3.1
04/04 12:28:47 - mmengine - INFO - mmrotate: None
04/04 12:28:47 - mmengine - INFO - mmaction: None
04/04 12:28:47 - mmengine - INFO - mmrazor: None
04/04 12:28:47 - mmengine - INFO - mmyolo: None

Reproduces the problem - code sample

from mmdeploy.apis.utils import build_task_processor
from mmdeploy.utils import get_input_shape, load_config
import torch
import time

class ModelOnnx:

    def __init__(self, deploy_cfg, model_cfg, device, backend_model):
        # read deploy_cfg and model_cfg
        deploy_cfg, model_cfg = load_config(deploy_cfg, model_cfg)

        # build task and backend model
        self.task_processor = build_task_processor(model_cfg, deploy_cfg, device)
        self.model = self.task_processor.build_backend_model(backend_model)

        self.input_shape = get_input_shape(deploy_cfg)

    def process_one_image(self, image):
        # time the input preprocessing step on its own
        start_input = time.time()
        model_inputs, _ = self.task_processor.create_input(image, self.input_shape)
        end_input = time.time()
        print(f'Input preparation time: {(end_input - start_input) * 1000:.2f} ms')

        # time the model inference on its own, so the two measurements
        # do not overlap
        start_infer = time.time()
        with torch.no_grad():
            result = self.model.test_step(model_inputs)
        end_infer = time.time()
        print(f'Inference time: {(end_infer - start_infer) * 1000:.2f} ms')

        # visualize results
        self.task_processor.visualize(
            image=image,
            model=self.model,
            result=result[0],
            window_name='visualize',
            output_file=f'{image}_output.png')
        
if __name__ == "__main__":
    deploy_cfg = 'mmdeploy/configs/mmpose/pose-detection_rtmo_onnxruntime_dynamic.py'
    model_cfg = 'mmpose/configs/body_2d_keypoint/rtmo/body7/rtmo-m_16xb16-600e_body7-640x640.py'
    device = 'cuda'
    backend_model = ['rtmo-m_body7_onnx/end2end.onnx']
    image = 'image.jpg'

    model_onnx = ModelOnnx(deploy_cfg, model_cfg, device, backend_model)
    model_onnx.process_one_image(image)

Reproduces the problem - command or script

Same script as in the code sample above, run directly with Python.

Reproduces the problem - error message

I don't get any error message, but inference takes far too long. I have a laptop with an RTX 2060.

Input preparation time: 39.98 ms
Inference time: 29846.40 ms

With the PyTorch RTMO model, the whole process takes only about 40 ms.
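One thing worth ruling out: the first call through the ONNX Runtime CUDA execution provider includes one-time initialization (CUDA context setup, cuDNN algorithm selection), so timing a single image can be misleading. A minimal sketch that separates a warm-up call from steady-state timing, reusing the imports and variables from the script above (the loop count of 5 is arbitrary):

model_onnx = ModelOnnx(deploy_cfg, model_cfg, device, backend_model)
model_inputs, _ = model_onnx.task_processor.create_input(image, model_onnx.input_shape)

# first call: includes one-time backend initialization (warm-up)
with torch.no_grad():
    model_onnx.model.test_step(model_inputs)

# subsequent calls: steady-state inference latency only
for i in range(5):
    t0 = time.time()
    with torch.no_grad():
        model_onnx.model.test_step(model_inputs)
    print(f'Run {i}: {(time.time() - t0) * 1000:.2f} ms')

If the later runs are fast, the 29.8 s is mostly startup cost rather than per-image inference time.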

Additional information

I created the script above to run an RTMO ONNX model, but it takes far too long, so I must be doing something wrong. After running the script I get the following:

Input preparation time: 39.98 ms
Inference time: 29846.40 ms
