
[Bug] C++ TensorRT inference slower than python

Open gachiemchiep opened this issue 8 months ago • 3 comments

Checklist

  • [X] 1. I have searched related issues but cannot get the expected help.
  • [X] 2. I have read the FAQ documentation but cannot get the expected help.
  • [X] 3. The bug has not been fixed in the latest version.

Describe the bug

I converted the rtmdet-ins_s model to TensorRT and wanted to use the C++ SDK inference for the best speed, but unfortunately the C++ demo runs very slowly.

# detector.cxx
time ./bin/detector --device cuda /datadrive/workspace/JIL/202210.exibition/ros2_ws/src/ros2_mmdet/ros2_mmdet_bringup/config/rtmdet/rtmdet-ins_s_8xb32-300e_jil /datadrive/workspace/JIL/202210.exibition/dataset/frisk_case/type_001_val/IMG_1098.jpg
[2023-10-21 15:34:08.765] [mmdeploy] [info] [model.cpp:35] [DirectoryModel] Load model: "/datadrive/workspace/JIL/202210.exibition/ros2_ws/src/ros2_mmdet/ros2_mmdet_bringup/config/rtmdet/rtmdet-ins_s_8xb32-300e_jil"
[2023-10-21 15:34:10.692] [mmdeploy] [warning] [trt_net.cpp:24] TRTNet: TensorRT was linked against cuBLAS/cuBLAS LT 11.6.5 but loaded cuBLAS/cuBLAS LT 111.0.1
[2023-10-21 15:34:10.706] [mmdeploy] [warning] [trt_net.cpp:24] TRTNet: TensorRT was linked against cuBLAS/cuBLAS LT 11.6.5 but loaded cuBLAS/cuBLAS LT 111.0.1
bbox 0, left=111.20, top=300.81, right=440.51, bottom=672.70, label=1, score=0.9782
mask 0, height=1008, width=756

real	0m41.470s
user	0m27.628s
sys	0m56.070s

# object_detection.cpp
time ./bin/object_detection  cuda /datadrive/workspace/JIL/202210.exibition/ros2_ws/src/ros2_mmdet/ros2_mmdet_bringup/config/rtmdet/rtmdet-ins_s_8xb32-300e_jil /datadrive/workspace/JIL/202210.exibition/dataset/frisk_case/type_001_val/IMG_1098.jpg
[2023-10-21 15:36:39.143] [mmdeploy] [info] [model.cpp:35] [DirectoryModel] Load model: "/datadrive/workspace/JIL/202210.exibition/ros2_ws/src/ros2_mmdet/ros2_mmdet_bringup/config/rtmdet/rtmdet-ins_s_8xb32-300e_jil"
[2023-10-21 15:36:41.843] [mmdeploy] [warning] [trt_net.cpp:24] TRTNet: TensorRT was linked against cuBLAS/cuBLAS LT 11.6.5 but loaded cuBLAS/cuBLAS LT 111.0.1
[2023-10-21 15:36:41.864] [mmdeploy] [warning] [trt_net.cpp:24] TRTNet: TensorRT was linked against cuBLAS/cuBLAS LT 11.6.5 but loaded cuBLAS/cuBLAS LT 111.0.1
bbox_count=1
box 0, left=111.20, top=300.81, right=440.51, bottom=672.70, label=1, score=0.9782
mask 0, height=1008, width=756

real	0m5.730s
user	0m3.579s
sys	0m5.112s
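
Note that the `time` numbers above include process startup and TensorRT engine deserialization, not just per-image inference. To isolate the per-image latency, something along these lines could be used to time only the Apply() call after a warm-up run (a rough sketch based on the C++ API used by demo/csrc/cpp/detector.cxx; untested, adapt paths and names to your setup):

// Sketch: time only Detector::Apply(), excluding model load / engine
// deserialization (based on the C++ API used by demo/csrc/cpp/detector.cxx).
#include <chrono>
#include <cstdio>
#include <opencv2/imgcodecs.hpp>
#include "mmdeploy/detector.hpp"

int main(int argc, char* argv[]) {
  if (argc != 4) {
    std::fprintf(stderr, "usage: %s <device> <model_dir> <image>\n", argv[0]);
    return 1;
  }
  cv::Mat img = cv::imread(argv[3]);
  mmdeploy::Model model(argv[2]);                      // engine deserialization happens here
  mmdeploy::Detector detector(model, mmdeploy::Device{argv[1], 0});

  detector.Apply(img);                                 // warm-up (CUDA context, allocations)

  const int iters = 100;
  auto t0 = std::chrono::steady_clock::now();
  for (int i = 0; i < iters; ++i) {
    auto dets = detector.Apply(img);                   // preprocess + inference + postprocess
  }
  auto t1 = std::chrono::steady_clock::now();
  double ms = std::chrono::duration<double, std::milli>(t1 - t0).count() / iters;
  std::printf("mean latency over %d runs: %.2f ms\n", iters, ms);
  return 0;
}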

I used tools/profiler.py to check, and the results are as follows:

# Before converting
python tools/profiler.py \
    configs/mmdet/instance-seg/instance-seg_rtmdet-ins_tensorrt_static-640x640.py  \
    /datadrive/workspace/JIL/202210.exibition/ros2_ws/src/ros2_mmdet/ros2_mmdet_bringup/config/rtmdet/rtmdet-ins_s_8xb32-300e_jil.py \
    tmp \
    --model /datadrive/workspace/JIL/202210.exibition/ros2_ws/src/ros2_mmdet/ros2_mmdet_bringup/config/rtmdet/epoch_280.pth \
    --device cuda --shape 640x640 --num-iter 100


+--------+------------+--------+
| Stats  | Latency/ms |  FPS   |
+--------+------------+--------+
|  Mean  |   31.741   | 31.505 |
| Median |   31.525   | 31.721 |
|  Min   |   29.368   | 34.051 |
|  Max   |   39.096   | 25.578 |
+--------+------------+--------+

# After converting to TensorRT
python tools/profiler.py \
    configs/mmdet/instance-seg/instance-seg_rtmdet-ins_tensorrt_static-640x640.py  \
    /datadrive/workspace/JIL/202210.exibition/ros2_ws/src/ros2_mmdet/ros2_mmdet_bringup/config/rtmdet/rtmdet-ins_s_8xb32-300e_jil.py \
    tmp \
    --model /datadrive/workspace/JIL/202210.exibition/ros2_ws/src/ros2_mmdet/ros2_mmdet_bringup/config/rtmdet/rtmdet-ins_s_8xb32-300e_jil/end2end.engine \
    --device cuda --shape 640x640 --num-iter 100

+--------+------------+--------+
| Stats  | Latency/ms |  FPS   |
+--------+------------+--------+
|  Mean  |   22.498   | 44.449 |
| Median |   22.213   | 45.019 |
|  Min   |   21.788   | 45.896 |
|  Max   |   25.409   | 39.356 |
+--------+------------+--------+

I tried compiling mmdeploy myself and linking against the built library, but it didn't help. Is there any way to improve the inference speed of the C++ SDK with TensorRT?

Thanks

Reproduction

No, I didn't

Environment

10/21 15:46:09 - mmengine - INFO - 

10/21 15:46:09 - mmengine - INFO - **********Environmental information**********
10/21 15:46:12 - mmengine - INFO - sys.platform: linux
10/21 15:46:12 - mmengine - INFO - Python: 3.8.0 | packaged by conda-forge | (default, Nov 22 2019, 19:11:38) [GCC 7.3.0]
10/21 15:46:12 - mmengine - INFO - CUDA available: True
10/21 15:46:12 - mmengine - INFO - numpy_random_seed: 2147483648
10/21 15:46:12 - mmengine - INFO - GPU 0: NVIDIA GeForce GTX 1060 6GB
10/21 15:46:12 - mmengine - INFO - CUDA_HOME: /usr/local/cuda
10/21 15:46:12 - mmengine - INFO - NVCC: Cuda compilation tools, release 11.7, V11.7.64
10/21 15:46:12 - mmengine - INFO - GCC: gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
10/21 15:46:12 - mmengine - INFO - PyTorch: 1.13.1+cu117
10/21 15:46:12 - mmengine - INFO - PyTorch compiling details: PyTorch built with:
  - GCC 9.3
  - C++ Version: 201402
  - Intel(R) Math Kernel Library Version 2020.0.0 Product Build 20191122 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v2.6.0 (Git Hash 52b5f107dd9cf10910aaa19cb47f3abf9b349815)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - CUDA Runtime 11.7
  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86
  - CuDNN 8.9  (built against CUDA 11.8)
    - Built with CuDNN 8.5
  - Magma 2.6.1
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.7, CUDNN_VERSION=8.5.0, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -fabi-version=11 -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wunused-local-typedefs -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.13.1, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, 

10/21 15:46:12 - mmengine - INFO - TorchVision: 0.14.1+cu117
10/21 15:46:12 - mmengine - INFO - OpenCV: 4.8.1
10/21 15:46:12 - mmengine - INFO - MMEngine: 0.8.4
10/21 15:46:12 - mmengine - INFO - MMCV: 2.0.1
10/21 15:46:12 - mmengine - INFO - MMCV Compiler: GCC 9.3
10/21 15:46:12 - mmengine - INFO - MMCV CUDA Compiler: 11.7
10/21 15:46:12 - mmengine - INFO - MMDeploy: 1.3.0+1090fb6
10/21 15:46:12 - mmengine - INFO - 

10/21 15:46:12 - mmengine - INFO - **********Backend information**********
10/21 15:46:12 - mmengine - INFO - tensorrt:	8.2.3.0
10/21 15:46:12 - mmengine - INFO - tensorrt custom ops:	Available
10/21 15:46:12 - mmengine - INFO - ONNXRuntime:	1.16.0
10/21 15:46:12 - mmengine - INFO - ONNXRuntime-gpu:	1.8.1
10/21 15:46:12 - mmengine - INFO - ONNXRuntime custom ops:	Available
10/21 15:46:12 - mmengine - INFO - pplnn:	None
10/21 15:46:12 - mmengine - INFO - ncnn:	None
10/21 15:46:12 - mmengine - INFO - snpe:	None
10/21 15:46:12 - mmengine - INFO - openvino:	None
10/21 15:46:12 - mmengine - INFO - torchscript:	1.13.1+cu117
10/21 15:46:12 - mmengine - INFO - torchscript custom ops:	NotAvailable
10/21 15:46:12 - mmengine - INFO - rknn-toolkit:	None
10/21 15:46:12 - mmengine - INFO - rknn-toolkit2:	None
10/21 15:46:12 - mmengine - INFO - ascend:	None
10/21 15:46:12 - mmengine - INFO - coreml:	None
10/21 15:46:12 - mmengine - INFO - tvm:	None
10/21 15:46:12 - mmengine - INFO - vacc:	None
10/21 15:46:12 - mmengine - INFO - 

10/21 15:46:12 - mmengine - INFO - **********Codebase information**********
10/21 15:46:12 - mmengine - INFO - mmdet:	3.1.0
10/21 15:46:12 - mmengine - INFO - mmseg:	None
10/21 15:46:12 - mmengine - INFO - mmpretrain:	None
10/21 15:46:12 - mmengine - INFO - mmocr:	1.0.1
10/21 15:46:12 - mmengine - INFO - mmagic:	1.0.2
10/21 15:46:12 - mmengine - INFO - mmdet3d:	None
10/21 15:46:12 - mmengine - INFO - mmpose:	None
10/21 15:46:12 - mmengine - INFO - mmrotate:	None
10/21 15:46:12 - mmengine - INFO - mmaction:	None
10/21 15:46:12 - mmengine - INFO - mmrazor:	None
10/21 15:46:12 - mmengine - INFO - mmyolo:	None

Error traceback

No response

gachiemchiep avatar Oct 21 '23 06:10 gachiemchiep

I also noticed the same for RTMPose.

Y-T-G avatar Oct 21 '23 19:10 Y-T-G

I had a similar issue with RTMDet instance segmentation. When checking the profiler I realized that the postprocess step was taking 90% of the inference time. That was caused by the huge images I was using as input (something like 2400x1000) and by the number of masks the model kept.

The size of the input image matters because the SDK resizes the predicted masks back to the original input resolution, so if you have 100 masks of 640x640 it will resize all of them to the original size, which takes a long time. Consequently, if you reduce the number of masks to resize, the cost will be lower.
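
To give an idea of the cost, here is a toy sketch of roughly what that resize step amounts to (my simplified reading, not the SDK's actual ResizeInstanceMask code; the sizes and mask count are just example numbers):

// Toy illustration of the per-mask resize cost in postprocess.
#include <chrono>
#include <cstdio>
#include <vector>
#include <opencv2/imgproc.hpp>

int main() {
  const int num_masks = 100;                  // masks kept after NMS / max_per_img
  const cv::Size model_size{640, 640};        // mask resolution coming out of the model
  const cv::Size original_size{2400, 1000};   // original input image resolution

  std::vector<cv::Mat> masks(num_masks, cv::Mat(model_size, CV_8UC1, cv::Scalar(1)));

  auto t0 = std::chrono::steady_clock::now();
  for (const auto& m : masks) {
    cv::Mat resized;
    cv::resize(m, resized, original_size, 0, 0, cv::INTER_LINEAR);  // one upscale per mask
  }
  auto t1 = std::chrono::steady_clock::now();
  std::printf("resizing %d masks took %.1f ms\n", num_masks,
              std::chrono::duration<double, std::milli>(t1 - t0).count());
  return 0;
}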

The best solution for me would be to have the option to skip this postprocess step entirely, handling the raw model output myself and optimizing the process for each use case. However, I have not been able to set up an MMDeploy configuration to do so, so I came up with a workaround.

To deal with the image resize, I resize my input image before actually feeding it to the model; this way the masks are resized back to the smaller image size and the postprocess no longer spends much time on them.
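
In code, the idea is something like this (a sketch assuming the same mmdeploy C++ Detector API as the demos; the target size and scale handling are up to your use case):

// Downscale the input before Apply() so postprocess resizes masks to the
// smaller image instead of the full-resolution original.
// usage: <binary> <device> <model_dir> <image>
#include <algorithm>
#include <opencv2/imgcodecs.hpp>
#include <opencv2/imgproc.hpp>
#include "mmdeploy/detector.hpp"

int main(int argc, char* argv[]) {
  cv::Mat img = cv::imread(argv[3]);

  // Cap the longer side at 640 (never upscale); any size that keeps enough detail works.
  double scale = std::min(1.0, 640.0 / std::max(img.cols, img.rows));
  cv::Mat small;
  cv::resize(img, small, cv::Size(), scale, scale, cv::INTER_LINEAR);

  mmdeploy::Model model(argv[2]);
  mmdeploy::Detector detector(model, mmdeploy::Device{argv[1], 0});
  auto dets = detector.Apply(small);  // masks come back at the downscaled size

  // Boxes (and mask positions) can be mapped back to the original image by
  // dividing the coordinates by `scale`.
  return 0;
}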

To deal with the too-many-masks problem, I lowered nms_pre and max_per_img to 10% of their default values, which worked really well for my use case, but you have to evaluate that for your specific scenario. I also increased score_thr to 0.5.

gnscc avatar Oct 23 '23 07:10 gnscc

@gnscc

Well, it is very odd, because this C++ inference is even slower than the Python inference without converting to TensorRT. I guess some Python magic runs that post-processing in parallel, which makes it very fast.

I tried your advice and edited some values inside pipeline.json, but it is still very slow compared to the Python inference (without converting to TensorRT).

# converted model's pipeline.json
            {
                "type": "Task",
                "module": "mmdet",
                "name": "postprocess",
                "component": "ResizeInstanceMask",
                "params": {
                    "mask_thr_binary": 0.5,
                    "max_per_img": 10,
                    "min_bbox_size": 0,
                    "nms": {
                        "iou_threshold": 0.6,
                        "type": "nms"
                    },
                    "nms_pre": 100,
                    "score_thr": 0.05,
                    "is_resize_mask": true
                },
                "output": [
                    "post_output"
                ],
                "input": [
                    "prep_output",
                    "infer_output"
                ]
            }

gachiemchiep avatar Oct 31 '23 15:10 gachiemchiep