mmdeploy [pointpillars kitti] TensorRT ONNX slow computing time and low detection performance

Hello, thank you again for the hard work to push deployment of 3D detection models.

I ran several tests on the PointPillars architecture, trained on a reduced KITTI dataset, and compared computing time and detection results for inference with the torch, ONNX and TRT models. I spotted 3 problems:

1. Detection results with TRT and ONNX models are low compared to torch model
2. ONNX model can't be run on a GPU
3. Inference time of TRT model is similar to torch model

Reproduction

Every test was conducted on the same machine, with the same setup (NVIDIA GeForce RTX 2080 Ti).

To convert the model to end2end.onnx and end2end.engine:

python tools/deploy.py \
    configs/mmdet3d/voxel-detection/voxel-detection_tensorrt_dynamic-kitti.py \
    ../mmdetection3d/data/tensorRT_test/kitti_reduced/hv_pointpillars_secfpn_20x20_160e_kitti-3d-3class_tx3.py \
    ../mmdetection3d/data/tensorRT_test/kitti_reduced/latest.pth \
    ../mmdetection3d/data/tensorRT_test/kitti_reduced/kitti_000008.bin \
    --work-dir ~/workspace/mmdeploy_workdir/kitti_reduced/ \
    --device cuda:0

To test inference speed and detection perf on TRT model:

python tools/test.py \
    configs/mmdet3d/voxel-detection/voxel-detection_tensorrt_dynamic-kitti.py \
    ../mmdetection3d/data/tensorRT_test/kitti_reduced/hv_pointpillars_secfpn_20x20_160e_kitti-3d-3class_tx3.py \
    --model ../../mmdeploy_workdir/kitti_reduced/end2end.engine \
    --device cuda \
    --metrics mAP \
    --speed-test

For ONNX:

python tools/test.py \
    configs/mmdet3d/voxel-detection/voxel-detection_tensorrt_dynamic-kitti.py \
    ../mmdetection3d/data/tensorRT_test/kitti_reduced/hv_pointpillars_secfpn_20x20_160e_kitti-3d-3class_tx3.py \
    --model ../../mmdeploy_workdir/kitti_reduced/end2end.onnx \
    --device cuda \
    --metrics mAP \
    --speed-test

Environment

python tools/check_env.py:

2022-06-17 11:07:27,547 - mmdeploy - INFO - 

2022-06-17 11:07:27,547 - mmdeploy - INFO - **********Environmental information**********
2022-06-17 11:07:27,737 - mmdeploy - INFO - sys.platform: linux
2022-06-17 11:07:27,738 - mmdeploy - INFO - Python: 3.7.11 (default, Jul 27 2021, 14:32:16) [GCC 7.5.0]
2022-06-17 11:07:27,738 - mmdeploy - INFO - CUDA available: True
2022-06-17 11:07:27,738 - mmdeploy - INFO - GPU 0,1: NVIDIA GeForce RTX 2080 Ti
2022-06-17 11:07:27,738 - mmdeploy - INFO - CUDA_HOME: /usr/local/cuda
2022-06-17 11:07:27,738 - mmdeploy - INFO - NVCC: Cuda compilation tools, release 10.2, V10.2.8
2022-06-17 11:07:27,738 - mmdeploy - INFO - GCC: gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
2022-06-17 11:07:27,738 - mmdeploy - INFO - PyTorch: 1.9.1+cu111
2022-06-17 11:07:27,738 - mmdeploy - INFO - PyTorch compiling details: PyTorch built with:
  - GCC 7.3
  - C++ Version: 201402
  - Intel(R) oneAPI Math Kernel Library Version 2021.4-Product Build 20210904 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v2.1.2 (Git Hash 98be7e8afa711dc9b66c8ff3504129cb82013cdb)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - CUDA Runtime 11.1
  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86
  - CuDNN 8.0.5
  - Magma 2.5.2
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.1, CUDNN_VERSION=8.0.5, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.9.1, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, 

2022-06-17 11:07:27,738 - mmdeploy - INFO - TorchVision: 0.10.1+cu111
2022-06-17 11:07:27,738 - mmdeploy - INFO - OpenCV: 4.5.5
2022-06-17 11:07:27,738 - mmdeploy - INFO - MMCV: 1.5.0
2022-06-17 11:07:27,738 - mmdeploy - INFO - MMCV Compiler: GCC 7.3
2022-06-17 11:07:27,738 - mmdeploy - INFO - MMCV CUDA Compiler: 11.1
2022-06-17 11:07:27,738 - mmdeploy - INFO - MMDeploy: 0.5.0+3e594b2
2022-06-17 11:07:27,738 - mmdeploy - INFO - 

2022-06-17 11:07:27,738 - mmdeploy - INFO - **********Backend information**********
2022-06-17 11:07:28,219 - mmdeploy - INFO - onnxruntime: 1.8.1	ops_is_avaliable : True
2022-06-17 11:07:28,249 - mmdeploy - INFO - tensorrt: 8.2.3.0	ops_is_avaliable : True
2022-06-17 11:07:28,264 - mmdeploy - INFO - ncnn: None	ops_is_avaliable : False
2022-06-17 11:07:28,265 - mmdeploy - INFO - pplnn_is_avaliable: False
2022-06-17 11:07:28,265 - mmdeploy - INFO - openvino_is_avaliable: False
2022-06-17 11:07:28,265 - mmdeploy - INFO - 

2022-06-17 11:07:28,265 - mmdeploy - INFO - **********Codebase information**********
2022-06-17 11:07:28,267 - mmdeploy - INFO - mmdet:	2.22.0
2022-06-17 11:07:28,267 - mmdeploy - INFO - mmseg:	0.21.1
2022-06-17 11:07:28,267 - mmdeploy - INFO - mmcls:	None
2022-06-17 11:07:28,267 - mmdeploy - INFO - mmocr:	None
2022-06-17 11:07:28,267 - mmdeploy - INFO - mmedit:	None
2022-06-17 11:07:28,267 - mmdeploy - INFO - mmdet3d:	1.0.0rc2
2022-06-17 11:07:28,267 - mmdeploy - INFO - mmpose:	None
2022-06-17 11:07:28,267 - mmdeploy - INFO - mmrotate:	None

Error traceback - ONNX test

The command to test the ONNX model fails when the chosen device is cuda. It works fine with cpu, but it is of course very slow.

[                                                  ] 0/3769, elapsed: 0s, ETA:Traceback (most recent call last):
  File "tools/test.py", line 160, in <module>
    main()
  File "tools/test.py", line 130, in main
    out_dir=args.show_dir)
  File "workspace/mmlab/MMDeploy/mmdeploy/codebase/mmdet3d/deploy/voxel_detection.py", line 275, in single_gpu_test
    data['img_metas'][0].data, False)
  File "/media/local/miniconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/media/local/miniconda3/envs/open-mmlab/lib/python3.7/site-packages/mmcv/parallel/data_parallel.py", line 50, in forward
    return super().forward(*inputs, **kwargs)
  File "/media/local/miniconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 166, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/media/local/miniconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "workspace/mmlab/MMDeploy/mmdeploy/utils/timer.py", line 61, in fun
    result = func(*args, **kwargs)
  File "workspace/mmlab/MMDeploy/mmdeploy/codebase/mmdet3d/deploy/voxel_detection_model.py", line 99, in forward
    outputs = self.wrapper(input_dict)
  File "/media/local/miniconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "workspace/mmlab/MMDeploy/mmdeploy/backend/onnxruntime/wrapper.py", line 90, in forward
    buffer_ptr=input_tensor.data_ptr())
  File "/media/local/miniconda3/envs/open-mmlab/lib/python3.7/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 384, in bind_input
    element_type, shape, buffer_ptr)
RuntimeError: Error when binding input: There's no data transfer registered for copying tensors from Device:[DeviceType:1 MemoryType:0 DeviceId:0] to Device:[DeviceType:0 MemoryType:0 DeviceId:0]
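
One plausible reading of this error is that the ONNX Runtime session has no CUDA execution provider, while the wrapper binds an input that lives in GPU memory (DeviceType:1 to DeviceType:0 in the message). The sketch below is only a quick diagnostic under that assumption: it asks the installed onnxruntime which providers it offers. The helper names `can_bind_cuda_inputs` and `installed_providers` are hypothetical and not part of MMDeploy.

```python
CUDA_PROVIDER = "CUDAExecutionProvider"


def can_bind_cuda_inputs(available_providers):
    """Return True if the CUDA execution provider is among the given providers."""
    return CUDA_PROVIDER in available_providers


def installed_providers():
    """Return the providers reported by the installed onnxruntime, or [] if it is missing."""
    try:
        import onnxruntime as ort
    except ImportError:
        return []
    return list(ort.get_available_providers())


providers = installed_providers()
print("available providers:", providers)
print("CUDA inputs can be bound:", can_bind_cuda_inputs(providers))
```

If only `CPUExecutionProvider` is listed, the CPU-only build of onnxruntime is probably the one being imported.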

Detection results

TORCH

Overall AP11@easy, moderate, hard:
bbox AP11:80.9388, 75.0167, 71.7452
bev AP11:76.2375, 68.0164, 62.8758
3d AP11:70.3602, 60.7129, 55.8411
aos AP11:75.64, 69.00, 65.68

Overall AP40@easy, moderate, hard:
bbox AP40:83.7781, 76.1046, 72.1328
bev AP40:77.5823, 68.2215, 63.7780
3d AP40:71.0172, 60.2046, 55.8146
aos AP40:77.62, 69.21, 65.18

TRT

Overall AP11@easy, moderate, hard:
bbox AP11:44.7631, 32.3248, 26.8689
bev AP11:40.5515, 29.2453, 26.0872
3d AP11:39.2317, 28.2571, 25.1790
aos AP11:42.58, 30.13, 25.28

Overall AP40@easy, moderate, hard:
bbox AP40:42.4627, 29.9030, 25.0707
bev AP40:39.2952, 27.3221, 22.5108
3d AP40:37.2744, 24.8895, 21.5744
aos AP40:39.77, 27.48, 22.94

ONNX (cpu)

Overall AP11@easy, moderate, hard:
bbox AP11:13.2146, 13.6500, 13.8337
bev AP11:12.9011, 13.1752, 13.2280
3d AP11:10.7523, 10.3443, 9.8281
aos AP11:12.87, 13.14, 13.24

Overall AP40@easy, moderate, hard:
bbox AP40:9.6649, 10.2534, 10.3171
bev AP40:9.6426, 9.6352, 9.3478
3d AP40:6.9526, 7.0246, 6.2710
aos AP40:9.40, 9.71, 9.74
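
To compare backends without reading the numbers by eye, the summary lines above can be parsed and subtracted. This is a minimal sketch: the regular expression and the `parse_kitti_summary` helper are my own, written against the text format shown in this post rather than taken from mmdet3d.

```python
import re

# Matches e.g. "3d AP40:71.0172, 60.2046, 55.8146" (easy, moderate, hard).
LINE_RE = re.compile(
    r"(bbox|bev|3d|aos)\s+AP(11|40):\s*([\d.]+),\s*([\d.]+),\s*([\d.]+)"
)


def parse_kitti_summary(text):
    """Map (metric, recall_points) to a tuple of (easy, moderate, hard) floats."""
    return {
        (m.group(1), int(m.group(2))): tuple(float(m.group(i)) for i in (3, 4, 5))
        for m in LINE_RE.finditer(text)
    }


# AP40 summaries copied from the TORCH and TRT results above.
torch_ap40 = (
    "bbox AP40:83.7781, 76.1046, 72.1328 bev AP40:77.5823, 68.2215, 63.7780 "
    "3d AP40:71.0172, 60.2046, 55.8146 aos AP40:77.62, 69.21, 65.18"
)
trt_ap40 = (
    "bbox AP40:42.4627, 29.9030, 25.0707 bev AP40:39.2952, 27.3221, 22.5108 "
    "3d AP40:37.2744, 24.8895, 21.5744 aos AP40:39.77, 27.48, 22.94"
)

torch_scores = parse_kitti_summary(torch_ap40)
trt_scores = parse_kitti_summary(trt_ap40)

for key in sorted(torch_scores):
    gap = [round(t - r, 4) for t, r in zip(torch_scores[key], trt_scores[key])]
    print(key, "torch - trt:", gap)
```

On these numbers the 3d AP40 moderate score drops by about 35 points between torch and TRT.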

Inference time

Only the model forward pass is timed. The computing time includes neither the voxelization nor the bbox post-processing.

ONNX: [__ort_execute]-3710 times per count: 235.32 ms, 4.25 FPS (CPU)
TRT: [__trt_execute]-3710 times per count: 6.72 ms, 148.88 FPS
Torch: Between 5 and 6ms per frame.
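
For reference, latency and FPS are related by FPS = 1000 / latency_ms. The small sketch below applies that to the figures above; the function name is hypothetical, and the Torch entries are just the two ends of the "5 to 6 ms" range.

```python
def fps_from_latency_ms(latency_ms):
    """Convert a per-frame latency in milliseconds to frames per second."""
    return 1000.0 / latency_ms


# Latencies reported above, in milliseconds per frame.
reported_ms = {
    "ONNX (CPU)": 235.32,
    "TRT": 6.72,
    "Torch (slow end)": 6.0,
    "Torch (fast end)": 5.0,
}

for name, ms in reported_ms.items():
    print(f"{name}: {ms} ms -> {fps_from_latency_ms(ms):.2f} FPS")
```

5 to 6 ms corresponds to roughly 167 to 200 FPS, so the TRT engine at 6.72 ms is not faster than the torch model here.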

Jun 17 '22 12:06 alexandrealmin

@alexandrealmin Hi,

1. Detection results with TRT and ONNX models are low compared to torch model:
You could compare performance by following https://mmdeploy.readthedocs.io/en/latest/02-how-to-run/how_to_evaluate_a_model.html

2. ONNX model can't be run on a GPU:
You have to install onnxruntime-gpu with pip to run onnxruntime on GPU.

3. Inference time of TRT model is similar to torch model:
The FPS is already quite high and the percentage of GPU usage is low, so there may be little room left for improvement on the GPU side.

Jun 20 '22 07:06 RunningLeon

@RunningLeon Hi, Thank you for your answers

1. Yes, that's what I did. I tested on different servers, and the detection results from TensorRT are bad compared to the Torch and ONNX models.
2. Thanks, it indeed works a lot better. The ONNX and Torch models give exactly the same detection scores, as expected, which shows that something is definitely wrong with the TensorRT conversion (cf. 1.). I'm open to suggestions for further tests.
3. I didn't get the chance to dig deeper into this topic, but when testing on the full KITTI scans (not the reduced version with ~10% of the points), GPU usage is high (90-100%).

Jul 05 '22 15:07 alexandrealmin

Can you explain specifically how the low accuracy problem of TensorRT was solved? I faced the same problem when converting mask_rcnn.

Jul 20 '22 01:07 jiaqizhang123-stack

It's actually not solved. I'm working on it, but I'm also open to suggestions.

Jul 20 '22 08:07 alexandrealmin

I still don't understand the low accuracy in FP32 TRT, but I made several tests, with the following conclusions:

Conversion to TensorRT with FP16 precision gives the same results as the ONNX and torch models.
Increasing the max_workspace_size variable does not change the results but suppresses some warnings.
The ONNX opset version shouldn't be raised above 11, as MMDeploy does not support higher versions.
Trying newer opset versions requires a PyTorch upgrade.
With PyTorch 1.10.0, CUDA 11.4 and TensorRT 8.2.3.0, the conversion from ONNX to TensorRT does not work.
With PyTorch 1.10.0, CUDA 11.4 and TensorRT 8.4.1.5, the conversion from ONNX to TensorRT works, but it doesn't change the bad detection results in FP32 precision. Worse, the conversion to FP16 works but gives 0.00 scores for every metric.
An upgrade to CUDA 11.6 was not possible as it requires PyTorch 1.12.0, which is not yet compatible with MMDeploy (see #745, "torch2onnx failed in adaptive_avg_pool2d for PSPNet with torch==1.12").

I hope this helps, even if it does not solve the problem.
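
For anyone who wants to repeat the FP16 and workspace-size tests, the settings involved can be sketched as a deploy-config override. This is an assumption-laden fragment, not the exact config used above: the key names `fp16_mode`, `max_workspace_size` and `opset_version` are taken from my understanding of the MMDeploy 0.x base configs and should be checked against the installed version.

```python
# Hypothetical override of the TensorRT backend settings (MMDeploy 0.x style).
backend_config = dict(
    type='tensorrt',
    common_config=dict(
        fp16_mode=True,              # assumed switch for building an FP16 engine
        max_workspace_size=1 << 30,  # 1 GiB of builder workspace
    ),
)

# Assumed ONNX export setting; opset 11 is the version discussed above.
onnx_config = dict(opset_version=11)

print(backend_config)
print(onnx_config)
```

Setting `fp16_mode=False` in the same fragment would correspond to the FP32 engine that shows the low scores.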

Jul 28 '22 13:07 alexandrealmin

Could you please explain more about onnxruntime-gpu? I have tried many times, but can't fix it.

Aug 23 '22 15:08 MelosY

@lvskiller Hi, can you give details about your problem please? At first, my environment was mixing up onnxruntime and onnxruntime-gpu. Make sure you have a clean install of the dependencies and a clean build of mmdet3d.
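
A quick way to spot the mix-up described here is to check whether both distributions are installed in the same environment. The sketch below is a hedged diagnostic, assuming Python 3.8+ for `importlib.metadata`; the helper names are hypothetical.

```python
from importlib import metadata

ONNXRUNTIME_DISTS = {"onnxruntime", "onnxruntime-gpu"}


def normalize(name):
    """Normalize a distribution name the way pip compares them."""
    return name.lower().replace("_", "-")


def conflicting_onnxruntime_packages(installed_names):
    """Return the onnxruntime distributions found if more than one is installed, else []."""
    found = sorted(ONNXRUNTIME_DISTS & {normalize(n) for n in installed_names})
    return found if len(found) > 1 else []


def installed_distribution_names():
    """List the names of all distributions visible to this interpreter."""
    names = []
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:
            names.append(name)
    return names


conflicts = conflicting_onnxruntime_packages(installed_distribution_names())
print("conflicting onnxruntime packages:", conflicts or "none")
```

If both names are reported, uninstalling both and reinstalling only onnxruntime-gpu is the clean-install step suggested above.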

Aug 30 '22 13:08 alexandrealmin

Oh, I didn't remove the CPU build of onnxruntime at first, so maybe that's the key. I will look into it further.

Aug 31 '22 01:08 MelosY

Hit this and just needed to upgrade

pip install --upgrade onnx onnxruntime-gpu onnxruntime

Jul 17 '23 08:07 GeorgePearse

Hi, I have run the command pip install --upgrade onnx onnxruntime-gpu onnxruntime but still got the same error:

RuntimeError: Error when binding input: There's no data transfer registered for copying tensors from Device:[DeviceType:1 MemoryType:0 DeviceId:0] to Device:[DeviceType:0 MemoryType:0 DeviceId:0]

Aug 31 '23 02:08 Worromots

Hi, have you solved the low accuracy problem of TensorRT? I faced the same problem when converting oriented_rcnn to TRT.

Aug 31 '23 03:08 Worromots
