mmdeploy
[pointpillars kitti] TensorRT ONNX slow computing time and low detection performance
Hello, thank you again for the hard work to push deployment of 3D detection models.
I ran several tests on the PointPillars architecture, trained on the (reduced) KITTI dataset, and compared the computing time and detection results for inference with the Torch, ONNX and TRT models. I spotted 3 problems:
- Detection results with TRT and ONNX models are low compared to torch model
- ONNX model can't be run on a GPU
- Inference time of TRT model is similar to torch model
Reproduction
Every test was conducted on the same machine, with the same setup (NVIDIA GeForce RTX 2080 Ti).
To convert the model to end2end.onnx and end2end.engine:
python tools/deploy.py \
    configs/mmdet3d/voxel-detection/voxel-detection_tensorrt_dynamic-kitti.py \
    ../mmdetection3d/data/tensorRT_test/kitti_reduced/hv_pointpillars_secfpn_20x20_160e_kitti-3d-3class_tx3.py \
    ../mmdetection3d/data/tensorRT_test/kitti_reduced/latest.pth \
    ../mmdetection3d/data/tensorRT_test/kitti_reduced/kitti_000008.bin \
    --work-dir ~/workspace/mmdeploy_workdir/kitti_reduced/ \
    --device cuda:0
To test inference speed and detection perf on TRT model:
python tools/test.py \
    configs/mmdet3d/voxel-detection/voxel-detection_tensorrt_dynamic-kitti.py \
    ../mmdetection3d/data/tensorRT_test/kitti_reduced/hv_pointpillars_secfpn_20x20_160e_kitti-3d-3class_tx3.py \
    --model ../../mmdeploy_workdir/kitti_reduced/end2end.engine \
    --device cuda \
    --metrics mAP \
    --speed-test
For ONNX:
python tools/test.py \
    configs/mmdet3d/voxel-detection/voxel-detection_tensorrt_dynamic-kitti.py \
    ../mmdetection3d/data/tensorRT_test/kitti_reduced/hv_pointpillars_secfpn_20x20_160e_kitti-3d-3class_tx3.py \
    --model ../../mmdeploy_workdir/kitti_reduced/end2end.onnx \
    --device cuda \
    --metrics mAP \
    --speed-test
Environment
Output of python tools/check_env.py:
2022-06-17 11:07:27,547 - mmdeploy - INFO -
2022-06-17 11:07:27,547 - mmdeploy - INFO - **********Environmental information**********
2022-06-17 11:07:27,737 - mmdeploy - INFO - sys.platform: linux
2022-06-17 11:07:27,738 - mmdeploy - INFO - Python: 3.7.11 (default, Jul 27 2021, 14:32:16) [GCC 7.5.0]
2022-06-17 11:07:27,738 - mmdeploy - INFO - CUDA available: True
2022-06-17 11:07:27,738 - mmdeploy - INFO - GPU 0,1: NVIDIA GeForce RTX 2080 Ti
2022-06-17 11:07:27,738 - mmdeploy - INFO - CUDA_HOME: /usr/local/cuda
2022-06-17 11:07:27,738 - mmdeploy - INFO - NVCC: Cuda compilation tools, release 10.2, V10.2.8
2022-06-17 11:07:27,738 - mmdeploy - INFO - GCC: gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
2022-06-17 11:07:27,738 - mmdeploy - INFO - PyTorch: 1.9.1+cu111
2022-06-17 11:07:27,738 - mmdeploy - INFO - PyTorch compiling details: PyTorch built with:
- GCC 7.3
- C++ Version: 201402
- Intel(R) oneAPI Math Kernel Library Version 2021.4-Product Build 20210904 for Intel(R) 64 architecture applications
- Intel(R) MKL-DNN v2.1.2 (Git Hash 98be7e8afa711dc9b66c8ff3504129cb82013cdb)
- OpenMP 201511 (a.k.a. OpenMP 4.5)
- NNPACK is enabled
- CPU capability usage: AVX2
- CUDA Runtime 11.1
- NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86
- CuDNN 8.0.5
- Magma 2.5.2
- Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.1, CUDNN_VERSION=8.0.5, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.9.1, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON,
2022-06-17 11:07:27,738 - mmdeploy - INFO - TorchVision: 0.10.1+cu111
2022-06-17 11:07:27,738 - mmdeploy - INFO - OpenCV: 4.5.5
2022-06-17 11:07:27,738 - mmdeploy - INFO - MMCV: 1.5.0
2022-06-17 11:07:27,738 - mmdeploy - INFO - MMCV Compiler: GCC 7.3
2022-06-17 11:07:27,738 - mmdeploy - INFO - MMCV CUDA Compiler: 11.1
2022-06-17 11:07:27,738 - mmdeploy - INFO - MMDeploy: 0.5.0+3e594b2
2022-06-17 11:07:27,738 - mmdeploy - INFO -
2022-06-17 11:07:27,738 - mmdeploy - INFO - **********Backend information**********
2022-06-17 11:07:28,219 - mmdeploy - INFO - onnxruntime: 1.8.1 ops_is_avaliable : True
2022-06-17 11:07:28,249 - mmdeploy - INFO - tensorrt: 8.2.3.0 ops_is_avaliable : True
2022-06-17 11:07:28,264 - mmdeploy - INFO - ncnn: None ops_is_avaliable : False
2022-06-17 11:07:28,265 - mmdeploy - INFO - pplnn_is_avaliable: False
2022-06-17 11:07:28,265 - mmdeploy - INFO - openvino_is_avaliable: False
2022-06-17 11:07:28,265 - mmdeploy - INFO -
2022-06-17 11:07:28,265 - mmdeploy - INFO - **********Codebase information**********
2022-06-17 11:07:28,267 - mmdeploy - INFO - mmdet: 2.22.0
2022-06-17 11:07:28,267 - mmdeploy - INFO - mmseg: 0.21.1
2022-06-17 11:07:28,267 - mmdeploy - INFO - mmcls: None
2022-06-17 11:07:28,267 - mmdeploy - INFO - mmocr: None
2022-06-17 11:07:28,267 - mmdeploy - INFO - mmedit: None
2022-06-17 11:07:28,267 - mmdeploy - INFO - mmdet3d: 1.0.0rc2
2022-06-17 11:07:28,267 - mmdeploy - INFO - mmpose: None
2022-06-17 11:07:28,267 - mmdeploy - INFO - mmrotate: None
Error traceback - ONNX test
When running the command to test ONNX model, it fails when the chosen device is cuda. It works fine with cpu, but it's of course very slow.
[ ] 0/3769, elapsed: 0s, ETA:
Traceback (most recent call last):
  File "tools/test.py", line 160, in <module>
    main()
  File "tools/test.py", line 130, in main
    out_dir=args.show_dir)
  File "workspace/mmlab/MMDeploy/mmdeploy/codebase/mmdet3d/deploy/voxel_detection.py", line 275, in single_gpu_test
    data['img_metas'][0].data, False)
  File "/media/local/miniconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/media/local/miniconda3/envs/open-mmlab/lib/python3.7/site-packages/mmcv/parallel/data_parallel.py", line 50, in forward
    return super().forward(*inputs, **kwargs)
  File "/media/local/miniconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 166, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/media/local/miniconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "workspace/mmlab/MMDeploy/mmdeploy/utils/timer.py", line 61, in fun
    result = func(*args, **kwargs)
  File "workspace/mmlab/MMDeploy/mmdeploy/codebase/mmdet3d/deploy/voxel_detection_model.py", line 99, in forward
    outputs = self.wrapper(input_dict)
  File "/media/local/miniconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "workspace/mmlab/MMDeploy/mmdeploy/backend/onnxruntime/wrapper.py", line 90, in forward
    buffer_ptr=input_tensor.data_ptr())
  File "/media/local/miniconda3/envs/open-mmlab/lib/python3.7/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 384, in bind_input
    element_type, shape, buffer_ptr)
RuntimeError: Error when binding input: There's no data transfer registered for copying tensors from Device:[DeviceType:1 MemoryType:0 DeviceId:0] to Device:[DeviceType:0 MemoryType:0 DeviceId:0]
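A quick way to see whether the installed onnxruntime build can use the GPU at all is to list its execution providers. The sketch below is only an illustration: `pick_providers` and `report` are hypothetical helper names, while `onnxruntime.get_available_providers()` and `onnxruntime.get_device()` are existing onnxruntime calls. If `CUDAExecutionProvider` is missing from the list, the binding error above would be consistent with a CPU-only build, although that is an assumption about this particular setup.

```python
from typing import List, Sequence


def pick_providers(available: Sequence[str], want_gpu: bool = True) -> List[str]:
    """Return the execution providers to request, preferring CUDA when it is listed."""
    if want_gpu and "CUDAExecutionProvider" in available:
        # Keep the CPU provider as a fallback for ops without a CUDA kernel.
        return ["CUDAExecutionProvider", "CPUExecutionProvider"]
    return ["CPUExecutionProvider"]


def report() -> str:
    """Describe what the locally installed onnxruntime build offers (hypothetical helper)."""
    try:
        import onnxruntime as ort
    except ImportError:
        return "onnxruntime is not installed"
    available = ort.get_available_providers()
    return (
        f"device={ort.get_device()} "
        f"available={available} "
        f"chosen={pick_providers(available)}"
    )


if __name__ == "__main__":
    print(report())
```

On a CPU-only install, `chosen` contains only `CPUExecutionProvider`, which means requesting `--device cuda` cannot succeed until a GPU-enabled build is installed.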
Detection results
- TORCH
  Overall AP11@easy, moderate, hard:
    bbox AP11:80.9388, 75.0167, 71.7452
    bev AP11:76.2375, 68.0164, 62.8758
    3d AP11:70.3602, 60.7129, 55.8411
    aos AP11:75.64, 69.00, 65.68
  Overall AP40@easy, moderate, hard:
    bbox AP40:83.7781, 76.1046, 72.1328
    bev AP40:77.5823, 68.2215, 63.7780
    3d AP40:71.0172, 60.2046, 55.8146
    aos AP40:77.62, 69.21, 65.18
- TRT
  Overall AP11@easy, moderate, hard:
    bbox AP11:44.7631, 32.3248, 26.8689
    bev AP11:40.5515, 29.2453, 26.0872
    3d AP11:39.2317, 28.2571, 25.1790
    aos AP11:42.58, 30.13, 25.28
  Overall AP40@easy, moderate, hard:
    bbox AP40:42.4627, 29.9030, 25.0707
    bev AP40:39.2952, 27.3221, 22.5108
    3d AP40:37.2744, 24.8895, 21.5744
    aos AP40:39.77, 27.48, 22.94
- ONNX (cpu)
  Overall AP11@easy, moderate, hard:
    bbox AP11:13.2146, 13.6500, 13.8337
    bev AP11:12.9011, 13.1752, 13.2280
    3d AP11:10.7523, 10.3443, 9.8281
    aos AP11:12.87, 13.14, 13.24
  Overall AP40@easy, moderate, hard:
    bbox AP40:9.6649, 10.2534, 10.3171
    bev AP40:9.6426, 9.6352, 9.3478
    3d AP40:6.9526, 7.0246, 6.2710
    aos AP40:9.40, 9.71, 9.74
Inference time
Only the model forward pass is timed. The computing time takes into account neither the voxelization nor the bbox post-processing.
- ONNX: [__ort_execute]-3710 times per count: 235.32 ms, 4.25 FPS (CPU)
- TRT: [__trt_execute]-3710 times per count: 6.72 ms, 148.88 FPS
- Torch: Between 5 and 6ms per frame.
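To put the three timings on the same scale, the per-frame latencies can be converted to frames per second. This is plain arithmetic on the numbers quoted above; the function name is hypothetical. Note that 1000 / 6.72 gives about 148.81 FPS, slightly below the 148.88 FPS printed in the log, presumably because the logged latency is rounded.

```python
def fps_from_latency_ms(latency_ms: float) -> float:
    """Convert a per-frame latency in milliseconds to frames per second."""
    if latency_ms <= 0:
        raise ValueError("latency must be positive")
    return 1000.0 / latency_ms


# Latencies reported in this issue; the Torch figure is given as a 5-6 ms range.
timings_ms = [
    ("ONNX (CPU)", 235.32),
    ("TRT", 6.72),
    ("Torch (5 ms bound)", 5.0),
    ("Torch (6 ms bound)", 6.0),
]

for name, ms in timings_ms:
    print(f"{name}: {ms:.2f} ms -> {fps_from_latency_ms(ms):.2f} FPS")
```

Read this way, the Torch range corresponds to roughly 167 to 200 FPS, so the TRT engine is not faster than the Torch model in these runs.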
@alexandrealmin Hi,
1. Detection results with TRT and ONNX models are low compared to torch model: you could compare performance by following https://mmdeploy.readthedocs.io/en/latest/02-how-to-run/how_to_evaluate_a_model.html
2. ONNX model can't be run on a GPU: you have to install onnxruntime-gpu with pip to run onnxruntime on GPU.
3. Inference time of TRT model is similar to torch model: the FPS is already quite high and the percentage of GPU usage is low, so there might be little room for improvement on the GPU side.
@RunningLeon Hi, thank you for your answers.
1. Yes, that's what I did. I tested on different servers, and the detection results from TensorRT are bad compared to the Torch and ONNX models.
2. Thanks, it indeed works a lot better. The ONNX and Torch models give exactly the same detection scores, as expected, which shows that something is definitely wrong with the TensorRT conversion (cf. 1.). I'm open to suggestions for tests to run.
3. I didn't get the chance to go deeper on this topic, but when testing on the full KITTI scans (not the reduced version with ~10% of the points), GPU usage is high (90-100%).
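One possible test for narrowing down the TensorRT discrepancy is to feed the same input to two backends and compare the raw head outputs before post-processing. The sketch below is only an illustration in pure Python: `compare_outputs` is a hypothetical helper, the tensors are assumed to have been flattened to lists of floats beforehand, and the sample values are made up rather than taken from this issue.

```python
import math
from typing import Dict, Sequence


def compare_outputs(reference: Sequence[float], candidate: Sequence[float]) -> Dict[str, float]:
    """Compare two flattened output tensors element by element."""
    if len(reference) != len(candidate):
        raise ValueError("outputs must have the same number of elements")
    if not reference:
        raise ValueError("outputs must not be empty")
    max_abs_diff = max(abs(r - c) for r, c in zip(reference, candidate))
    dot = sum(r * c for r, c in zip(reference, candidate))
    norm_ref = math.sqrt(sum(r * r for r in reference))
    norm_cand = math.sqrt(sum(c * c for c in candidate))
    if norm_ref == 0.0 or norm_cand == 0.0:
        cosine = float("nan")  # undefined when one output is all zeros
    else:
        cosine = dot / (norm_ref * norm_cand)
    return {"max_abs_diff": max_abs_diff, "cosine_similarity": cosine}


# Made-up values for illustration only, not measurements from this thread.
torch_scores = [0.91, 0.05, 0.33, 0.78]
trt_scores = [0.90, 0.06, 0.31, 0.80]
print(compare_outputs(torch_scores, trt_scores))
```

A cosine similarity close to 1 with a small maximum difference would point at post-processing rather than the engine itself, while a large gap on the raw outputs would point at the conversion.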
Can you specifically talk about how the low accuracy problem of tensorrt is solved? I also faced the same problem when converting mask_rcnn.
It's actually not solved yet. I'm working on it, and I'm also open to suggestions.
I still don't understand the low accuracy in FP32 TRT, but I made several tests, with the following conclusions:
- Conversion to TensorRT with FP16 precision gives the same results as the ONNX and Torch models.
- Increasing the max_workspace_size variable does not change the results, but it suppresses some warnings.
- The ONNX OP (opset) version shouldn't be raised above 11, as MMDeploy does not support higher versions.
- To try newer OP versions, a PyTorch upgrade is necessary.
- With PyTorch 1.10.0, CUDA 11.4 and TensorRT 8.2.3.0, the conversion from ONNX to TensorRT does not work.
- With PyTorch 1.10.0, CUDA 11.4 and TensorRT 8.4.1.5, the conversion from ONNX to TensorRT works, but it doesn't change the bad detection results in FP32 precision. Worse, the conversion to FP16 works but gives 0.00 scores for every metric.
- An upgrade to CUDA 11.6 was not possible, as it requires PyTorch 1.12.0, which is not yet compatible with MMDeploy (see #745, "torch2onnx failed in adaptive_avg_pool2d for PSPNet with torch==1.12").
I hope this helps, even if it does not solve the problem.
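For readers who want to repeat the FP16 and workspace-size experiments, the settings mentioned above live in the deploy config. The fragment below is a sketch based on how MMDeploy 0.x TensorRT configs are commonly laid out; the field names (`fp16_mode`, `max_workspace_size`, `opset_version`) should be checked against the installed version, and the workspace value is an arbitrary illustration. A real deploy config also needs the input shape ranges inherited from its base files, which are omitted here.

```python
# Sketch of a deploy-config override; verify field names against your MMDeploy version.
onnx_config = dict(
    opset_version=11,  # the thread reports that opsets above 11 are not supported
)

backend_config = dict(
    type="tensorrt",
    common_config=dict(
        fp16_mode=True,              # build an FP16 engine instead of FP32
        max_workspace_size=1 << 30,  # 1 GiB, an illustrative value
    ),
)

print(backend_config["common_config"])
```

According to the list above, FP16 matched the Torch results with the original setup but produced 0.00 scores with TensorRT 8.4.1.5, so the outcome of this setting appears to depend on the TensorRT version.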
Could you please say more about onnxruntime-gpu? I have tried many times, but I can't fix it.
@lvskiller Hi, can you give details about your problem, please? At first, my environment was mixing up onnxruntime and onnxruntime-gpu. Make sure you have a clean install of the dependencies and a clean build of mmdet3d.
Oh, I didn't remove the CPU onnxruntime package first, so maybe that's the key. I will look into it further.
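Since the exchange above points at a mixed install of the CPU and GPU packages, it can help to check which onnxruntime distributions are present in the environment. The sketch below is an illustration only: the helper names are hypothetical, and it relies on `importlib.metadata`, which requires Python 3.8 or newer (the environment in this issue uses Python 3.7, where `pip list` gives the same information).

```python
from importlib import metadata
from typing import Iterable, List

# The two distributions discussed in this thread.
ORT_VARIANTS = ("onnxruntime", "onnxruntime-gpu")


def normalize(name: str) -> str:
    """Normalize a distribution name for comparison (lowercase, dashes)."""
    return name.lower().replace("_", "-")


def installed_ort_variants(names: Iterable[str]) -> List[str]:
    """Return which onnxruntime distributions appear in the given names."""
    found = {normalize(n) for n in names}
    return [variant for variant in ORT_VARIANTS if variant in found]


def has_conflict(names: Iterable[str]) -> bool:
    """True when both the CPU and GPU distributions are installed side by side."""
    return len(installed_ort_variants(names)) > 1


if __name__ == "__main__":
    installed = [d.metadata.get("Name") for d in metadata.distributions()]
    installed = [n for n in installed if n]  # skip entries without a name
    print("onnxruntime distributions:", installed_ort_variants(installed))
    print("conflict:", has_conflict(installed))
```

If both distributions show up, uninstalling both and reinstalling only onnxruntime-gpu is the clean-install approach suggested in the replies above.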
I hit this too and just needed to upgrade:
pip install --upgrade onnx onnxruntime-gpu onnxruntime
Hi, I have run the command pip install --upgrade onnx onnxruntime-gpu onnxruntime but still got the same error:
RuntimeError: Error when binding input: There's no data transfer registered for copying tensors from Device:[DeviceType:1 MemoryType:0 DeviceId:0] to Device:[DeviceType:0 MemoryType:0 DeviceId:0]
Hi, have you solved the low accuracy problem with TensorRT? I faced the same problem when converting oriented_rcnn to TRT.