
TensorrtExecutionProvider slower than CUDAExecutionProvider: Faster-rcnn [Performance]

Open datinje opened this issue 2 years ago • 89 comments

Describe the issue

On my Faster-RCNN-RPN models doing pattern detection, after considerable effort to infer with the TensorRT EP (see https://github.com/microsoft/onnxruntime/issues/16886, which shows that I simplified the model and inferred the shapes of the model nodes before submitting it to TRT), I found that the TRT EP is about 30% slower than the CUDA EP in FP32 (and in TF32); only with FP16 does the TRT EP almost catch up.

I am only reporting the second inference here, not the warm-up one (which is considerably slower, which is normal).
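For reference, here is a minimal sketch of how this second-inference comparison can be measured from Python; the model path, input shape and provider lists below are placeholders, not my actual setup:

```python
import time
import numpy as np
import onnxruntime as ort

def second_inference_ms(providers, model_path="model.onnx"):
    # Time only the second run so engine build / warm-up cost is excluded.
    sess = ort.InferenceSession(model_path, providers=providers)
    inp = sess.get_inputs()[0]
    x = np.random.rand(1, 3, 800, 800).astype(np.float32)  # placeholder input shape
    sess.run(None, {inp.name: x})                           # warm-up run (much slower, ignored)
    start = time.perf_counter()
    sess.run(None, {inp.name: x})                           # measured second run
    return (time.perf_counter() - start) * 1000.0

print("TRT EP :", second_inference_ms(["TensorrtExecutionProvider", "CUDAExecutionProvider"]))
print("CUDA EP:", second_inference_ms(["CUDAExecutionProvider"]))
```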

After looking at the VERBOSE-mode logs, I found out that not all the nodes are running on TRT: one is still on the CPU EP and five are on the CUDA EP. That causes many memory transfers between host and GPU, which I suppose is the reason. So my question is: why are there still nodes on the CPU and CUDA EPs? Can this be fixed?

Here are the logs:

    2023-09-06 16:45:59.604024060 [V:onnxruntime:, session_state.cc:1149 VerifyEachNodeIsAssignedToAnEp] Node placements
    2023-09-06 16:45:59.604038849 [V:onnxruntime:, session_state.cc:1155 VerifyEachNodeIsAssignedToAnEp] Node(s) placed on [TensorrtExecutionProvider]. Number of nodes: 11
    2023-09-06 16:45:59.604042765 [V:onnxruntime:, session_state.cc:1157 VerifyEachNodeIsAssignedToAnEp]  TRTKernel_graph_torch_jit_15684953649142847852_0 (TensorrtExecutionProvider_TRTKernel_graph_torch_jit_15684953649142847852_0_0)
    2023-09-06 16:45:59.604046398 [V:onnxruntime:, session_state.cc:1157 VerifyEachNodeIsAssignedToAnEp]  TRTKernel_graph_torch_jit_15684953649142847852_1 (TensorrtExecutionProvider_TRTKernel_graph_torch_jit_15684953649142847852_1_1)
    2023-09-06 16:45:59.604049385 [V:onnxruntime:, session_state.cc:1157 VerifyEachNodeIsAssignedToAnEp]  TRTKernel_graph_torch_jit_15684953649142847852_2 (TensorrtExecutionProvider_TRTKernel_graph_torch_jit_15684953649142847852_2_2)
    2023-09-06 16:45:59.604052381 [V:onnxruntime:, session_state.cc:1157 VerifyEachNodeIsAssignedToAnEp]  TRTKernel_graph_torch_jit_15684953649142847852_3 (TensorrtExecutionProvider_TRTKernel_graph_torch_jit_15684953649142847852_3_3)
    2023-09-06 16:45:59.604055213 [V:onnxruntime:, session_state.cc:1157 VerifyEachNodeIsAssignedToAnEp]  TRTKernel_graph_torch_jit_15684953649142847852_4 (TensorrtExecutionProvider_TRTKernel_graph_torch_jit_15684953649142847852_4_4)
    2023-09-06 16:45:59.604057978 [V:onnxruntime:, session_state.cc:1157 VerifyEachNodeIsAssignedToAnEp]  TRTKernel_graph_torch_jit_15684953649142847852_5 (TensorrtExecutionProvider_TRTKernel_graph_torch_jit_15684953649142847852_5_5)
    2023-09-06 16:45:59.604060720 [V:onnxruntime:, session_state.cc:1157 VerifyEachNodeIsAssignedToAnEp]  TRTKernel_graph_torch_jit_15684953649142847852_6 (TensorrtExecutionProvider_TRTKernel_graph_torch_jit_15684953649142847852_6_6)
    2023-09-06 16:45:59.604063521 [V:onnxruntime:, session_state.cc:1157 VerifyEachNodeIsAssignedToAnEp]  MemcpyFromHost (Memcpy)
    2023-09-06 16:45:59.604066111 [V:onnxruntime:, session_state.cc:1157 VerifyEachNodeIsAssignedToAnEp]  MemcpyToHost (Memcpy_token_422)
    2023-09-06 16:45:59.604068754 [V:onnxruntime:, session_state.cc:1157 VerifyEachNodeIsAssignedToAnEp]  MemcpyToHost (Memcpy_token_423)
    2023-09-06 16:45:59.604078119 [V:onnxruntime:, session_state.cc:1157 VerifyEachNodeIsAssignedToAnEp]  MemcpyToHost (Memcpy_token_424)
    2023-09-06 16:45:59.604081367 [V:onnxruntime:, session_state.cc:1155 VerifyEachNodeIsAssignedToAnEp] Node(s) placed on [CPUExecutionProvider]. Number of nodes: 1
    2023-09-06 16:45:59.604086459 [V:onnxruntime:, session_state.cc:1157 VerifyEachNodeIsAssignedToAnEp]  RoiAlign (/model/roi_heads/box_pooler/level_poolers.0/RoiAlign)
    2023-09-06 16:45:59.604093948 [V:onnxruntime:, session_state.cc:1155 VerifyEachNodeIsAssignedToAnEp] Node(s) placed on [CUDAExecutionProvider]. Number of nodes: 5
    2023-09-06 16:45:59.604099017 [V:onnxruntime:, session_state.cc:1157 VerifyEachNodeIsAssignedToAnEp]  NonZero (/model/proposal_generator/NonZero)
    2023-09-06 16:45:59.604103942 [V:onnxruntime:, session_state.cc:1157 VerifyEachNodeIsAssignedToAnEp]  NonMaxSuppression (NonMaxSuppression_497)
    2023-09-06 16:45:59.604108777 [V:onnxruntime:, session_state.cc:1157 VerifyEachNodeIsAssignedToAnEp]  NonZero (/model/roi_heads/NonZero)
    2023-09-06 16:45:59.604113159 [V:onnxruntime:, session_state.cc:1157 VerifyEachNodeIsAssignedToAnEp]  NonMaxSuppression (NonMaxSuppression_796)
    2023-09-06 16:45:59.604117903 [V:onnxruntime:, session_state.cc:1157 VerifyEachNodeIsAssignedToAnEp]  NonZero (/model/NonZero)
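For reference, a minimal Python sketch of how such a node-placement dump can be obtained (the model path is a placeholder; the exact wording of the log lines may differ across ORT versions):

```python
import onnxruntime as ort

so = ort.SessionOptions()
so.log_severity_level = 0  # VERBOSE: node placements per EP are printed at session creation

sess = ort.InferenceSession(
    "model.onnx",  # placeholder path
    sess_options=so,
    providers=["TensorrtExecutionProvider", "CUDAExecutionProvider", "CPUExecutionProvider"],
)
```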

I get the same issue with both the C++ and Python runtime APIs.

To reproduce

I can't share my model for IP reasons, but I see similar issues with the public Detectron2 model zoo faster-rcnn-rpn (see https://github.com/microsoft/onnxruntime/issues/16886 for how to run it) - with that one, even more nodes fall back to CPU and CUDA, among them the nodes listed above. So fixes found while investigating that model may well apply here too.

Urgency

I have been blocked for several months trying to run the model on the TRT EP (see https://github.com/microsoft/onnxruntime/issues/16886; thanks to the ORT staff who helped me), only to find out that it may not be worth it. It looks like I am not far off - only 3 operators/nodes left to move onto the TRT EP - but time is up: in a couple of months I will need to freeze the model to certify the results, with no second chance to certify with TRT FP16 or, better, INT8. I am expecting a 2x perf improvement in TRT FP16 and another 2x improvement in INT8 (accuracy is still excellent in FP16).

Platform

Linux

OS Version

SLES15 SP4

ONNX Runtime Installation

Built from Source

ONNX Runtime Version or Commit ID

1.15.1+ (using latest main for a fix needed to build the TRT EP)

ONNX Runtime API

Python

Architecture

X64

Execution Provider

TensorRT

Execution Provider Library Version

TensorRT 8.6.1

Model File

I can't share mine, but you could use faster-rcnn-rpn from the Detectron2 model zoo (see https://github.com/microsoft/onnxruntime/issues/16886)

Is this a quantized model?

No

datinje avatar Sep 06 '23 17:09 datinje

The onnx-tensorrt parser filters out NonMaxSuppression, NonZero, and RoiAlign; that's why you see those nodes placed on the CUDA/CPU EPs. I also think the many memcpys between CPU/GPU cause this long-latency issue. It's a general issue for RCNN models, meaning you will see many TRT/CUDA/CPU partitions. We will discuss with Nvidia whether it's possible/beneficial to incorporate those nodes (I doubt it) or whether we can find other ways to improve the performance.

chilo-ms avatar Sep 06 '23 19:09 chilo-ms

TensorRT supports the nmsPlugin and roiAlignPlugin. Perhaps we can replace the ONNX NonMaxSuppression and RoiAlign nodes with those two TRT plugins and see the latency?

chilo-ms avatar Sep 06 '23 20:09 chilo-ms

Typically the nodes from NonMaxSuppression onwards are selecting the best bounding boxes. These are relatively cheap operations where it's more efficient to stay on CPU than go back to the GPU. In the NNAPI EP we have the option to set an operator after which NNAPI is not used, and we do that for NonMaxSuppression. Maybe something similar would also work for TRT/CUDA for this type of model.

skottmckay avatar Sep 07 '23 00:09 skottmckay

The onnx-tensorrt parser filters out NonMaxSuppression, NonZero, and RoiAlign; that's why you see those nodes placed on the CUDA/CPU EPs. I also think the many memcpys between CPU/GPU cause this long-latency issue. It's a general issue for RCNN models, meaning you will see many TRT/CUDA/CPU partitions. We will discuss with Nvidia whether it's possible/beneficial to incorporate those nodes (I doubt it) or whether we can find other ways to improve the performance.

So, even if, according to @skottmckay, these 3 operators are cheaper on CPU, can we try to keep them on GPU to avoid the overhead of moving the data between CPU and GPU (in my case images of 13 MB)? Is that the goal/capability of the nmsPlugin and roiAlignPlugin? I am ready to try. Any example of how to do that? Should I modify the model code or the resulting ONNX, or is it a mere declaration in the onnxruntime TensorRT EP configuration? What about the third operator, NonZero? I could not find a plugin for it - is there any possibility to keep it on GPU to avoid memory transfers due to another subgraph split?

datinje avatar Sep 07 '23 07:09 datinje

If I want to test the performance I get by not filtering out these operators - by commenting out the lines at https://github.com/onnx/onnx-tensorrt/blob/main/ModelImporter.cpp#L377 - where should I modify the ModelImporter.cpp file before recompiling onnxruntime?

I am recompiling onnxruntime with the NVIDIA CUDA and TensorRT EPs in my docker image with:

    RUN git clone https://github.com/microsoft/onnxruntime
    WORKDIR /tmp/onnxruntime
    RUN CC=gcc-11 CXX=g++-11 ./build.sh --config RelWithDebInfo --use_cuda --cudnn_home /usr/local/cuda/lib64 --cuda_home /usr/local/cuda/ --use_tensorrt --tensorrt_home /usr/local/TensorRT --build_shared_lib --parallel --skip_tests --allow_running_as_root

(I am using the latest main, as a bug that prevented building the TensorRT EP with ORT 1.15.1 was fixed there.)

datinje avatar Sep 07 '23 09:09 datinje

What if I compile onnxruntime with --use_tensorrt_builtin_parser: will the nodes still be filtered out?

datinje avatar Sep 07 '23 10:09 datinje

No change if I recompile onnxruntime with --use_tensorrt_builtin_parser: the nodes are still placed on CPU.

datinje avatar Sep 07 '23 11:09 datinje

Here are the steps to build the OSS onnx-tensorrt parser without filtering out those operators:

  1. Add --use_tensorrt_oss_parser as one of the ORT build arguments and start building.
  2. At the beginning of the ORT build, you will find the onnx-tensorrt repo being downloaded to ./build/Linux/Debug/_deps/onnx_tensorrt-src; simply comment out those node-filtering lines in ModelImporter.cpp.
  3. Resume the build. Note: you might encounter a build error about CUDA_INCLUDE_DIR not being found. Modify it here to set(CUDA_INCLUDE_DIR ${CMAKE_CUDA_TOOLKIT_INCLUDE_DIRECTORIES}).

I tested the non-filtering onnx-tensorrt parser with the Faster R-CNN from the ONNX model zoo; it can include those nodes for TRT, but it failed to build the TRT engine. I need to investigate further, but you can try your faster-rcnn model.

Update: I checked with Nvidia; those nodes only work with the TRT API enqueueV3, and TRT EP currently uses enqueueV2, so the enqueue error is expected. As for the engine build error that I saw, I will follow up with Nvidia. The TensorRT EP plans to update to the latest TRT APIs, but it's going to take some time.

chilo-ms avatar Sep 07 '23 19:09 chilo-ms

I think we can try the TRT plugins. Please see the doc here. You need to modify the graph and replace RoiAlign and NonMaxSuppression with custom ops that will later map to TRT plugins. (Remember to correctly set the name and domain of the custom node.) Unfortunately, there is no related NonZero plugin for now.
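For illustration, a minimal sketch of that graph edit in Python. The plugin name (EfficientNMS_TRT), the custom-op domain (trt.plugins) and the model paths are assumptions for this example - check the linked doc for the exact name/domain the TRT EP expects - and the plugin's inputs/attributes differ from the ONNX op, so the surrounding graph must be adapted as well:

```python
import onnx
from onnx import helper

model = onnx.load("faster_rcnn_simplified.onnx")  # hypothetical path

PLUGIN_DOMAIN = "trt.plugins"  # assumed custom-op domain mapped to TRT plugins

for node in model.graph.node:
    if node.op_type == "NonMaxSuppression":
        node.op_type = "EfficientNMS_TRT"  # assumed TRT plugin name
        node.domain = PLUGIN_DOMAIN
        # plugin-specific attributes would go here, e.g.:
        # node.attribute.append(helper.make_attribute("score_threshold", 0.05))

# Declare the custom-op domain so the modified model stays loadable.
if not any(op.domain == PLUGIN_DOMAIN for op in model.opset_import):
    model.opset_import.append(helper.make_opsetid(PLUGIN_DOMAIN, 1))

onnx.save(model, "faster_rcnn_trt_plugins.onnx")
```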

chilo-ms avatar Sep 07 '23 21:09 chilo-ms

Thanks a lot @chilo-ms: I will try to integrate the 2 plugins in my model to test the performance improvement. I hope the ONNXRT TRT EP will move to the TRT API enqueueV3 ASAP. Expect some time before my next post, as I am OOO next week.

datinje avatar Sep 08 '23 08:09 datinje

After discussing with NVIDIA how to integrate the plugins, we found out that NMS and NonZero ARE implemented natively in TensorRT, cf.

  • NMS: https://docs.nvidia.com/deeplearning/tensorrt/api/c_api/classnvinfer1_1_1_i_n_m_s_layer.html
  • NonZero: https://docs.nvidia.com/deeplearning/tensorrt/api/c_api/class_i_non_zero.html

For RoiAlign, the only way is via the TRT plugin, but is there a way to have the TRT EP call the native TRT layers to avoid the data transfer between CPU and GPU?

datinje avatar Sep 25 '23 11:09 datinje

In 1.16.0 there is a new session option, disable_cpu_ep_fallback. How can we set it? And will it prevent NonZero and NMS from falling back to the CPU EP?

datinje avatar Sep 25 '23 15:09 datinje

@datinje Last time, Nvidia mentioned that NMS and NonZero are natively supported only via enqueueV3 (TRT EP currently uses enqueueV2). I am currently working on a dev branch to use enqueueV3. Before the dev branch is merged to main, I think you can only try the TRT NMS/NonZero plugins; please see my previous reply for how to use them. (Note: I encountered an engine build error, so I might need to update the engine build API as well. Will let you know once the dev branch is ready and merged to main.)

Please see here for how to use disable_cpu_ep_fallback. But in your case, you still need the CUDA EP or CPU to run those three nodes if you don't want to use TRT plugins. If you use the TRT plugins, the whole model can be run by TRT (whether via native TRT or TRT plugins), so there should be no data transfer between CPU and GPU except for the model input/output.
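For reference, a minimal sketch of setting this option from Python (the session config key below is the one I believe maps to disable_cpu_ep_fallback; the model path is a placeholder):

```python
import onnxruntime as ort

so = ort.SessionOptions()
# Fail at session creation instead of silently running unsupported nodes on the CPU EP.
so.add_session_config_entry("session.disable_cpu_ep_fallback", "1")

sess = ort.InferenceSession(
    "model.onnx",  # placeholder path
    sess_options=so,
    providers=["TensorrtExecutionProvider", "CUDAExecutionProvider"],
)
```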

chilo-ms avatar Sep 29 '23 03:09 chilo-ms

As stated above by @chilo-ms, in 1.16 I tried disabling CPU EP fallback to avoid moving ONNX operators to CPU when the onnxrt parser decides to, but the effect is not to keep the operators on GPU with TRT as expected; it prevents the program from continuing.

Then what is the purpose of this option? The main interest for me would be for ONNXRT to keep the operators on the GPU even if they are faster on CPU, because the overhead of transferring the data offsets the benefit.

    2023-10-31 11:27:23.916547026 [E:onnxruntime:, inference_session.cc:1678 Initialize] This session contains graph nodes that are assigned to the default CPU EP, but fallback to CPU EP has been explicitly disabled by the user.

    Traceback (most recent call last):
      File "/cad-engine/run-onnx-pytorch.model.py", line 299, in <module>
        main()
      File "/cad-engine/run-onnx-pytorch.model.py", line 60, in main
        sess = ort.InferenceSession(model_path, sess_options=sess_opt, providers=providers)
      File "/usr/local/lib/python3.10/dist-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 419, in __init__
        self._create_inference_session(providers, provider_options, disabled_optimizers)
      File "/usr/local/lib/python3.10/dist-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 471, in _create_inference_session
        sess.initialize_session(providers, provider_options, disabled_optimizers)
    onnxruntime.capi.onnxruntime_pybind11_state.Fail: [ONNXRuntimeError] : 1 : FAIL : This session contains graph nodes that are assigned to the default CPU EP, but fallback to CPU EP has been explicitly disabled by the user.

datinje avatar Oct 31 '23 11:10 datinje

Something went wrong in the copy-paste above, sorry; ignore the formatting of the "File ..." lines.

datinje avatar Oct 31 '23 11:10 datinje

@datinje

Then what is the purpose of this option ?

One of the purposes of disable_cpu_ep_fallback is to make sure all the nodes are placed on GPUs before ORT starts to run inference. ORT may place some nodes on CPU for performance, but in some cases that placement may not be what you want. So this option works as a check.

However, in your case the error you got is expected, because the current ORT TRT doesn't support NonZero, NMS and RoiAlign, and the CPU is the only EP left to run those nodes. So you should only use disable_cpu_ep_fallback if all the nodes in your model are supported by ORT TRT; otherwise, you will get this error.

As I mentioned previously, you can try the following steps:

  • Use the branch of this PR

  • Replace the line in deps.txt as below:

          - onnx_tensorrt;https://github.com/onnx/onnx-tensorrt/archive/a43ce67187bab219520fd80f21af8bbd4354bc8c.zip;572535aefef477050f86744dfab1fef840198035
          + onnx_tensorrt;https://github.com/onnx/onnx-tensorrt/archive/bacfaaa951653cd4e72efe727a543567cb38f7de.zip;26434329612e804164ab7baa6ae629ada56c1b26
    
  • Build ORT TRT using this branch with --use_oss_trt_parser

  • Run the model (Don't set disable_cpu_ep_fallback)

then you can see that ORT TRT can run all the nodes of your FasterRCNN model except RoiAlign.

chilo-ms avatar Nov 07 '23 19:11 chilo-ms

Closing, since I realized that with ORT 1.16.3 I succeeded in running my model with TRT, and it is faster than the CUDA EP in TF32.

jcdatin avatar Mar 16 '24 17:03 jcdatin

I tested my model again with the latest onnxrt 1.17.1 and got the same performance results between the TRT EP and the CUDA EP. I would have expected the TRT EP to use the NMS and NonZero TRT operators, since onnxrt 1.17.1 now supports the TRT API enqueueV3, which would allow calling them. Is there any date planned for integrating these TRT ops into the onnxrt TRT EP?

jcdatin avatar Apr 11 '24 09:04 jcdatin

Even the NonZero op seems to be implemented in TRT: could it be supported in the ONNXRT TRT EP? With these 3 operators, ALL of the faster-rcnns would run on TRT and avoid host-to-device memory transfers!

jcdatin avatar Apr 11 '24 10:04 jcdatin

https://docs.nvidia.com/deeplearning/tensorrt/api/c_api/classnvinfer1_1_1_i_non_zero_layer.html

jcdatin avatar Apr 11 '24 10:04 jcdatin

I tested my model again with the latest onnxrt 1.17.1 and got the same performance results between the TRT EP and the CUDA EP. I would have expected the TRT EP to use the NMS and NonZero TRT operators, since onnxrt 1.17.1 now supports the TRT API enqueueV3, which would allow calling them. Is there any date planned for integrating these TRT ops into the onnxrt TRT EP?

@jcdatin Unfortunately, for ORT 1.17.x, the TRT EP doesn't include those DDS operators (NMS/NonZero/RoiAlign). But the current ORT main branch + OSS onnx-tensorrt parser will make the TRT EP use the NMS/NonZero/RoiAlign TRT operators. You can simply build ORT main with --use_oss_trt_parser to achieve this.

We are testing TRT EP + TRT DDS output support (meaning including the NMS/NonZero/RoiAlign operators) to see the performance and then decide whether to enable this feature in the ORT official release.

If you could help test it and provide feedback, that would be great! Thank you!

chilo-ms avatar Apr 11 '24 16:04 chilo-ms

Sure! I will help. Definition: DDS ops means data-dependent (dynamic) shape operators; see https://forums.developer.nvidia.com/t/data-dependent-tensor-shapes-in-tensorrt/194988

jcdatin avatar Apr 11 '24 18:04 jcdatin

Shall --use_oss_trt_parser REPLACE --use_tensorrt_builtin_parser, or simply complement it?

jcdatin avatar Apr 12 '24 17:04 jcdatin

If no parser-related option is specified, or --use_tensorrt_builtin_parser is specified --> TRT EP will dynamically link against the built-in parser. If --use_oss_trt_parser is specified --> ORT will build the onnx-tensorrt parser and TRT EP will statically link against it.

chilo-ms avatar Apr 12 '24 18:04 chilo-ms

For those who read this: the flag is actually --use_tensorrt_oss_parser. Retrying.

jcdatin avatar Apr 13 '24 07:04 jcdatin

Tested 👍

    2024-04-13 13:51:39.673832151 [V:onnxruntime:, session_state.cc:1149 VerifyEachNodeIsAssignedToAnEp] All nodes placed on [TensorrtExecutionProvider]. Number of nodes: 1

Inference is now 3 times faster than before: 1. because there is NO device-to-host transfer anymore, and 2. it also seems that more graph-node fusion optimizations occur. This is incredible. Even in my dreams I could not have believed this. Congratulations to the onnxruntime team, and a big thanks to @chilo-ms who supported me all this time! This onnxruntime release with the TRT EP is a MAJOR improvement!

jcdatin avatar Apr 13 '24 13:04 jcdatin

can't wait for the official release

jcdatin avatar Apr 13 '24 13:04 jcdatin

Forgot to say: of course, the accuracy of the results is the same between the CUDA EP and this new TRT EP.

jcdatin avatar Apr 13 '24 14:04 jcdatin

@chilo-ms, one question though: I am getting the same inference results between ONNXRT + CUDA EP and ONNXRT + TRT EP for the same model and inputs on the same GPU. But the results are totally different between 2 GPUs (Turing sm75 and Ada sm89). My ONNXRT was compiled with the flag --cmake_extra_defines "CMAKE_CUDA_ARCHITECTURES=75"; is this the reason? I was under the impression that this flag meant that all GPUs ABOVE the sm75 architecture would also work the same, albeit not as optimized for performance. Note that I started using this flag to fix this issue: https://github.com/microsoft/onnxruntime/issues/18579

Here is my current onnxrt build command:

    RUN CC=gcc-11 CXX=g++-11 ./build.sh --config RelWithDebInfo --use_cuda --cudnn_home /usr/local/cuda/lib64 --cuda_home /usr/local/cuda/ --use_tensorrt --use_tensorrt_oss_parser --tensorrt_home /usr/local/TensorRT --build_shared_lib --parallel --skip_tests --allow_running_as_root --cmake_extra_defines "CMAKE_CUDA_ARCHITECTURES=75" --cmake_extra_defines "CMAKE_CUDA_HOST_COMPILER=/usr/bin/gcc-11"

I intend to use onnxrt on Turing, Ampere and Ada GPUs (sm75, sm86 and sm89): so is this flag correct? Should I instead list the GPU architectures I will run onnxrt on, as in "-DCMAKE_CUDA_ARCHITECTURES=75;80;90", as stated by @snnn in https://github.com/microsoft/onnxruntime/issues/19606?

jcdatin avatar Apr 13 '24 17:04 jcdatin

When using "-DCMAKE_CUDA_ARCHITECTURES=75;86;89", or without -DCMAKE_CUDA_ARCHITECTURES at all, the onnxrt build fails on my Turing (sm_75) build machine on an sm80 CUDA file:

    cc error : 'cicc' died due to signal 9 (Kill signal)
    gmake[2]: *** [CMakeFiles/onnxruntime_providers_cuda.dir/build.make:5808: CMakeFiles/onnxruntime_providers_cuda.dir/tmp/onnxruntime/onnxruntime/contrib_ops/cuda/bert/flash_attention/flash_fwd_hdim160_bf16_sm80.cu.o] Error 9

I will try to build on the Ada board, but I would like to understand what is going on.

jcdatin avatar Apr 14 '24 15:04 jcdatin