TensorRT 10.3 is 3+ times slower than PyTorch when running inference on A30 and 4090 GPUs

Open CallmeZhangChenchen opened this issue 1 year ago • 12 comments

Description

Under the same conditions, my model's inference is several times slower with TensorRT than with PyTorch.

Environment

TensorRT Version: 10.3.0 (trtexec reports [TensorRT v100300])

NVIDIA GPU: A30 & 4090

NVIDIA Driver Version: 535.104.05

CUDA Version: release 12.4, V12.4.131

CUDNN Version: **

Operating System:

Python Version (if applicable):

Tensorflow Version (if applicable):

PyTorch Version (if applicable):

Baremetal or Container (if so, version):

Relevant Files

Model link:

https://drive.google.com/file/d/1V3wZFEyO6s3szE6tPhofa-bkY0Lqwu8M/view?usp=drive_link

Steps To Reproduce

./TensorRT-10.3.0.26/bin/trtexec --onnx=test_sim.onnx  --fp16 --shapes=phone:1x898x768,phone_lengths:1,pitch:1x898,pitchf:1x898,ds:1,rnd:1x192x898 --saveEngine=test.engine --builderOptimizationLevel=5
[08/26/2024-08:17:24] [I] GPU Compute Time: min = 817.994 ms, max = 820.003 ms, mean = 818.733 ms, median = 818.609 ms, percentile(90%) = 819.845 ms, percentile(95%) = 820.003 ms, percentile(99%) = 820.003 ms

PyTorch with the same input/output sizes, plus pre- and post-processing, needs only about 300 ms.
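For a fair comparison, GPU timing on the PyTorch side needs explicit synchronization before reading the timers. A minimal sketch of such a measurement, where model and inputs are placeholders rather than the actual RVC inference script:

import torch

# Minimal GPU timing sketch; `model` and `inputs` stand in for the real
# workload being compared against the TensorRT engine.
def time_gpu_ms(model, inputs, warmup=10):
    model = model.eval().cuda()
    inputs = tuple(t.cuda() for t in inputs)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    with torch.no_grad():
        for _ in range(warmup):      # warm up kernels before timing
            model(*inputs)
        torch.cuda.synchronize()
        start.record()
        model(*inputs)
        end.record()
        torch.cuda.synchronize()     # wait for the GPU to finish before reading the timer
    return start.elapsed_time(end)   # milliseconds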

CallmeZhangChenchen avatar Aug 26 '24 10:08 CallmeZhangChenchen

[08/26/2024-08:05:39] [W] [TRT] Engine generation failed with backend strategy 4.
Error message: [randomFill.cpp::replaceFillNodesForMyelin::89] Error Code 2: Internal Error (Assertion node->backend == Backend::kMYELIN failed. ).
Skipping this backend strategy.

There was a warning when the model was converted.

CallmeZhangChenchen avatar Aug 26 '24 10:08 CallmeZhangChenchen

Image

I think I found out why. I'll take the time to study it.

CallmeZhangChenchen avatar Aug 29 '24 09:08 CallmeZhangChenchen

According to the issue, the problem seems to be with the node that was offloaded to one of our backend DL graph compilers, so we can investigate it internally. Can you confirm the source of the screenshot showing the ForeignNode?

moraxu avatar Aug 30 '24 21:08 moraxu

@moraxu Thanks for your attention.

Using nsys profile -o analysis_test trtexec ***, I exported a profile and opened it in Nsight Systems. One operation was taking 1.6 s.
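For reference, the full invocation is shaped like this (the trtexec arguments after nsys are my guess at what was elided above, reusing the engine built earlier):

nsys profile -o analysis_test ./TensorRT-10.3.0.26/bin/trtexec --loadEngine=test.engine --shapes=phone:1x898x768,phone_lengths:1,pitch:1x898,pitchf:1x898,ds:1,rnd:1x192x898

This produces analysis_test.nsys-rep, which the Nsight Systems GUI opens to show per-kernel GPU timelines.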

The main time-consuming span is between the input pitchf and /dec/m_source/l_tanh/Tanh, so my workaround for now is to move this part out of the network and run it in PyTorch instead.
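One way to carve that span out of the graph for isolated benchmarking is onnx.utils.extract_model. A sketch, where the output tensor name is a guess based on the node mentioned above (read the actual name from the graph, e.g. in Netron, first):

import onnx.utils

# Sketch: extract the slow subgraph (input pitchf through the Tanh node)
# so it can be timed or replaced in isolation.
onnx.utils.extract_model(
    "test_sim.onnx",          # full exported model
    "slow_subgraph.onnx",     # extracted slow span
    input_names=["pitchf"],
    output_names=["/dec/m_source/l_tanh/Tanh_output_0"],  # hypothetical tensor name
)

The remaining graph would then take the subgraph's output as a new input, which matches the "run this part in PyTorch" split described above.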

CallmeZhangChenchen avatar Sep 02 '24 03:09 CallmeZhangChenchen

@CallmeZhangChenchen , sorry for the late follow up - is this on Windows 10 or 11? If not, could you provide a specific OS version for us to reproduce?

moraxu avatar Sep 06 '24 23:09 moraxu

@moraxu Thanks! OS version: Ubuntu 22.04.4 LTS

CallmeZhangChenchen avatar Sep 09 '24 06:09 CallmeZhangChenchen

I've filed an internal bug, thank you.

moraxu avatar Sep 09 '24 18:09 moraxu

@CallmeZhangChenchen, could you provide the PyTorch inference script as well? The issue is about a comparison with PyTorch; it could be that TRT has a bug, or it could be that the PyTorch script is not actually doing the same workload.

Could you also provide the full trtexec --verbose log from your end, if possible?

moraxu avatar Sep 10 '24 22:09 moraxu

@moraxu I may not be able to provide a complete, runnable PyTorch script, because I have optimized the code on my side, and the model has now gone from 800 ms down to 27 ms.

The original PyTorch project: https://github.com/RVC-Project/Retrieval-based-Voice-Conversion-WebUI

The ONNX export script may not run smoothly: https://github.com/RVC-Project/Retrieval-based-Voice-Conversion-WebUI/blob/main/infer/modules/onnx/export.py
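For context, the export interface roughly matches the input names and shapes passed to trtexec above. A sketch with a stand-in module so it runs end to end (the real model class, dummy values, and output name differ; this is not the actual export.py code):

import torch

class StandIn(torch.nn.Module):
    # Stand-in so the sketch runs; the real model is the RVC synthesizer
    # from models_onnx.py with its checkpoint weights loaded.
    def forward(self, phone, phone_lengths, pitch, pitchf, ds, rnd):
        # Touch every input so tracing keeps all six graph inputs.
        s = phone_lengths.float() + pitch.float().mean() + ds.float()
        return (phone.mean() + pitchf.mean() + rnd.mean() + s).reshape(1, 1, 1)

dummy = (
    torch.randn(1, 898, 768),               # phone
    torch.tensor([898]),                    # phone_lengths
    torch.zeros(1, 898, dtype=torch.long),  # pitch
    torch.randn(1, 898),                    # pitchf
    torch.tensor([0]),                      # ds (speaker id)
    torch.randn(1, 192, 898),               # rnd
)
torch.onnx.export(
    StandIn(), dummy, "test_sim_sketch.onnx",
    input_names=["phone", "phone_lengths", "pitch", "pitchf", "ds", "rnd"],
    output_names=["audio"],   # hypothetical output name
    opset_version=17,
)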

trtexec --verbose log: https://drive.google.com/file/d/1Uc_m2gP9QhjussV-rkJRsLPp7AdE2XLE/view?usp=drive_link

To get rid of the time-consuming part, I modified https://github.com/RVC-Project/Retrieval-based-Voice-Conversion-WebUI/blob/main/infer/lib/infer_pack/models_onnx.py:

def forward(self, x, upp=None):
    # Bypass the sine-source generator (l_sin_gen), which contains the slow
    # span, and feed x straight into the linear + tanh instead.
    # sine_wavs, uv, _ = self.l_sin_gen(x, upp)
    # if self.is_half:
    #     sine_wavs = sine_wavs.half()
    # sine_merge = self.l_tanh(self.l_linear(sine_wavs))
    sine_merge = self.l_tanh(self.l_linear(x))
    return sine_merge, None, None  # noise, uv

This part takes only a few ms in PyTorch.

Cropped onnx, https://drive.google.com/file/d/1ucjIDLpJfOMFIWVY8NKav6fa05KF4icd/view?usp=drive_link

CallmeZhangChenchen avatar Sep 11 '24 03:09 CallmeZhangChenchen

Thank you, I'll pass the info on

moraxu avatar Sep 11 '24 17:09 moraxu

@CallmeZhangChenchen the slowness here comes from the "CumSum" op, which is known to be very slow in TensorRT. We plan to fix this issue in TRT 10.6.
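For anyone hitting the same wall: in this model the CumSum most likely comes from the phase accumulation in the sine-source generator that the earlier comments isolated. A minimal reproducer sketch (my own construction, with a made-up tensor size) to time the op alone:

import torch

class CumSumOnly(torch.nn.Module):
    def forward(self, x):
        # torch.cumsum lowers to the ONNX CumSum op (opset >= 11)
        return torch.cumsum(x, dim=1)

x = torch.randn(1, 430_000)  # made-up length, on the order of an upsampled f0 track
torch.onnx.export(CumSumOnly(), (x,), "cumsum_only.onnx",
                  input_names=["x"], output_names=["y"], opset_version=17)

Building that with trtexec --onnx=cumsum_only.onnx --fp16 and comparing against torch.cumsum on the same tensor isolates the regression.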

moraxu avatar Sep 16 '24 17:09 moraxu

Thank you, I look forward to the update.

CallmeZhangChenchen avatar Sep 18 '24 03:09 CallmeZhangChenchen