
[Help wanted] Support TensorRT

Open csukuangfj opened this issue 2 years ago • 11 comments

TODO

  • [ ] Support GPU via TensorRT

See https://onnxruntime.ai/docs/execution-providers/TensorRT-ExecutionProvider.html
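For reference, a minimal sketch (not the sherpa-onnx implementation) of attaching the ONNX Runtime TensorRT execution provider to a session via the C++ API; the function name, model path, and option values below are illustrative only:

```cpp
#include "onnxruntime_cxx_api.h"

// Hypothetical helper: build a session that runs on TensorRT.
Ort::Session CreateTensorrtSession(Ort::Env &env, const char *model_path) {
  Ort::SessionOptions session_options;

  OrtTensorRTProviderOptions trt_options{};
  trt_options.device_id = 0;
  trt_options.trt_fp16_enable = 1;  // allow FP16 kernels where supported

  // Nodes that TensorRT cannot handle fall back to whatever providers are
  // appended after it (e.g. CUDA, then the default CPU provider).
  session_options.AppendExecutionProvider_TensorRT(trt_options);

  return Ort::Session(env, model_path, session_options);
}
```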

csukuangfj · Feb 20 '23 07:02

I would like to take on this.

  • [ ] Support the ONNX Runtime CUDA provider.
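A minimal sketch, assuming the ONNX Runtime C++ API, of appending the CUDA execution provider; the option values are illustrative:

```cpp
#include "onnxruntime_cxx_api.h"

void AddCudaProvider(Ort::SessionOptions &session_options) {
  OrtCUDAProviderOptions cuda_options{};
  cuda_options.device_id = 0;
  // Exhaustive cuDNN algorithm search trades a longer warm-up for faster
  // steady-state convolutions.
  cuda_options.cudnn_conv_algo_search = OrtCudnnConvAlgoSearchExhaustive;
  session_options.AppendExecutionProvider_CUDA(cuda_options);
}
```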

yuekaizhang · Apr 25 '23 13:04

Hi @csukuangfj , @yuekaizhang

I observed that currently only CUDA EP support is available and TensorRT EP support is missing for onnxruntime. Is there any active development going on for a TensorRT GPU backend?

manickavela29 · Mar 14 '24 10:03

Is there any active development going on for a TensorRT GPU backend?

We don't have a plan to support it in the near future. Would you like to contribute?

csukuangfj · Mar 14 '24 12:03

I tried triggering onnxruntime's TensorRT EP for the zipformer, but the model performance was very bad. I am debugging further with standalone onnxruntime in Python for the encoder models and will update if I see some good results.

manickavela29 · Mar 29 '24 07:03

Hi @csukuangfj, TensorRT has several parameters, and they are only valid when the TensorRT provider is chosen, so I need your suggestion on one of the two options below.

  1. Put the TRT configs in the existing model-config.cc file.
  2. Create a new config for TRT and expose the required parameters from it.

Thank you

manickavela29 · May 27 '24 13:05

Could you create a new config for TensorRT and add this config as a member field of OnlineModelConfig and OfflineModelConfig?

You can set the default values of this config as the one used in https://github.com/k2-fsa/sherpa-onnx/blob/b7148174739275dfc997af726be364245511239c/sherpa-onnx/csrc/session.cc#L137-L150
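For illustration, a hedged sketch of that suggestion, assuming the Register/Validate/ToString pattern used by the existing sherpa-onnx configs; the struct name TensorrtConfig, its fields, and the defaults here are placeholders and should be aligned with the values in session.cc linked above:

```cpp
#include <cstdint>
#include <string>

// ParseOptions is sherpa-onnx's command-line option parser
// (sherpa-onnx/csrc/parse-options.h).
class ParseOptions;

// Hypothetical config; field names mirror the TensorRT EP option names,
// and the defaults are illustrative, not the final API.
struct TensorrtConfig {
  int64_t trt_max_workspace_size = 1 << 30;  // illustrative: 1 GB
  int32_t trt_max_partition_iterations = 10;
  int32_t trt_min_subgraph_size = 5;
  bool trt_fp16_enable = true;
  bool trt_engine_cache_enable = true;
  std::string trt_engine_cache_path = ".";

  void Register(ParseOptions *po);  // expose each field as a command-line flag
  bool Validate() const;            // e.g. reject a non-positive workspace size
  std::string ToString() const;
};

struct OnlineModelConfig {
  // ... existing fields ...
  TensorrtConfig trt_config;  // the new member field suggested above
};
```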

csukuangfj · May 28 '24 03:05

Yes, I will send the PR for the configs separately in some time.

manickavela29 · Jun 03 '24 10:06

Current perf, TensorRT vs. CUDA

TensorRT:

```
csrc/online-zipformer2-transducer-model.cc:RunEncoder:445 Encoder Duration : 1.930044 ms
csrc/online-zipformer2-transducer-model.cc:RunEncoder:445 Encoder Duration : 0.034984 ms
csrc/online-zipformer2-transducer-model.cc:RunEncoder:445 Encoder Duration : 0.034912 ms
csrc/online-websocket-server-impl.cc:Run:256 Warm up completed : 3 times.
csrc/online-websocket-server.cc:main:79 Started!
csrc/online-websocket-server.cc:main:80 Listening on: 6007
csrc/online-websocket-server.cc:main:81 Number of work threads: 8
```

CUDA:

```
csrc/online-zipformer2-transducer-model.cc:RunEncoder:445 Encoder Duration : 0.535651 ms
csrc/online-zipformer2-transducer-model.cc:RunEncoder:445 Encoder Duration : 0.187492 ms
csrc/online-zipformer2-transducer-model.cc:RunEncoder:445 Encoder Duration : 0.187698 ms
```

Apart from this, TensorRT has a very long session-creation time. That is expected; the only way to handle it is to cache the built engine images.
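For reference, a minimal sketch, assuming the TensorRT EP's V2 provider-options API in ONNX Runtime, of enabling the engine cache so built engines are reused across runs instead of being rebuilt at every session creation; the cache path is illustrative:

```cpp
#include "onnxruntime_cxx_api.h"

void EnableTrtEngineCache(Ort::SessionOptions &session_options) {
  const OrtApi &api = Ort::GetApi();

  // Allocate the V2 options struct, which accepts string key/value pairs.
  OrtTensorRTProviderOptionsV2 *trt_options = nullptr;
  Ort::ThrowOnError(api.CreateTensorRTProviderOptions(&trt_options));

  const char *keys[] = {"trt_engine_cache_enable", "trt_engine_cache_path"};
  const char *values[] = {"1", "/var/cache/trt_engines"};  // path is illustrative
  Ort::ThrowOnError(
      api.UpdateTensorRTProviderOptions(trt_options, keys, values, 2));

  Ort::ThrowOnError(api.SessionOptionsAppendExecutionProvider_TensorRT_V2(
      session_options, trt_options));

  api.ReleaseTensorRTProviderOptions(trt_options);
}
```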

manickavela29 · Jun 04 '24 10:06

May I know the results for the CPU provider, if you have them? Also, could you explain why there are three lines in each block, e.g. 0.535651 ms, 0.187492 ms, 0.187698 ms? @manickavela29

yuekaizhang · Jun 05 '24 02:06

I can try to get CPU numbers, but I don't have a high-performance CPU.

(In the meantime, someone could add support for the DNNL EP 🙂)

But the focus here is on the GPU, CUDA vs. TensorRT; is CPU benchmarking relevant?

The code blocks are just performance logging that I added for the zipformer; they are not part of the patch. The three durations correspond to the three warm-up runs ("Warm up completed : 3 times." in the log).

manickavela29 · Jun 05 '24 02:06

Hi @csukuangfj https://github.com/k2-fsa/sherpa-onnx/pull/992

I will create the configs for all execution providers together and integrate them with the sessions. Let me know if you have any other thoughts; this is still WIP.

manickavela29 · Jun 12 '24 02:06