
Possible to convert EncDecRNNTBPEModel into TensorRT and run it through Triton inference server?

Open jingzhaoo opened this issue 11 months ago • 1 comment

Is your feature request related to a problem? Please describe.

I would like to host the parakeet-tdt-0.6b-v2 model efficiently through the Triton inference server. I can run the model with the Python backend, as shown here. Are there any docs guiding me on how to convert the model to TensorRT engines and run them through the Triton inference server? I appreciate your help!

Describe the solution you'd like

Convert the parakeet-tdt-0.6b-v2 model into TensorRT engines, and run it through Triton inference server.

Describe alternatives you've considered

Run the model as a PyTorch model, which is already quite fast; I just want to run it even faster.

jingzhaoo avatar May 06 '25 06:05 jingzhaoo

You can export the Fast-Conformer encoder to TensorRT (via ONNX). TensorRT does not support the transducer (RNNT/TDT) decoding loop, so someone would have to implement it as a TRT plugin.
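To illustrate why the transducer side is the sticking point: greedy RNNT/TDT decoding emits a data-dependent number of tokens per encoder frame, so the inner loop's length is not known until runtime, which does not map onto a static TensorRT graph. Below is a minimal NumPy sketch of that loop; the joint and prediction networks are dummy stand-ins, and all shapes, weights, and names are illustrative, not taken from the parakeet model.

```python
import numpy as np

# Toy greedy transducer (RNNT-style) decoding. The number of tokens
# emitted at each frame depends on the data, which is the control flow
# TensorRT cannot express without a custom plugin.

BLANK = 0                  # blank symbol index
VOCAB = 5                  # toy vocabulary size (incl. blank)
MAX_SYMBOLS_PER_FRAME = 3  # cap on emissions per encoder frame

rng = np.random.default_rng(0)
W_enc = rng.standard_normal((4, VOCAB))  # dummy joint-net weights
W_dec = rng.standard_normal((4, VOCAB))

def joint(enc_t, dec_state):
    """Dummy joint network: combines one encoder frame with the decoder state."""
    return enc_t @ W_enc + dec_state @ W_dec

def dec_step(state, token):
    """Dummy prediction network: updates decoder state after emitting a token."""
    return np.tanh(state + 0.1 * token)

def greedy_decode(enc):
    """enc: (T, 4) encoder outputs -> list of emitted token ids."""
    state = np.zeros(4)
    tokens = []
    for t in range(enc.shape[0]):
        # Inner loop length is data-dependent: keep emitting at frame t
        # until the joint predicts blank (or we hit the per-frame cap).
        for _ in range(MAX_SYMBOLS_PER_FRAME):
            k = int(np.argmax(joint(enc[t], state)))
            if k == BLANK:
                break
            tokens.append(k)
            state = dec_step(state, k)
    return tokens

enc = rng.standard_normal((6, 4))  # pretend encoder output, T=6 frames
print(greedy_decode(enc))
```

The encoder, by contrast, is a fixed feed-forward graph over the input frames, which is why it exports cleanly to ONNX and then to a TRT engine while the loop above stays on the host (e.g. in a Triton Python backend).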

anand-nv avatar May 08 '25 13:05 anand-nv

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions[bot] avatar Jun 08 '25 02:06 github-actions[bot]

This issue was closed because it has been inactive for 7 days since being marked as stale.

github-actions[bot] avatar Jun 15 '25 02:06 github-actions[bot]

@anand-nv any update on this ?

Can we now fully convert Parakeet to TRT, or does it still not support the transducer (RNNT/TDT)? I would be happy to submit a PR to TRT to support this.

StephennFernandes avatar Oct 24 '25 04:10 StephennFernandes