TensorRT-LLM
enc-dec triton backend support
Hi, is there any update on when enc-dec models like T5 will get TRT-LLM Triton backend support? Posting an issue for awareness; I just wanted to know if it's still being planned. Thanks in advance!
https://github.com/NVIDIA/TensorRT-LLM/discussions/424#discussioncomment-7732258
Hi @shannonphu , yes we're working on it. Right now it's at the stage of adding the C++ runtime. Tentative date for Triton enc-dec support is around mid to late January. Thanks for your patience
Does it also include continuous batching?
> Does it also include continuous batching?

Our current plan is to get there in steps: (1) C++ runtime, (2) regular Triton support, (3) continuous batching. Eventually we want to enable continuous batching, but the mid-to-late-January release will most likely include only (1) and (2), with (3) coming right after.
@symphonylyh Could you share if there's an update on this?
Hi, is there an update on this?
Hi @shannonphu, @sihanwang41, @mlmonk, @shixianc, we have been actively working on this support, but we are finding that the amount of work is more than expected, since we want a good implementation that supports enc-dec and, more generally, such 2-stage pipelines.
May I use this thread to collect your feedback so we can understand your needs and prioritize better? I know @sihanwang41 specifically asked about continuous batching, i.e., inflight batching, but others didn't share details of their requests. Can you reply describing which of the following would be helpful and could unblock you first?
(1) A Triton Python backend to run enc-dec models
(2) A C++ runtime (no Triton) to run enc-dec models, without inflight batching
(3) A Triton C++ backend to run enc-dec models, without inflight batching
(4) A Triton C++ backend with paged KV cache and inflight batching for enc-dec <-- final goal
Thanks
@symphonylyh Thanks for the update! Starting with (3) would unblock our team.
May I assume this would also have the classic dynamic batching supported?
Got it, thanks for the input. By dynamic batching, do you mean Triton's dynamic batching, which has nothing to do with the inflight/continuous batching concept? If so, yes.
@symphonylyh (1) and/or (3). I am not super clear on the difference between the Python and C++ backends. I was using this to build the engine: https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/enc_dec/README.md
We have been able to use Triton with enc_dec models, so I'm not sure what the difference between that and (1) is. We find that the TPS for that implementation is quite slow and are looking for ways to make it faster.
Agree that the end goal is (3).
@mlmonk Oh interesting, I was under the impression that we just couldn't serve T5 models on Triton yet because the TRT-LLM backend wasn't ready for it.
@symphonylyh @shannonphu We have been able to use Flan-T5 with Triton. I believe this is (1). You can reproduce it here. Note that this uses a much older version of both libraries, from when Flan-T5 was not officially supported.
Like @shixianc mentioned, (3) would unblock us and (4) would be the ideal state. It would be great if you could share how far along you are with the (3) release.
Hey @symphonylyh, do you have any updates on the progress?
@symphonylyh, any progress?
Hello @symphonylyh, is there any progress on any of (1)-(4)?
We would love (1)
Hi @shannonphu, @sihanwang41, @mlmonk, @shixianc, @LuckyL00ser, @XiaobingSuper, @TeamSeshDeadBoy, @mrmuke
As part of today's release #1725, the enc-dec C++ runtime has been successfully implemented with inflight batching and paged KV cache. Please give it a try following the C++ runtime section of the README. This directly corresponds to (4) above, with the Triton backend being added next.
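For anyone wanting to try this from Python before the Triton backend lands, here is a minimal sketch of driving the enc-dec runtime via `ModelRunnerCpp`. The engine path, the `is_enc_dec` flag, and the `encoder_input_ids` keyword are assumptions based on the enc-dec example; check the enc-dec README (examples/enc_dec) for the exact interface in your TensorRT-LLM version.

```python
# A minimal sketch (not the official example) of running an enc-dec engine
# through the Python binding of the C++ runtime. Names marked "assumption"
# may differ in your TensorRT-LLM version.
import torch
from transformers import AutoTokenizer
from tensorrt_llm.runtime import ModelRunnerCpp

tokenizer = AutoTokenizer.from_pretrained("t5-small")  # example model choice

runner = ModelRunnerCpp.from_dir(
    engine_dir="trt_engines/t5-small",  # hypothetical path with encoder/ and decoder/ sub-dirs
    is_enc_dec=True,                    # assumption: flag enabling enc-dec mode
)

encoder_ids = tokenizer(
    "translate English to German: The house is wonderful.",
    return_tensors="pt",
).input_ids[0].to(torch.int32)

# T5 decoding starts from the pad token (decoder_start_token_id == pad_token_id).
decoder_start = torch.tensor([tokenizer.pad_token_id], dtype=torch.int32)

outputs = runner.generate(
    batch_input_ids=[decoder_start],
    encoder_input_ids=[encoder_ids],    # assumption: kwarg for the encoder pass
    max_new_tokens=64,
    end_id=tokenizer.eos_token_id,
    pad_id=tokenizer.pad_token_id,
)
print(tokenizer.decode(outputs[0][0], skip_special_tokens=True))
```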
Our roadmap for the near future:
- Triton C++ backend is almost ready and to be released soon
- Multi-GPU support
Thanks for the update! This is excellent news, I'm sure it was a lot of effort to make it happen.
Hello @symphonylyh, is there any progress on adding (1)?
@HamzaG737 it's fully supported now. For (1), the Triton backend, you can follow the guide here: https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/docs/encoder_decoder.md.
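As a quick illustration, a minimal client call against a deployment set up per that guide could look like the sketch below. The model name (`ensemble`) and tensor names (`text_input`, `max_tokens`, `text_output`) are assumptions based on the standard tensorrtllm_backend ensemble; check the config.pbtxt files generated for your deployment.

```python
# A minimal sketch of querying the enc-dec Triton deployment over HTTP.
# Model and tensor names are assumptions; adjust them to match your config.pbtxt.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

text = np.array([["translate English to German: The house is wonderful."]],
                dtype=object)
max_tokens = np.array([[64]], dtype=np.int32)

inputs = [
    httpclient.InferInput("text_input", list(text.shape), "BYTES"),
    httpclient.InferInput("max_tokens", list(max_tokens.shape), "INT32"),
]
inputs[0].set_data_from_numpy(text)
inputs[1].set_data_from_numpy(max_tokens)

result = client.infer(model_name="ensemble", inputs=inputs)
print(result.as_numpy("text_output"))
```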
Also, closing this issue as support has been added.