
enc-dec triton backend support

Open shannonphu opened this issue 1 year ago • 5 comments

Hi, is there any update on when enc-dec models like T5 will get TRT-LLM Triton backend support? Posting an issue for awareness; I just wanted to know if it's still being planned. Thanks in advance!

https://github.com/NVIDIA/TensorRT-LLM/discussions/424#discussioncomment-7732258

shannonphu avatar Jan 03 '24 23:01 shannonphu

Hi @shannonphu , yes we're working on it. Right now it's at the stage of adding the C++ runtime. Tentative date for Triton enc-dec support is around mid to late January. Thanks for your patience

symphonylyh avatar Jan 07 '24 00:01 symphonylyh

Does it also include continuous batching?

sihanwang41 avatar Jan 08 '24 18:01 sihanwang41

Our current plan is to get there in steps: (1) C++ runtime, (2) regular Triton support, (3) continuous batching. Eventually we want to enable continuous batching, but the mid-to-late January release will more likely include only (1) and (2), with (3) coming right after.

symphonylyh avatar Jan 08 '24 23:01 symphonylyh

@symphonylyh Could you share if there's an update on this?

mlmonk avatar Feb 01 '24 20:02 mlmonk

Hi, is there an update on this?

shixianc avatar Feb 09 '24 04:02 shixianc

Hi @shannonphu, @sihanwang41, @mlmonk, @shixianc, we have been actively working on this support, but we are finding the amount of work is more than expected, since we want a good implementation that supports enc-dec and, more generally, this kind of 2-stage pipeline.

May I use this thread to collect your feedback so we can better understand your needs and prioritize. I know @sihanwang41 specifically asked about continuous batching, i.e., inflight batching, but others didn't share details of their requests. Can you reply describing which of (1)-(4) would be helpful and could unblock you first:

(1) a Triton Python backend to run enc-dec models
(2) a C++ runtime (no Triton) to run enc-dec models, without inflight batching
(3) a Triton C++ backend to run enc-dec models, without inflight batching
(4) a Triton C++ backend, with paged KV cache and inflight batching for enc-dec <-- final goal

Thanks
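For context, here is a minimal sketch of what option (1) could look like as a Triton Python backend `model.py`. The `EncDecRunner` wrapper, engine path, and tensor names are hypothetical placeholders, not the official backend:

```python
# Hedged sketch of a Triton Python backend serving an enc-dec (e.g. T5) engine.
# EncDecRunner is a hypothetical wrapper around the examples/enc_dec runtime;
# tensor names and the engine path are illustrative only.
import numpy as np
import triton_python_backend_utils as pb_utils

from my_encdec_runner import EncDecRunner  # hypothetical helper, not part of TRT-LLM


class TritonPythonModel:
    def initialize(self, args):
        # Load the prebuilt encoder/decoder engines once per model instance.
        self.runner = EncDecRunner(engine_dir="/models/t5/engines")

    def execute(self, requests):
        responses = []
        for request in requests:
            # INPUT_IDS: int32 token IDs of the tokenized source text.
            input_ids = pb_utils.get_input_tensor_by_name(request, "INPUT_IDS").as_numpy()
            output_ids = self.runner.generate(input_ids, max_new_tokens=64)
            out = pb_utils.Tensor("OUTPUT_IDS", np.asarray(output_ids, dtype=np.int32))
            responses.append(pb_utils.InferenceResponse(output_tensors=[out]))
        return responses

    def finalize(self):
        self.runner = None
```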

symphonylyh avatar Feb 22 '24 01:02 symphonylyh

@symphonylyh Thanks for the update! Starting with (3) would unblock our team.

May I assume this would also support classic dynamic batching?

shixianc avatar Feb 22 '24 05:02 shixianc

Got it, thanks for the input. By dynamic batching, do you mean Triton's dynamic batching, which has nothing to do with the inflight/continuous batching concept? If so, yes.
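To illustrate the distinction, here is a hedged sketch of a minimal `config.pbtxt` with Triton's classic dynamic batching enabled, generated from Python to keep the snippet self-contained. The model and tensor names are placeholders, not an official TRT-LLM backend configuration:

```python
# Hedged sketch: write a minimal Triton config.pbtxt with classic dynamic
# batching enabled. Model/tensor names are placeholders matching the Python
# backend sketch above, not an official enc-dec backend config.
from pathlib import Path

CONFIG = """\
name: "t5_encdec"
backend: "python"
max_batch_size: 8

input [
  { name: "INPUT_IDS", data_type: TYPE_INT32, dims: [ -1 ] }
]
output [
  { name: "OUTPUT_IDS", data_type: TYPE_INT32, dims: [ -1 ] }
]

# Server-side request batching: Triton groups queued requests into one batch
# before calling execute(). This is independent of TRT-LLM's inflight
# (continuous) batching, which interleaves sequences inside the runtime.
dynamic_batching {
  max_queue_delay_microseconds: 1000
}
"""

model_dir = Path("model_repository/t5_encdec")
model_dir.mkdir(parents=True, exist_ok=True)
(model_dir / "config.pbtxt").write_text(CONFIG)
```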

symphonylyh avatar Feb 23 '24 10:02 symphonylyh

@symphonylyh (1) and/or (3). I am not super clear on the difference between the Python and C++ backends. I was using this to build the engine: https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/enc_dec/README.md

shannonphu avatar Feb 23 '24 22:02 shannonphu

We have been able to use Triton with enc_dec models, so I'm not sure what the difference between that and (1) is. We find that the TPS for that implementation is quite slow and are looking for ways to make it faster.

Agree that the end goal is (3).

mlmonk avatar Feb 24 '24 00:02 mlmonk

@mlmonk Oh interesting, I was under the impression that we couldn't serve T5 models on Triton yet because the TRT-LLM backend wasn't ready for it.

shannonphu avatar Feb 24 '24 05:02 shannonphu

@symphonylyh @shannonphu We have been able to use Flan-T5 with Triton. I believe this is (1). You can reproduce it here. Note that this was with much older versions of both libraries, before Flan-T5 was officially supported.

Like @shixianc mentioned, (3) would unblock us and (4) would be the ideal state. It would be great if you could share how far along you are with the (3) release.

mlmonk avatar Mar 07 '24 16:03 mlmonk

hey @symphonylyh , do you have any updates on the progress?

LuckyL00ser avatar Mar 13 '24 10:03 LuckyL00ser

@symphonylyh, any progress?

XiaobingSuper avatar Apr 11 '24 08:04 XiaobingSuper

Hello @symphonylyh, is there any progress on any of (1)-(4)?

TeamSeshDeadBoy avatar May 08 '24 12:05 TeamSeshDeadBoy

We would love (1)

mrmuke avatar May 13 '24 23:05 mrmuke

Hi @shannonphu, @sihanwang41, @mlmonk, @shixianc, @LuckyL00ser, @XiaobingSuper, @TeamSeshDeadBoy, @mrmuke,

As part of today's release #1725, the enc-dec C++ runtime has been implemented with inflight batching and paged KV cache. Please give it a try following the README's C++ runtime section. This directly corresponds to (4) above, with the Triton backend being added next.

Our near-term roadmap:

  1. Triton C++ backend: almost ready and to be released soon
  2. Multi-GPU support

symphonylyh avatar Jun 04 '24 16:06 symphonylyh

Thanks for the update! This is excellent news, I'm sure it was a lot of effort to make it happen.

mlmonk avatar Jun 05 '24 03:06 mlmonk

Hello @symphonylyh, is there any progress on adding (1)?

HamzaG737 avatar Jul 09 '24 09:07 HamzaG737

@HamzaG737 it's full-fledged now. For (1), the Triton backend, you can follow the guide here: https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/docs/encoder_decoder.md
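For reference, a hedged client-side sketch against a Triton server running the TensorRT-LLM backend. The model name (`ensemble`) and tensor names (`text_input`, `max_tokens`, `text_output`) follow the common tensorrtllm_backend ensemble layout and are assumptions here; consult encoder_decoder.md for the exact enc-dec configuration:

```python
# Hedged sketch of a Triton client request for an enc-dec model served through
# the TensorRT-LLM backend. Model/tensor names are assumptions based on the
# common ensemble layout; see docs/encoder_decoder.md for the exact setup.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

text = np.array([["translate English to German: Hello, world!"]], dtype=object)
max_tokens = np.array([[64]], dtype=np.int32)

inputs = [
    httpclient.InferInput("text_input", list(text.shape), "BYTES"),
    httpclient.InferInput("max_tokens", list(max_tokens.shape), "INT32"),
]
inputs[0].set_data_from_numpy(text)
inputs[1].set_data_from_numpy(max_tokens)

result = client.infer(model_name="ensemble", inputs=inputs)
print(result.as_numpy("text_output"))
```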

Also, closing this issue as support has been added.

symphonylyh avatar Jul 10 '24 05:07 symphonylyh