TensorRT-LLM
enc-dec triton backend support
Hi, is there any update on when enc-dec models like T5 will get TRT-LLM Triton backend support? Posting an issue for awareness; I just wanted to know if it's still being planned. Thanks in advance!
https://github.com/NVIDIA/TensorRT-LLM/discussions/424#discussioncomment-7732258
Hi @shannonphu , yes we're working on it. Right now it's at the stage of adding the C++ runtime. Tentative date for Triton enc-dec support is around mid to late January. Thanks for your patience
Does it also include continuous batching?
> Does it also include continuous batching?

Our current plan is to get there in steps: (1) C++ runtime, (2) regular Triton support, (3) continuous batching. Eventually we want to enable continuous batching, but the mid-to-late-January release will most likely include only (1) and (2), with (3) coming right after.
@symphonylyh Could you share if there's an update on this?
Hi, is there an update on this?
Hi @shannonphu, @sihanwang41, @mlmonk, @shixianc, we have been actively working on this support, but we are finding that the amount of work is more than expected, since we want a good implementation that supports enc-dec and, more generally, such 2-stage pipelines.
May I use this thread to collect your feedback so we can understand your needs and prioritize better? I know @sihanwang41 specifically asked about continuous batching, i.e., inflight batching, but others didn't share details of their requests. Can you reply describing which of the following would be helpful and could unblock you first?
(1) A Triton Python backend to run enc-dec models
(2) A C++ runtime (no Triton) to run enc-dec models, without inflight batching
(3) A Triton C++ backend to run enc-dec models, without inflight batching
(4) A Triton C++ backend with paged KV cache and inflight batching for enc-dec <-- final goal
Thanks
@symphonylyh Thanks for the update! Starting with (3) would unblock our team.
May I assume this would also have the classic dynamic batching supported?
Got it, thanks for the input. By dynamic batching, do you mean Triton's dynamic batching, which has nothing to do with the inflight/continuous batching concept? If so, yes.
@symphonylyh (1) and/or (3). I am not super clear on the difference between the Python and C++ backends. I was using this to build the engine: https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/enc_dec/README.md
We have been able to use Triton with enc_dec models, so I'm not sure what the difference between that and (1) is. We find that the TPS for that implementation is quite slow and are looking for ways to make it faster.
Agree that the end goal is (3).
@mlmonk Oh interesting, I was under the impression that we just couldn't serve T5 models on Triton yet because the TRT-LLM backend wasn't ready for it.
@symphonylyh @shannonphu We have been able to use Flan-T5 with Triton. I believe this is (1). You can reproduce it here. Note that this uses a much older version of both libraries, from when Flan-T5 was not officially supported.
Like @shixianc mentioned, (3) would unblock us and (4) would be the ideal state. It would be great if you could share how far along you are with the (3) release.
Hey @symphonylyh, do you have any updates on the progress?
@symphonylyh, any progress?
Hello @symphonylyh, is there any progress on any of (1)-(4)?
We would love (1)
Hi @shannonphu, @sihanwang41, @mlmonk, @shixianc, @LuckyL00ser, @XiaobingSuper, @TeamSeshDeadBoy, @mrmuke
As part of today's release #1725, the enc-dec C++ runtime has been successfully implemented with inflight batching and paged KV cache. Please give it a try following the C++ runtime section of the README. This directly corresponds to (4) above, with the Triton backend being added next.
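For anyone wanting to try this from Python before the Triton backend lands, here is a minimal sketch of driving the enc-dec runtime via `ModelRunnerCpp`. The engine path, the `is_enc_dec` flag, and the `encoder_input_ids` keyword are assumptions based on the enc-dec example; check the enc-dec README (examples/enc_dec) for the exact interface in your TensorRT-LLM version.

```python
# A minimal sketch (not the official example) of running an enc-dec engine
# through the Python binding of the C++ runtime. Names marked "assumption"
# may differ in your TensorRT-LLM version.
import torch
from transformers import AutoTokenizer
from tensorrt_llm.runtime import ModelRunnerCpp

tokenizer = AutoTokenizer.from_pretrained("t5-small")  # example model choice

runner = ModelRunnerCpp.from_dir(
    engine_dir="trt_engines/t5-small",  # hypothetical path with encoder/ and decoder/ sub-dirs
    is_enc_dec=True,                    # assumption: flag enabling enc-dec mode
)

encoder_ids = tokenizer(
    "translate English to German: The house is wonderful.",
    return_tensors="pt",
).input_ids[0].to(torch.int32)

# T5 decoding starts from the pad token (decoder_start_token_id == pad_token_id).
decoder_start = torch.tensor([tokenizer.pad_token_id], dtype=torch.int32)

outputs = runner.generate(
    batch_input_ids=[decoder_start],
    encoder_input_ids=[encoder_ids],    # assumption: kwarg for the encoder pass
    max_new_tokens=64,
    end_id=tokenizer.eos_token_id,
    pad_id=tokenizer.pad_token_id,
)
print(tokenizer.decode(outputs[0][0], skip_special_tokens=True))
```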
Our roadmap for the near future:
- Triton C++ backend is almost ready and to be released soon
- Multi-GPU support
Thanks for the update! This is excellent news, I'm sure it was a lot of effort to make it happen.
Hello @symphonylyh, is there any progress on adding (1)?
@HamzaG737 it's fully supported now. For (1), the Triton backend, you can follow the guide here: https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/docs/encoder_decoder.md.
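As a quick illustration, a minimal client call against a deployment set up per that guide could look like the sketch below. The model name (`ensemble`) and tensor names (`text_input`, `max_tokens`, `text_output`) are assumptions based on the standard tensorrtllm_backend ensemble; check the config.pbtxt files generated for your deployment.

```python
# A minimal sketch of querying the enc-dec Triton deployment over HTTP.
# Model and tensor names are assumptions; adjust them to match your config.pbtxt.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

text = np.array([["translate English to German: The house is wonderful."]],
                dtype=object)
max_tokens = np.array([[64]], dtype=np.int32)

inputs = [
    httpclient.InferInput("text_input", list(text.shape), "BYTES"),
    httpclient.InferInput("max_tokens", list(max_tokens.shape), "INT32"),
]
inputs[0].set_data_from_numpy(text)
inputs[1].set_data_from_numpy(max_tokens)

result = client.infer(model_name="ensemble", inputs=inputs)
print(result.as_numpy("text_output"))
```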
Also, closing this issue as support has been added.