TensorRT
T5 Model with TensorRT - Runtime on GPU
Hi,
In reference to #10807,
I'm trying to use my own T5-base model on TensorRT. My model has max_length=1024, and with TensorRT (I built the TensorRT container) I get worse latency than with PyTorch. Maybe this is because TensorRT does not seem worthwhile for sequence lengths greater than 256. Can you confirm this?
For more info: I'm using a TensorRT container for TRT, only this "setting up" code and then this code for conversion and inference.
Are these the most up-to-date versions of the code? Thanks.
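For reference, a quick way to confirm which versions the container actually ships (a minimal sketch, assuming the TensorRT Python bindings, PyTorch, and transformers are all installed in the container):

```python
# Print library and device versions so the TRT-vs-PyTorch comparison is made
# against known, current releases.
import tensorrt
import torch
import transformers

print("TensorRT:    ", tensorrt.__version__)
print("PyTorch:     ", torch.__version__)
print("Transformers:", transformers.__version__)
if torch.cuda.is_available():
    print("GPU:         ", torch.cuda.get_device_name(0))
```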
Yes, this happened with me as well. It's an internal issue of TensorRT. I even tried the latest 8.4 release. For 512 tokens the model was more than 2x slower than the torch model.
The decoder exported to ONNX and reused by TensorRT is the one without caching (of the K, V representations). The Hugging Face version enables caching by default. On short sequence lengths, TRT is faster because PyTorch is both memory bound (it takes time to read the cache) and has some overhead. On longer sequence lengths, the time to recompute the K, V representations for all tokens dominates the TRT computation, whereas Hugging Face only computes the last token, whatever the sequence length / batch size. That's why TRT is slower on long sequences; it's not really TRT's fault, but rather the fact that exporting to ONNX from PyTorch is done with tracing, and tracing deletes any dynamic behavior... like using cached values.
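To make the caching effect concrete, here is a minimal sketch (plain Hugging Face + PyTorch with t5-base assumed, no TRT involved) that contrasts generation with and without the K/V cache; the no-cache path is roughly what a traced, cache-free decoder has to pay at every step:

```python
import time

import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base").to(device).eval()

# A longer input so the cost of recomputing K/V for all positions is visible.
text = "summarize: " + "the quick brown fox jumps over the lazy dog " * 40
inputs = tokenizer(text, return_tensors="pt").to(device)

def timed_generate(use_cache: bool) -> float:
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    with torch.inference_mode():
        model.generate(**inputs, max_length=256, num_beams=1, use_cache=use_cache)
    if device == "cuda":
        torch.cuda.synchronize()
    return time.perf_counter() - start

print(f"with K/V cache:    {timed_generate(True):.3f} s")
print(f"without K/V cache: {timed_generate(False):.3f} s")
```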
@pommedeterresautee You mean that in the TensorRT repo they export the model without past key values, and that's the reason, right?
Yes that's what I mean
@pommedeterresautee so there is no solution for now, right?
Just released a way to get caching on TensorRT:
https://github.com/ELS-RD/transformer-deploy/blob/main/demo/generative-model/t5_tensorrt.py
It's a TRT version of the ORT notebook in the same folder, which contains the explanations.
We are currently working on adding support for KV-cache to GPT2/T5 demo. Please stay tuned. Thanks for your patience.
That's awesome! It has been quite painful to make it work on TRT because of the limitation on the If node (both branches must have the same output shapes). We have a plan to clean things up, but it would definitely be better if done by the TRT maintainers.
May I ask whether it will support decoding algorithms like the new constrained beam search (https://huggingface.co/blog/constrained-beam-search)? (And the many decoding methods published recently.)
@pommedeterresautee Thanks for sharing sir.
Any updates @nvpohanh? KV-cache for this implementation would be a game changer for us.
We are currently working on adding support for KV-cache to GPT2/T5 demo. Please stay tuned. Thanks for your patience.
Thanks! Could I know the current status of KV-cache support for GPT/T5?
It's still in progress. Thanks
Checking in again @nvpohanh. Do you have any ETA for KV-cache support?
@nvpohanh Any luck with the KV-cache support? I could help if I get the proper contextual info.
KV-cache has been added to BART: https://github.com/NVIDIA/TensorRT/tree/main/demo/HuggingFace#how-to-run-with-k-v-cache
We are migrating it to other models as well.
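For anyone wondering what "running with K-V cache" means at the framework level, here is a minimal greedy-decoding sketch in plain Hugging Face/PyTorch (t5-small assumed, not the TRT engine itself): each step feeds only the newly generated token plus the cached key/value tensors, which is the pattern a K-V cache engine has to reproduce through its input/output bindings.

```python
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

model = T5ForConditionalGeneration.from_pretrained("t5-small").eval()
tokenizer = T5Tokenizer.from_pretrained("t5-small")

enc = tokenizer("translate English to German: The cat sat on the mat.",
                return_tensors="pt")

decoder_input = torch.tensor([[model.config.decoder_start_token_id]])
past = None
generated = []
with torch.inference_mode():
    encoder_out = model.get_encoder()(**enc)  # encoder runs once
    for _ in range(32):
        out = model(encoder_outputs=encoder_out,
                    attention_mask=enc["attention_mask"],
                    decoder_input_ids=decoder_input,
                    past_key_values=past,
                    use_cache=True)
        next_id = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        past = out.past_key_values      # cache grows by one position per step
        decoder_input = next_id         # only the new token is fed next step
        generated.append(next_id.item())
        if next_id.item() == model.config.eos_token_id:
            break

print(tokenizer.decode(generated, skip_special_tokens=True))
```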
The note from the KV-cache implementation on BART states: "Note: current implementation of K-V cache does not exhibit performance gain over the non K-V cache TensorRT version. Please consider to skip the K-V cache experiment for now."
If this will be the case with the T5 KV-cache, then I don't think that will help with the original issue of the OP.
Based on our internal measurements with the upcoming TRT 8.6 release, using KV-cache does result in pretty good performance gain. Stay tuned! :)
We are very much looking forward to that! Hopefully that also applies to the scenario of the OP - large inputs to T5 models.
What is the timeline of the 8.6 release?
While waiting for this update, we started using NVIDIA's FasterTransformer library instead. It has a highly optimized T5 GPU runtime with KV-cache support, and it's 5-10x faster than running Hugging Face + PyTorch on equivalent hardware. Interesting to see how this compares.
5-10x faster than running Hugging Face + PyTorch? That is major. Does that kind of performance improvement hold even for larger inputs to the T5 model, like hundreds of tokens?
Did you see any drop in performance? I'm seeing a big drop with FasterTransformer + Triton Server.
@lakshaykc Is the model performing worse on larger sequences?
Yes, I've only been working with larger sequences.
Does TensorRT / FasterTransformer support Flan-T5-XL? @nvpohanh
@michaelroyzen Did you see a huge drop in performance?
Hi @nvpohanh, has there been any update since your last message?
The KV-cache has been added to the TRT 8.6 EA release, so you should already see a decent perf improvement for FP32. As for FP16, we are still debugging some perf issues, and they are expected to be fixed in the TRT 8.6 GA release. Thanks
Hi @nvpohanh, has FP16 KV-cache been added to the TRT 8.6 GA release (TensorRT OSS 8.6.1)?
Yes, you should be able to see FP16 kv-cache with fused MHA now.
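For completeness, here is a minimal sketch of requesting FP16 at engine-build time with the TensorRT Python API. The ONNX file name is hypothetical, and this assumes the exported decoder uses static shapes (dynamic shapes additionally need an optimization profile); the fused MHA kernels are selected by the builder when the pattern and precision allow it.

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

# "t5_decoder_kv.onnx" is a placeholder for your exported decoder.
with open("t5_decoder_kv.onnx", "rb") as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("ONNX parse failed")

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)  # request FP16 kernels

serialized = builder.build_serialized_network(network, config)
if serialized is None:
    raise RuntimeError("Engine build failed")
with open("t5_decoder_kv_fp16.engine", "wb") as f:
    f.write(serialized)
```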