
T5 Model with TensorRT - Runtime on GPU

Open GenVr opened this issue 2 years ago • 14 comments

Hi, in reference to #10807, I'm trying to run my own T5-base model on TensorRT. My model has max_length=1024, and with TensorRT (I built the TensorRT container) I get worse inference times than with PyTorch. Maybe this is because TensorRT does not seem to be worthwhile for lengths greater than 256. Can you confirm this?

For more info: I'm using a TensorRT container for TRT, with only this "setting up" code and then this code for conversion and inference.

Are these the most up-to-date versions of the code? Thanks.

GenVr avatar Mar 10 '22 08:03 GenVr

Yes, this happened with me as well. It's an internal issue of TensorRT. I even tried with the latest 8.4 release. For 512 tokens the model was more than 2x slower than the torch model.

VikasOjha666 avatar Mar 10 '22 13:03 VikasOjha666

The decoder exported to ONNX and reused by TensorRT is the one without caching (of the K, V representations). The Hugging Face version enables caching by default. On short sequence lengths, TRT is faster because PyTorch is both memory bound (it takes time to read the cache) and has some overhead. On longer sequence lengths, the time to recompute the K, V representations for all tokens dominates the TRT computation, whereas Hugging Face only computes them for the last token, whatever the sequence length / batch size. That's why TRT is slower on long sequence lengths. It's not really TRT's fault: exporting from PyTorch to ONNX is done with tracing, and tracing removes any dynamic behavior... like using cached values.

pommedeterresautee avatar May 05 '22 10:05 pommedeterresautee
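To make the caching effect described above concrete, here is a minimal timing sketch on the PyTorch side (assuming the `transformers` and `torch` packages, a CUDA GPU, and an illustrative `t5-base` checkpoint and prompt, none of which come from this thread); `use_cache=False` approximates what a traced, cache-free ONNX/TRT decoder does at every generation step:

```python
# Minimal sketch: compare greedy generation with and without the K/V cache.
# Checkpoint, prompt, and token counts are illustrative, not from the thread.
import time

import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base").eval().cuda()

text = "summarize: " + "the cat sat on the mat. " * 40
inputs = tokenizer(text, return_tensors="pt").to("cuda")

for use_cache in (True, False):
    torch.cuda.synchronize()
    start = time.perf_counter()
    with torch.inference_mode():
        # use_cache=False forces the decoder to recompute the K/V
        # representations of every previous token at each step, which is
        # what a traced (cache-free) ONNX/TRT decoder ends up doing.
        model.generate(**inputs, max_new_tokens=256, use_cache=use_cache)
    torch.cuda.synchronize()
    print(f"use_cache={use_cache}: {time.perf_counter() - start:.2f} s")
```

The gap between the two timings grows with the number of generated tokens, which matches the long-sequence slowdown reported above.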

@pommedeterresautee You mean that in the TensorRT repo they export the model without past key values, and that's the reason, right?

VikasOjha666 avatar May 05 '22 11:05 VikasOjha666

Yes, that's what I mean.

pommedeterresautee avatar May 05 '22 11:05 pommedeterresautee

@pommedeterresautee so there is no solution for now, right?

GenVr avatar May 09 '22 07:05 GenVr

I just released a way to get caching on TensorRT:

https://github.com/ELS-RD/transformer-deploy/blob/main/demo/generative-model/t5_tensorrt.py

It's a TRT version of the ORT notebook in the same folder, which contains the explanations.

pommedeterresautee avatar May 27 '22 11:05 pommedeterresautee

We are currently working on adding support for KV-cache to GPT2/T5 demo. Please stay tuned. Thanks for your patience.

nvpohanh avatar May 27 '22 11:05 nvpohanh

That's awesome! It has been quite painful to make it work on TRT because of the limitation on the If node (both branches must have the same output shapes). We have a plan to clean things up, but it would probably be better if done by the TRT maintainers.

May I ask whether it will support any decoding algorithm, like the new constrained beam search (https://huggingface.co/blog/constrained-beam-search)? (And the many decoding methods published recently.)

pommedeterresautee avatar May 27 '22 13:05 pommedeterresautee
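For reference, the constrained beam search mentioned above is exposed in `transformers` through the `force_words_ids` argument of `generate`. A minimal sketch, adapted from the linked blog post (the checkpoint, prompt, and forced word are illustrative):

```python
# Minimal constrained-beam-search sketch, assuming the `transformers` package.
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

inputs = tokenizer("translate English to German: How old are you?",
                   return_tensors="pt")

# Force the token ids of "Sie" to appear somewhere in the generated output.
force_words_ids = tokenizer(["Sie"], add_special_tokens=False).input_ids

outputs = model.generate(
    **inputs,
    num_beams=5,                      # constrained decoding requires beam search
    force_words_ids=force_words_ids,
    max_new_tokens=32,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```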

@pommedeterresautee Thanks for sharing sir.

VikasOjha666 avatar May 27 '22 13:05 VikasOjha666

Any updates @nvpohanh? KV-cache for this implementation would be a game changer for us.

michaelroyzen avatar Jun 17 '22 02:06 michaelroyzen

> We are currently working on adding support for KV-cache to GPT2/T5 demo. Please stay tuned. Thanks for your patience.

Thanks! Could I know the current status of KV-cache support for GPT/T5?

lanwuwei avatar Jun 30 '22 06:06 lanwuwei

It's still in progress. Thanks

nvpohanh avatar Jun 30 '22 07:06 nvpohanh

Checking in again @nvpohanh. Do you have any ETA for KV-cache support?

michaelroyzen avatar Aug 20 '22 19:08 michaelroyzen

@nvpohanh Any luck with the KV-cache support? I could help if I get proper contextual info.

pngmafia avatar Sep 19 '22 11:09 pngmafia

KV-cache has been added to BART: https://github.com/NVIDIA/TensorRT/tree/main/demo/HuggingFace#how-to-run-with-k-v-cache

We are migrating it to other models as well.

nvpohanh avatar Dec 02 '22 10:12 nvpohanh

The note from the KV-cache implementation on BART states: "Note: current implementation of K-V cache does not exhibit performance gain over the non K-V cache TensorRT version. Please consider to skip the K-V cache experiment for now."

If this is also the case with the T5 KV-cache, then I don't think it will help with the OP's original issue.

zoltan-fedor avatar Feb 03 '23 02:02 zoltan-fedor

Based on our internal measurements with the upcoming TRT 8.6 release, using KV-cache does result in pretty good performance gain. Stay tuned! :)

nvpohanh avatar Feb 03 '23 03:02 nvpohanh

We are very much looking forward to that! Hopefully that also applies to the scenario of the OP - large inputs to T5 models.

zoltan-fedor avatar Feb 03 '23 03:02 zoltan-fedor

What is the timeline of the 8.6 release?

lakshaykc avatar Feb 03 '23 09:02 lakshaykc

While waiting for this update, we started using NVIDIA's FasterTransformer library instead. It has a highly optimized T5 GPU runtime with KV cache supported and it's 5-10x faster than running Huggingface + PyTorch on equivalent hardware. Interesting to see how this compares.

michaelroyzen avatar Feb 04 '23 00:02 michaelroyzen

> While waiting for this update, we started using NVIDIA's FasterTransformer library instead. It has a highly optimized T5 GPU runtime with KV cache supported and it's 5-10x faster than running Huggingface + PyTorch on equivalent hardware. Interesting to see how this compares.

5-10x faster than running Huggingface + PyTorch? That is major. Does that kind of performance improvement hold even for larger inputs to the T5 model, like hundreds of tokens?

zoltan-fedor avatar Feb 04 '23 00:02 zoltan-fedor

> While waiting for this update, we started using NVIDIA's FasterTransformer library instead. It has a highly optimized T5 GPU runtime with KV cache supported and it's 5-10x faster than running Huggingface + PyTorch on equivalent hardware. Interesting to see how this compares.

Did you see any drop in performance? I'm seeing a big drop with FasterTransformer + Triton Server.

lakshaykc avatar Feb 06 '23 07:02 lakshaykc

@lakshaykc Is the model performing worse on larger sequences?

VikasOjha666 avatar Feb 06 '23 07:02 VikasOjha666

Yes, I've only been working with larger sequences.

lakshaykc avatar Feb 06 '23 08:02 lakshaykc

Does TensorRT / FasterTransformer support Flan-T5-XL? @nvpohanh

nimsala1234 avatar Mar 08 '23 12:03 nimsala1234

@michaelroyzen Did you see a huge drop in performance?

nimsala1234 avatar Mar 08 '23 12:03 nimsala1234

Hi @nvpohanh, has there been any update since your last message?

fxmarty avatar Mar 29 '23 18:03 fxmarty

The KV-cache has been added to the TRT 8.6 EA release, so you should already see a decent perf improvement for FP32. As for FP16, we are still debugging some perf issues, and they are expected to be fixed in the TRT 8.6 GA release. Thanks

nvpohanh avatar Mar 30 '23 01:03 nvpohanh

Hi @nvpohanh, has FP16 KV-cache been added to the TRT 8.6 GA release (TensorRT OSS 8.6.1)?

lingffff avatar May 11 '23 14:05 lingffff

Yes, you should be able to see FP16 kv-cache with fused MHA now.

nvpohanh avatar May 15 '23 05:05 nvpohanh
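For anyone who wants to try this, here is a minimal sketch of enabling FP16 when building an engine with the TensorRT Python API (assuming TensorRT >= 8.6 and an already exported ONNX decoder; the file names are illustrative, and a real decoder export with dynamic shapes additionally needs optimization profiles for each dynamic input, which are omitted here):

```python
# Minimal FP16 engine-build sketch, assuming TensorRT >= 8.6 Python bindings
# and a statically shaped ONNX export; file names are illustrative.
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

with open("t5_decoder_with_kv_cache.onnx", "rb") as f:  # illustrative path
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("Failed to parse the ONNX model")

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)  # enable FP16 kernels, incl. fused MHA

serialized_engine = builder.build_serialized_network(network, config)
with open("t5_decoder_fp16.plan", "wb") as f:
    f.write(serialized_engine)
```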