text-generation-inference
TransformerEngine FP8 speedup
Feature request
Please help me implement the speedup offered by TransformerEngine on Hopper H100 GPUs:
https://github.com/NVIDIA/TransformerEngine https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/index.html
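For illustration, here is a minimal sketch of what FP8 usage with TransformerEngine looks like. The layer size and recipe settings are just assumptions on my part, not TGI code:

```python
# Minimal sketch of FP8 inference with TransformerEngine on an H100.
# Layer size and recipe settings are illustrative assumptions, not TGI code.
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

# A TE Linear can run its GEMM in FP8 on Hopper tensor cores.
linear = te.Linear(4096, 4096, bias=True, params_dtype=torch.bfloat16).cuda()

# Delayed-scaling recipe; E4M3 is the usual choice for the forward pass.
fp8_recipe = recipe.DelayedScaling(fp8_format=recipe.Format.E4M3)

x = torch.randn(8, 4096, device="cuda", dtype=torch.bfloat16)
with torch.no_grad(), te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    y = linear(x)
```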
Motivation
Inference speedup
Your contribution
I am working on bringing TE support to lit-llama. However, as this is difficult for me, I would be very willing to work on a PR once a solid base is in place.
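For context, the kind of integration I have in mind is roughly the following module swap (a sketch under my own assumptions, not actual lit-llama or TGI code):

```python
# Rough sketch of swapping nn.Linear modules for te.Linear in an existing model
# (e.g. lit-llama). The traversal is generic; nothing here is taken from any repo.
import torch
import torch.nn as nn
import transformer_engine.pytorch as te

def replace_linears_with_te(module: nn.Module) -> None:
    for name, child in module.named_children():
        if isinstance(child, nn.Linear):
            te_linear = te.Linear(
                child.in_features,
                child.out_features,
                bias=child.bias is not None,
                params_dtype=child.weight.dtype,
            )
            with torch.no_grad():
                te_linear.weight.copy_(child.weight)
                if child.bias is not None:
                    te_linear.bias.copy_(child.bias)
            setattr(module, name, te_linear)
        else:
            replace_linears_with_te(child)
```

The FP8 GEMMs would then be enabled at call time with te.fp8_autocast, as in the snippet above.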
Thank you very much for helping!
Hi @SinanAkkoyun ,
Whenever we get our hands on some H100 and can debug code for it, we will likely implement.
Hi @Narsil, thank you very much for the reply 😊 There are actually H100 instances available on Lambda Cloud for $2.40/h.
I would be very open to helping out and also covering some of the cloud GPU costs if you'd like, as soon as this is high enough on the priority list.
Thanks for the proposal, it's very kind. We don't need the financial support though. Cheers.
Has there been any development on this? From what I understand, FP8 support is still quite limited in TGI (the docs mention it is not the fastest path due to unpacking and padding). Am I correct in understanding that this would replace the current FP8 implementation, or would it sit alongside it?
If this is otherwise currently low priority, I think I could come up with a draft PR, though I'll admit I'm also a bit out of my depth here. The accelerate implementation would probably be a good point of reference in that case?
Edit: I also noticed that there's work being done on fp8-kv-caching; I assume there's overlap there?
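To make the distinction concrete (this is purely my own understanding, not TGI internals): a weight-only FP8 path still unpacks/dequantizes the weights and runs the matmul in bf16, whereas a TransformerEngine path keeps the GEMM itself in FP8 on Hopper. Roughly:

```python
# Sketch contrasting the two paths; my own illustration, not TGI code.
import torch

def weight_only_fp8_matmul(x: torch.Tensor, w_fp8: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # Weight-only FP8: unpack/dequantize the stored weight to bf16,
    # then run an ordinary bf16 GEMM (this is the cost the docs refer to).
    w = w_fp8.to(torch.bfloat16) * scale
    return x @ w.t()

# A TransformerEngine path (see the fp8_autocast sketch above) would instead feed
# FP8 operands directly to the Hopper tensor cores, avoiding the dequantize step.
```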
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.