text-generation-inference
TransformerEngine FP8 speedup
Feature request
Please help me implement the speedup offered by TransformerEngine on Hopper H100 GPUs:
https://github.com/NVIDIA/TransformerEngine https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/index.html
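For illustration, here is a minimal sketch of what FP8 usage with TransformerEngine looks like. The layer size and recipe settings are just assumptions on my part, not TGI code:

```python
# Minimal sketch of FP8 inference with TransformerEngine on an H100.
# Layer size and recipe settings are illustrative assumptions, not TGI code.
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

# A TE Linear can run its GEMM in FP8 on Hopper tensor cores.
linear = te.Linear(4096, 4096, bias=True, params_dtype=torch.bfloat16).cuda()

# Delayed-scaling recipe; E4M3 is the usual choice for the forward pass.
fp8_recipe = recipe.DelayedScaling(fp8_format=recipe.Format.E4M3)

x = torch.randn(8, 4096, device="cuda", dtype=torch.bfloat16)
with torch.no_grad(), te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    y = linear(x)
```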
Motivation
Inference speedup
Your contribution
I am working on bringing TE support to lit-llama. However, as this is difficult for me, I would be very willing to work on a PR once a solid base is in place.
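For context, the kind of integration I have in mind is roughly the following module swap (a sketch under my own assumptions, not actual lit-llama or TGI code):

```python
# Rough sketch of swapping nn.Linear modules for te.Linear in an existing model
# (e.g. lit-llama). The traversal is generic; nothing here is taken from any repo.
import torch
import torch.nn as nn
import transformer_engine.pytorch as te

def replace_linears_with_te(module: nn.Module) -> None:
    for name, child in module.named_children():
        if isinstance(child, nn.Linear):
            te_linear = te.Linear(
                child.in_features,
                child.out_features,
                bias=child.bias is not None,
                params_dtype=child.weight.dtype,
            )
            with torch.no_grad():
                te_linear.weight.copy_(child.weight)
                if child.bias is not None:
                    te_linear.bias.copy_(child.bias)
            setattr(module, name, te_linear)
        else:
            replace_linears_with_te(child)
```

The FP8 GEMMs would then be enabled at call time with te.fp8_autocast, as in the snippet above.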
Thank you very much for helping!
Hi @SinanAkkoyun ,
Whenever we get our hands on some H100 and can debug code for it, we will likely implement.
Hi @Narsil, thank you very much for the reply 😊 There are actually H100 instances available on Lambda Cloud for $2.40/h.
I would be very open to helping out and also covering some of the cloud GPU costs if you'd like, as soon as this is high enough on the priority list.
Thanks for the proposal, it's very kind. We don't need the financial support though. Cheers.
Has there been any development on this? From what I understand, FP8 support is still quite limited in TGI (the docs mention it is not the fastest path due to unpacking and padding). Am I correct in understanding that this would replace the current FP8 implementation, or would it sit alongside it?
If this is otherwise currently low priority, I think I could come up with a draft PR, though I'll admit I'm also a bit out of my depth here. The accelerate implementation would probably be a good point of reference in that case?
Edit: I also noticed that there's work being done on fp8-kv-caching; I assume there's overlap there?
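To make the distinction concrete (this is purely my own understanding, not TGI internals): a weight-only FP8 path still unpacks/dequantizes the weights and runs the matmul in bf16, whereas a TransformerEngine path keeps the GEMM itself in FP8 on Hopper. Roughly:

```python
# Sketch contrasting the two paths; my own illustration, not TGI code.
import torch

def weight_only_fp8_matmul(x: torch.Tensor, w_fp8: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # Weight-only FP8: unpack/dequantize the stored weight to bf16,
    # then run an ordinary bf16 GEMM (this is the cost the docs refer to).
    w = w_fp8.to(torch.bfloat16) * scale
    return x @ w.t()

# A TransformerEngine path (see the fp8_autocast sketch above) would instead feed
# FP8 operands directly to the Hopper tensor cores, avoiding the dequantize step.
```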
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.