bitsandbytes
I wonder how this compares, and whether you could leverage the NVIDIA Transformer Engine.
In the 2023 Hopper architecture:

> The Transformer Engine intelligently manages and dynamically chooses between FP8 and 16-bit calculations, automatically handling re-casting and scaling between FP8 and 16-bit in each layer to deliver up to 9x faster AI training and up to 30x faster AI inference speedups on large language models compared to the prior generation A100.
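The per-layer re-casting and scaling described above can be illustrated with a small numpy sketch. This is only a model of the dynamic-range handling (track the tensor's `amax`, derive a scaling factor into FP8 range, clamp, then rescale back for the next 16-bit layer); it does not reproduce the Transformer Engine's actual kernels or the E4M3 bit layout, and the function name `fp8_cast` is a hypothetical helper, not a real API.

```python
import numpy as np

# E4M3 is one of the two FP8 formats used on Hopper; its largest
# representable magnitude is 448. We model only the dynamic range here,
# not mantissa rounding, so the round trip is exact within the clamp.
E4M3_MAX = 448.0

def fp8_cast(x, amax):
    """Scale a 16/32-bit tensor into E4M3 range and back (round trip)."""
    scale = E4M3_MAX / amax                          # scaling factor from tracked amax
    x_fp8 = np.clip(x * scale, -E4M3_MAX, E4M3_MAX)  # cast down, clamped to FP8 range
    return x_fp8 / scale                             # rescale for the next 16-bit layer

x = np.array([0.5, -3.2, 7.9], dtype=np.float32)
amax = np.abs(x).max()  # real engines track an amax history per layer
y = fp8_cast(x, amax)
```

Values within the tracked `amax` survive the round trip; anything beyond the scaled range is clamped, which is why choosing the scaling factor well matters.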
The Transformer Engine would be a perfect fit for the LLM.int8() algorithm. However, at this point, not enough details are known to say how big the advantage would be. Once all the details are released, we will assess how it can be integrated into bitsandbytes.
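For context on why the fit is natural: LLM.int8() uses a mixed-precision decomposition, where activation columns containing outliers stay in 16-bit while everything else goes through vector-wise int8 quantization. The numpy sketch below illustrates that decomposition under stated assumptions (the outlier threshold of 6.0 follows the LLM.int8() paper; `int8_matmul_mixed` is an illustrative name, not bitsandbytes' actual code):

```python
import numpy as np

def int8_matmul_mixed(X, W, threshold=6.0):
    """Illustrative LLM.int8()-style matmul: outlier columns in float,
    the rest via vector-wise int8 quantization."""
    # Columns of X that contain an outlier magnitude stay in 16-bit.
    outlier_cols = np.any(np.abs(X) >= threshold, axis=0)

    # Vector-wise quantization of the non-outlier part:
    Xq = X[:, ~outlier_cols]
    Wq = W[~outlier_cols, :]
    sx = np.maximum(np.abs(Xq).max(axis=1, keepdims=True), 1e-8) / 127.0  # row scales
    sw = np.maximum(np.abs(Wq).max(axis=0, keepdims=True), 1e-8) / 127.0  # column scales
    Xi = np.round(Xq / sx).astype(np.int8)
    Wi = np.round(Wq / sw).astype(np.int8)

    # Int8 matmul accumulated in int32, then dequantized with the outer
    # product of the row/column scales.
    int8_part = (Xi.astype(np.int32) @ Wi.astype(np.int32)) * sx * sw
    # Outlier columns are multiplied in full precision.
    fp16_part = X[:, outlier_cols] @ W[outlier_cols, :]
    return int8_part + fp16_part
```

Hopper's FP8 paths target the same trade-off (low-bit matmuls with per-tensor scaling), which is why the two approaches could complement each other.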
Hi @TimDettmers, is this something that would be worth looking into again? I'm out of my league here, but willing to help in any way. FP8 on the 4090 would be big for the community.