
feat: TorchAO floating point quantization

Open · intervitens opened this issue 1 year ago · 2 comments

This PR adds a custom floating point quantization method powered by TorchAO, which achieves high throughput thanks to the optimized fp6_llm kernel.

Use `-q torchao --torchao-fp-bits 6` to load an FP16 model and convert it at runtime to the fp6_e2m3 format.
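
For reference, here is a minimal standalone sketch of the same kind of conversion using TorchAO directly. It assumes a torchao release that exports `quantize_` and `fpx_weight_only`; the PR wires an equivalent conversion into the engine's weight-loading path rather than using this snippet verbatim:

```python
# Standalone sketch, assuming torchao exposes quantize_/fpx_weight_only.
# Not the PR's integration code.
import torch
from torchao.quantization import quantize_, fpx_weight_only
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",   # any FP16 checkpoint
    torch_dtype=torch.float16,
).cuda()

# fp6_e2m3: sign + 2 exponent bits + 3 mantissa bits per weight element
quantize_(model, fpx_weight_only(2, 3))
```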

Using it with tensor parallelism currently results in degraded outputs.
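
For intuition about the fp6_e2m3 layout mentioned above, here is a hypothetical decoder for a single 6-bit value. The layout convention (exponent bias 1, subnormals at exponent 0, no infinities or NaNs) is an assumption for illustration, not taken from the PR:

```python
def decode_fp6_e2m3(bits: int) -> float:
    """Decode a 6-bit value assumed to be laid out as
    [1 sign | 2 exponent | 3 mantissa] with exponent bias 1."""
    sign = -1.0 if (bits >> 5) & 1 else 1.0
    exp = (bits >> 3) & 0b11
    man = bits & 0b111
    if exp == 0:
        return sign * (man / 8) * 2.0 ** (1 - 1)    # subnormal: 0.mmm
    return sign * (1 + man / 8) * 2.0 ** (exp - 1)  # normal: 1.mmm

# Largest representable magnitude under this convention:
print(decode_fp6_e2m3(0b0_11_111))  # 1.875 * 2**2 = 7.5
```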

intervitens · Jul 29 '24 05:07

Added a splitK calculation. On a single 3090 Ti, this speeds up a GSM8k benchmark run on L3-8B at bs=32 from 8:47 to 7:57, and raises bs=1 throughput from 68 t/s to 93 t/s.
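
For context, split-K divides a GEMM's reduction dimension across extra thread blocks so small-batch decode matmuls can occupy the whole GPU. Below is a purely hypothetical sketch of what such a selection heuristic can look like; the tile sizes, SM count, and function name are invented for illustration, not this PR's actual code:

```python
import math

def choose_split_k(m: int, n: int, k: int, num_sms: int = 84) -> int:
    # Hypothetical heuristic (84 SMs ~ RTX 3090 Ti): if the m x n output
    # tile grid can't fill every SM, split the K reduction across more
    # thread blocks, whose partial sums are combined in a second pass.
    tiles = math.ceil(m / 16) * math.ceil(n / 128)  # assumed tile shape
    split_k = 1
    while tiles * split_k * 2 <= num_sms and k % (split_k * 2) == 0:
        split_k *= 2
    return split_k

# A bs=1 decode GEMM produces few output tiles, so splitting K helps:
print(choose_split_k(m=1, n=4096, k=4096))  # -> 2 under these assumptions
```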

intervitens · Jul 29 '24 10:07

I couldn't cleanly merge my changes with your latest commit; I'll add the splitK myself in a bit (this is still a WIP).

AlpinDale · Jul 29 '24 16:07