aphrodite-engine
feat: TorchAO floating point quantization
This PR adds a custom floating point quantization method powered by TorchAO, which achieves high throughput thanks to the optimized `fp6_llm` kernel.
Use `-q torchao --torchao-fp-bits 6` to load an FP16 model and convert it at runtime to the fp6_e2m3 format.
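For reference, fp6_e2m3 packs each value into six bits: one sign bit, two exponent bits, and three mantissa bits. The sketch below enumerates the representable values under common e2m3 conventions (exponent bias of 1, gradual underflow, no Inf/NaN encodings); those conventions are an assumption on my part, not something the PR confirms.

```python
def decode_fp6_e2m3(bits: int) -> float:
    """Decode one 6-bit e2m3 pattern: 1 sign, 2 exponent, 3 mantissa bits.
    Assumes exponent bias 1, subnormals at exp == 0, and no Inf/NaN
    encodings (common e2m3 conventions, not confirmed by the PR)."""
    sign = -1.0 if (bits >> 5) & 0x1 else 1.0
    exp = (bits >> 3) & 0x3  # 2 exponent bits
    man = bits & 0x7         # 3 mantissa bits
    if exp == 0:             # subnormal: no implicit leading 1
        return sign * (man / 8.0)
    return sign * (1.0 + man / 8.0) * 2.0 ** (exp - 1)

# Under these assumptions, magnitudes run from 0.125 (smallest subnormal)
# to 7.5, so quantization pairs these codes with scale factors to cover
# the range of FP16 weights.
print(sorted({decode_fp6_e2m3(b) for b in range(64)}))
```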
Using it with tensor parallelism currently results in degraded outputs.
Added a splitK calculation, which speeds up a GSM8k benchmark run on L3-8B with bs=32 on a single 3090 Ti from 8:47 to 7:57 and raises bs=1 throughput from 68 t/s to 93 t/s.
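For context, split-K is a standard GEMM technique for small batch sizes: at bs=1 there are too few output tiles to occupy all of the GPU's SMs, so the kernel also partitions the reduction (K) dimension across extra thread blocks and sums the partial results afterwards. Here is a minimal NumPy sketch of the idea; the PR's actual splitK calculation targets the CUDA kernel and is not reproduced here.

```python
import numpy as np

def matmul_split_k(a: np.ndarray, b: np.ndarray, split_k: int) -> np.ndarray:
    """Reference split-K matmul: slice the reduction (K) dimension into
    `split_k` chunks, compute one partial product per chunk, then reduce.
    On a GPU each chunk maps to additional thread blocks, which restores
    occupancy when the batch dimension is too small to fill the device."""
    k = a.shape[1]
    bounds = np.linspace(0, k, split_k + 1, dtype=int)
    partials = [a[:, s:e] @ b[s:e, :] for s, e in zip(bounds[:-1], bounds[1:])]
    return np.sum(partials, axis=0)

a = np.random.randn(1, 4096)     # bs=1 activation row
b = np.random.randn(4096, 4096)  # weight matrix
assert np.allclose(matmul_split_k(a, b, split_k=4), a @ b)
```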
I couldn't cleanly merge my changes with your latest commit; I'll add the splitK myself in a bit (this is still a WIP).