[Feature] Support 8-bit Quantization of Attention with SageAttention
Checklist
- [X] 1. If the issue you raised is not a feature but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose. Otherwise, it will be closed.
- [X] 2. Please use English, otherwise it will be closed.
Motivation
As https://github.com/thu-ml/SageAttention reports, quantized 8-bit attention can speed up inference by roughly 2x or more while preserving accuracy, so shall we give it a try or run some verification?
Related resources
GitHub: https://github.com/thu-ml/SageAttention
Contributions are welcome.
I'd like to work on this issue. This would be my first open-source contribution, and I'm eager to learn and help out. Please let me know if you have any specific guidelines I should follow.
@PratishthaGaur Please see https://sgl-project.github.io/references/contributor_guide.html, then send a PR and we can review it.
Any progress? If not, @Qiaolin-Yu and I would like to take it. @zhaochenyang20
@Beichen-Ma Great! Go ahead and give it a try.
@zhaochenyang20 I can take a first pass if it's still open.
I'm planning to integrate SageAttention's 8-bit path behind a runtime flag so we can benchmark and verify correctness before enabling it by default.
Plan:
- Wrap SageAttention's 8-bit kernels behind `--quant_attention=8bit` (default: off); a rough sketch of the dispatch follows this list.
- Add unit tests comparing logits and perplexity against the FP16 baseline; a minimal output check is sketched below as well.
- Benchmark throughput on A100 (and optionally RTX 4090), measuring tokens/sec, memory footprint, and activation precision drift.
- Report accuracy deltas on short and long contexts to validate Sage's "≈ same accuracy" claim.

Once validated, we could expose it through sglang serve as an optional backend flag for inference-only workloads.
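To make the first item concrete, here is a rough sketch of the dispatch I have in mind. The `attention_forward` helper and the `quant_attention` setting are placeholders for wherever sglang's attention backend selection actually lives; the `sageattn` call follows the drop-in usage shown in the SageAttention README, but the exact signature should be verified against the installed version.

```python
# Sketch only: gate SageAttention's INT8 kernel behind a runtime flag.
# `quant_attention` is the proposed (not yet existing) option from the plan above.
import torch.nn.functional as F

try:
    # Provided by the SageAttention package (pip install sageattention).
    from sageattention import sageattn
    SAGE_AVAILABLE = True
except ImportError:
    SAGE_AVAILABLE = False


def attention_forward(q, k, v, is_causal: bool, quant_attention: str = "off"):
    """Use SageAttention's 8-bit kernel when requested, else fall back to FP16 SDPA."""
    if quant_attention == "8bit" and SAGE_AVAILABLE:
        # The README example passes q/k/v as (batch, heads, seq, head_dim)
        # with tensor_layout="HND"; double-check before relying on it.
        return sageattn(q, k, v, tensor_layout="HND", is_causal=is_causal)
    return F.scaled_dot_product_attention(q, k, v, is_causal=is_causal)
```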
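For the correctness piece, a minimal check against the FP16 SDPA baseline on random inputs could look like the following; the cosine-similarity threshold is an illustrative placeholder, and the real tests would also cover logits and perplexity on an actual model as listed above.

```python
# Sketch of a correctness check: SageAttention output vs. FP16 SDPA baseline.
# The 0.99 cosine-similarity threshold is a placeholder, not a validated bound.
import torch
import torch.nn.functional as F
from sageattention import sageattn


@torch.no_grad()
def compare_against_fp16(batch=2, heads=32, seq=2048, head_dim=128):
    q, k, v = (torch.randn(batch, heads, seq, head_dim,
                           dtype=torch.float16, device="cuda")
               for _ in range(3))

    ref = F.scaled_dot_product_attention(q, k, v, is_causal=True)
    out = sageattn(q, k, v, tensor_layout="HND", is_causal=True)

    max_diff = (out.float() - ref.float()).abs().max().item()
    cos_sim = F.cosine_similarity(out.float().flatten(),
                                  ref.float().flatten(), dim=0).item()
    print(f"max abs diff: {max_diff:.4f}, cosine similarity: {cos_sim:.6f}")
    assert cos_sim > 0.99, "quantized attention drifted too far from FP16"


if __name__ == "__main__":
    compare_against_fp16()
```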
Does that direction align with what maintainers expect before submitting a PR?
@vyalamar Great plan. Please go ahead!