
[Feature] Request to 8-bit Quantization of Attention with SageAttention

Open Snowdar opened this issue 1 year ago • 5 comments

Checklist

  • [X] 1. If the issue you raised is not a feature but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose Otherwise, it will be closed.
  • [X] 2. Please use English, otherwise it will be closed.

Motivation

As https://github.com/thu-ml/SageAttention mentions, quantized 8-bit attention can speed up inference by roughly 2x or more with the same accuracy, so shall we give it a try or run some verification?

Related resources

github: https://github.com/thu-ml/SageAttention

Snowdar avatar Oct 23 '24 09:10 Snowdar

contributions are welcome

merrymercy avatar Oct 24 '24 01:10 merrymercy

I'd like to work on this issue. This would be my first open-source contribution, and I'm eager to learn and help out. Please let me know if you have any specific guidelines I should follow.

PratishthaGaur avatar Nov 05 '24 02:11 PratishthaGaur

@PratishthaGaur https://sgl-project.github.io/references/contributor_guide.html please send a PR and we can review that.

merrymercy avatar Nov 14 '24 18:11 merrymercy

Any progress? If not, @Qiaolin-Yu and I would like to take it. @zhaochenyang20

Beichen-Ma avatar Feb 19 '25 16:02 Beichen-Ma

@Beichen-Ma Great! Go and have a try.

zhaochenyang20 avatar Feb 19 '25 18:02 zhaochenyang20

@zhaochenyang20 I can take a first pass if that’s open.

I’m planning to integrate SageAttention’s 8-bit path behind a runtime flag so we can benchmark and verify correctness before defaulting it on.

Plan:

1. Wrap SageAttention's 8-bit kernels behind `--quant_attention=8bit` (default: off).
2. Add unit tests comparing logits and perplexity against an FP16 baseline.
3. Benchmark throughput on A100 (and optionally 4090), measuring tokens/sec, memory footprint, and activation precision drift.
4. Report accuracy deltas on short/long contexts to validate Sage's "≈ same accuracy" claim.

Once validated, we could expose it through `sglang serve` as an optional backend flag for inference-only workloads.
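To make the verification step concrete, here is a minimal, self-contained sketch of the kind of logits-comparison test described above. It does not call SageAttention's actual kernels; it implements a toy per-tensor INT8 quantization of Q and K (with K mean-centering, in the spirit of SageAttention's K smoothing) and checks the output against a full-precision attention baseline. All function names here are illustrative, not sglang or SageAttention APIs.

```python
# Hypothetical verification sketch (not SageAttention's real kernels):
# compare INT8-quantized-QK attention against an FP32 baseline.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_fp32(q, k, v):
    # Standard scaled dot-product attention in full precision.
    scores = (q @ k.T) / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def quantize_int8(x):
    # Per-tensor symmetric INT8 quantization: return ints and the scale.
    scale = np.abs(x).max() / 127.0
    return np.round(x / scale).astype(np.int8), scale

def attention_int8_qk(q, k, v):
    # Center K across tokens before quantizing (reduces error from
    # channel outliers, mimicking SageAttention's K smoothing).
    k_mean = k.mean(axis=0, keepdims=True)
    qi, qs = quantize_int8(q)
    ki, ks = quantize_int8(k - k_mean)
    # QK^T accumulates in int32, then dequantize; add back the mean term.
    scores = (qi.astype(np.int32) @ ki.astype(np.int32).T) * (qs * ks)
    scores = (scores + q @ k_mean.T) / np.sqrt(q.shape[-1])
    # Softmax and the PV product stay in full precision.
    return softmax(scores) @ v

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((16, 64)).astype(np.float32) for _ in range(3))
ref = attention_fp32(q, k, v)
out = attention_int8_qk(q, k, v)
max_err = float(np.abs(ref - out).max())
print(f"max abs diff vs FP32 baseline: {max_err:.4f}")
```

A real test for the PR would do the same comparison on model logits and perplexity with the actual SageAttention path enabled via the proposed flag.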

Does that direction align with what maintainers expect before submitting a PR?

vyalamar avatar Nov 06 '25 20:11 vyalamar

@vyalamar sounds great. Please go ahead!

zhaochenyang20 avatar Nov 07 '25 00:11 zhaochenyang20