SpecForge

Why is the TPS of EAGLE3-Qwen lower than that of the original Qwen3 in SGLang inference on a single H20 card when speculative decoding is enabled?

Open • positive666 opened this issue 5 months ago • 1 comment

Hello, I'm testing the generation speed for 100 tokens on a single H20. The original Qwen3 reaches 200 TPS in SGLang inference, while the same setup with the EAGLE3 draft model reaches only 130 TPS. What is the reason for this?

positive666 • Sep 17 '25 02:09

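Performance degradation can happen when the load is very high, because speculative decoding introduces extra computation: the draft model has to run, and the target model has to verify every drafted token, including the ones that end up rejected.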

justadogistaken • Sep 17 '25 03:09
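
The reply above can be illustrated with a rough cost model. This is only a sketch under stated assumptions, not how SGLang or EAGLE3 measure anything: it assumes a simple chain of `k` drafted tokens, each accepted independently with probability `alpha` (EAGLE3 actually drafts a tree), and the function names, cost ratios, and numbers are all hypothetical. The one thing it tries to show is that the same acceptance rate can give a speedup when verification is nearly free (low load, memory-bound) and a slowdown when verification cost grows with the number of tokens checked (high load, compute-bound).

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per verification step for a chain of k drafts.

    Assumes each drafted token is accepted independently with probability
    alpha; the target model always contributes one token of its own.
    """
    if alpha >= 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def relative_throughput(alpha: float, k: int,
                        draft_cost: float, verify_cost: float) -> float:
    """Throughput relative to plain decoding (1.0 means no change).

    draft_cost:  cost of one draft forward pass, in units of one plain
                 target decode step (hypothetical value).
    verify_cost: cost of verifying the k + 1 tokens in one target pass,
                 in the same units (hypothetical value).
    """
    step_cost = k * draft_cost + verify_cost
    return expected_tokens_per_step(alpha, k) / step_cost


# Low load: decoding is memory-bound, so verifying k + 1 tokens costs
# about the same as one ordinary decode step.
low_load = relative_throughput(alpha=0.7, k=3, draft_cost=0.05, verify_cost=1.0)

# High load: the GPU is already compute-bound, so verifying k + 1 tokens
# costs roughly k + 1 ordinary decode steps.
high_load = relative_throughput(alpha=0.7, k=3, draft_cost=0.05, verify_cost=4.0)

print(f"low load:  {low_load:.2f}x")
print(f"high load: {high_load:.2f}x")
```

With these illustrative numbers the low-load ratio comes out above 1 and the high-load ratio below 1. Whether the 200 TPS versus 130 TPS gap in this issue comes from that effect would have to be checked against the real acceptance length and batch size of the run.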