SpecForge
Why is the TPS of eagle3-qwen lower than that of the original Qwen3 in SGLang inference on a single H20 when speculative decoding is enabled?
Hello, I'm testing generation speed for 100 tokens on a single H20. With SGLang inference, the original Qwen3 reaches 200 TPS, while the run with the EAGLE3 draft model only reaches 130 TPS. What is the reason for this?
Performance degradation can happen when the load is very high, because speculative decoding introduces extra computation: the draft model's forward passes plus the target model's verification of the drafted tokens.
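
To make the trade-off concrete, here is a rough back-of-the-envelope sketch. It is not SGLang's or SpecForge's actual scheduling logic. It assumes a simple chain-style draft (EAGLE3 actually drafts a tree), a fixed per-token acceptance rate, and costs expressed relative to one ordinary decode step of the target model. The function names and every number below are hypothetical and only illustrate why extra verification cost under high load can push the speedup below 1.

```python
def expected_tokens_per_cycle(accept_rate: float, draft_len: int) -> float:
    """Expected tokens emitted per draft+verify cycle.

    Assumes each drafted token is accepted independently with probability
    `accept_rate` (a simplification), and that the target model always
    contributes one token of its own at the end of the cycle.
    """
    if accept_rate >= 1.0:
        return float(draft_len + 1)
    return (1.0 - accept_rate ** (draft_len + 1)) / (1.0 - accept_rate)


def estimated_speedup(accept_rate: float, draft_len: int,
                      draft_cost: float, verify_cost: float) -> float:
    """Estimated speedup over plain decoding.

    `draft_cost` and `verify_cost` are relative to one normal target decode
    step (cost 1.0). When the GPU is lightly loaded, verifying several tokens
    costs about the same as decoding one (verify_cost close to 1). When the
    GPU is already saturated, verify_cost grows with the number of tokens
    being verified.
    """
    cycle_cost = draft_len * draft_cost + verify_cost
    return expected_tokens_per_cycle(accept_rate, draft_len) / cycle_cost


# Hypothetical numbers, not measurements from this issue.
light_load = estimated_speedup(accept_rate=0.7, draft_len=4,
                               draft_cost=0.05, verify_cost=1.0)
heavy_load = estimated_speedup(accept_rate=0.7, draft_len=4,
                               draft_cost=0.05, verify_cost=3.0)
print(f"light load speedup: {light_load:.2f}")
print(f"heavy load speedup: {heavy_load:.2f}")
```

With these illustrative inputs the estimate is above 1 when verification is nearly free and below 1 once verification becomes expensive. A low acceptance rate has a similar effect, so it is worth checking the accepted length reported for your run as well as the load.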