[Feature] New Speculative Decoding Framework

Open Vinkle-hzt opened this issue 2 months ago • 0 comments

The existing speculative decoding framework incurs high CPU overhead, so we are developing a new framework that significantly reduces CPU consumption and minimizes device-to-host synchronization.
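
For readers new to the topic, here is a minimal sketch of the greedy draft-then-verify step that speculative decoding relies on. It is plain Python for illustration only and is not this project's implementation. The function name `verify_greedy` and its argument layout are hypothetical. The assumption is that a real framework would run the comparison on the device in batch and copy back only the accepted count and tokens, which is one way to keep device-to-host synchronization to a single transfer per step.

```python
from typing import List, Tuple


def verify_greedy(draft_tokens: List[int], target_tokens: List[int]) -> Tuple[int, List[int]]:
    """Greedy verification of drafted tokens (hypothetical helper, illustration only).

    draft_tokens:  k tokens proposed by the draft (e.g. MTP) module.
    target_tokens: k + 1 greedy choices from the target model, where
                   target_tokens[i] is the target's pick after the prefix
                   extended by draft_tokens[:i].
    Returns the number of accepted draft tokens and the tokens to emit.
    """
    if len(target_tokens) != len(draft_tokens) + 1:
        raise ValueError("target_tokens must have exactly one more entry than draft_tokens")

    # Accept the longest prefix on which the draft agrees with the target.
    accepted = 0
    for drafted, expected in zip(draft_tokens, target_tokens):
        if drafted != expected:
            break
        accepted += 1

    # Emit the accepted draft tokens plus one target token: the correction at
    # the first mismatch, or the extra "bonus" token if every draft matched.
    emitted = draft_tokens[:accepted] + [target_tokens[accepted]]
    return accepted, emitted


if __name__ == "__main__":
    print(verify_greedy([5, 7, 9], [5, 7, 2, 4]))
```

Each step emits at least one token, so output under greedy decoding is identical to running the target model alone; the draft only changes how many tokens are produced per target forward pass.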

Worklist

[ ] support single-step & multi-step MTP [#305]
[ ] support py model & CUDA graph
[ ] support PD-separation
[ ] support DP
[ ] fast & async MTP process
[ ] vocab prune

Oct 31 '25 07:10 Vinkle-hzt