
[Feature] New Speculative Decoding Framework

Open Vinkle-hzt opened this issue 2 months ago • 0 comments

The existing speculative decoding framework has high CPU overhead, so we are developing a new framework that significantly reduces CPU consumption and minimizes device-to-host synchronization.

Worklist

  • [ ] support single-step & multi-step MTP [#305]
  • [ ] support Python model & CUDA graph
  • [ ] support PD-separation
  • [ ] support DP
  • [ ] fast & async MTP processing
  • [ ] vocab pruning

Vinkle-hzt, Oct 31 '25 07:10