rtp-llm
[Feature] New Speculative Decoding Framework
The existing speculative decoding framework has high CPU overhead, so we are developing a new framework that significantly reduces CPU consumption and minimizes device-to-host synchronization.
Worklist
- [ ] support single-step & multi-step MTP [#305]
- [ ] support py model & cuda graph
- [ ] support PD-separation
- [ ] support DP
- [ ] fast & async MTP process
- [ ] vocab prune