mistral.rs
Implement Speculative Decoding
Speculative decoding: https://arxiv.org/pdf/2211.17192
This will refactor the pipeline structure to make the sampling process more abstracted. It will also abstract the scheduling and KV cache management.
Restriction
- Requires same vocab
Algorithm
Given draft model q and target model p with probability distributions $q_i(x)$ and $p_i(x)$ for each token position i:
- Keep the sampled token if $q_i(x) \le p_i(x)$
- This means the target model agrees with the draft
- Else (if $q_i(x) > p_i(x)$), accept that token with probability $\frac{p_i(x)}{q_i(x)}$
- If rejected, sample a token from $p'(x) = \mathrm{norm}(\max(0, p(x) - q(x)))$ and do not accept any further draft tokens
- Note that this is really $p'(x) = \mathrm{norm}(\mathrm{ReLU}(p(x) - q(x)))$
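The accept/reject loop above can be sketched in Rust. This is a minimal illustration of the paper's algorithm, not the mistral.rs implementation: the function names, the flattened `Vec<f64>` distributions, and the tiny xorshift RNG (used to keep the sketch dependency-free) are all assumptions for the example.

```rust
/// Minimal xorshift64 RNG so the sketch needs no external crates.
struct Rng(u64);
impl Rng {
    fn next_f64(&mut self) -> f64 {
        self.0 ^= self.0 << 13;
        self.0 ^= self.0 >> 7;
        self.0 ^= self.0 << 17;
        // Map the top 53 bits to [0, 1).
        (self.0 >> 11) as f64 / (1u64 << 53) as f64
    }
}

/// p'(x) = norm(ReLU(p(x) - q(x))): the residual distribution sampled after a rejection.
fn residual(p: &[f64], q: &[f64]) -> Vec<f64> {
    let relu: Vec<f64> = p.iter().zip(q).map(|(pi, qi)| (pi - qi).max(0.0)).collect();
    let z: f64 = relu.iter().sum();
    // If p == q the residual is all-zero; fall back to p itself.
    if z == 0.0 { p.to_vec() } else { relu.iter().map(|r| r / z).collect() }
}

/// Sample an index from a categorical distribution via inverse CDF.
fn sample(dist: &[f64], rng: &mut Rng) -> usize {
    let mut u = rng.next_f64();
    for (i, &d) in dist.iter().enumerate() {
        if u < d { return i; }
        u -= d;
    }
    dist.len() - 1
}

/// Token-level accept/reject step: returns the draft tokens kept, plus one
/// token sampled from the residual distribution if a rejection occurs.
fn speculative_accept(
    draft_tokens: &[usize],
    q: &[Vec<f64>], // draft-model distributions, one per drafted position
    p: &[Vec<f64>], // target-model distributions at the same positions
    rng: &mut Rng,
) -> Vec<usize> {
    let mut out = Vec::new();
    for (i, &tok) in draft_tokens.iter().enumerate() {
        let (pi, qi) = (p[i][tok], q[i][tok]);
        // Accept outright if q_i(x) <= p_i(x), else with probability p_i(x)/q_i(x).
        if qi <= pi || rng.next_f64() < pi / qi {
            out.push(tok);
        } else {
            // Rejected: sample from p' and take no further draft tokens.
            out.push(sample(&residual(&p[i], &q[i]), rng));
            return out;
        }
    }
    out
}
```

In the full algorithm, when every draft token is accepted the target model also contributes one extra token from its own distribution at the next position; that step is omitted here for brevity.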
Code Metrics Report
```
───────────────────────────────────────────────────────────────────────────────
Language      Files    Lines    Blanks    Comments    Code     Complexity
───────────────────────────────────────────────────────────────────────────────
Rust          72       23863    1572      530         21761    1325
───────────────────────────────────────────────────────────────────────────────
Total         72       23863    1572      530         21761    1325
───────────────────────────────────────────────────────────────────────────────
Estimated Cost to Develop 85,737
Estimated Schedule Effort 11.916649 months
Estimated People Required 5.112342
───────────────────────────────────────────────────────────────────────────────
Processed 793364 bytes, 0.793 megabytes (SI)
───────────────────────────────────────────────────────────────────────────────
```
It would be very useful to relax the requirement of exact same tokenizer for main and draft models like here: https://github.com/vllm-project/vllm/pull/2188
Yes, this implementation only checks if the vocabs are the same: see this check.
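For context, a compatibility gate of the kind described can be sketched as an exact comparison of the two tokenizers' token-to-id maps. This is a hypothetical illustration, not the actual mistral.rs check; the `vocabs_match` name and `HashMap<String, u32>` representation are assumptions.

```rust
use std::collections::HashMap;

/// Hypothetical gate: speculative decoding as implemented requires the draft
/// and target tokenizers to map exactly the same strings to the same ids.
fn vocabs_match(target: &HashMap<String, u32>, draft: &HashMap<String, u32>) -> bool {
    target == draft
}
```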
I understand that the same-vocab case is much easier to code, but if this requirement is relaxed, people can use a ready-made small draft model even if their LLM is incompatible with it (which will often be the case).
That sounds great! Can you please give an example of how I should relax the requirement?
This PR adds the base framework for speculative decoding. Further speed improvements will follow, in addition to self-speculative decoding.