
Implement Speculative Decoding

Open EricLBuehler opened this issue 9 months ago • 5 comments

Speculative decoding: https://arxiv.org/pdf/2211.17192

This will refactor the pipeline structure to abstract the sampling process; it will also abstract scheduling and KV cache management.
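
For a rough, hypothetical illustration of the kind of abstraction this implies (none of these trait or method names are from the actual mistral.rs codebase):

```rust
use std::error::Error;

/// Hypothetical sketch: a pipeline interface abstract enough that a
/// speculative-decoding driver can treat the draft and target models
/// uniformly, including their sampling and KV-cache handling.
trait SamplingPipeline {
    /// Run a forward pass and return next-token logits.
    fn forward(&mut self, tokens: &[u32]) -> Result<Vec<f32>, Box<dyn Error>>;
    /// Sample a token id from the logits.
    fn sample(&self, logits: &[f32]) -> u32;
    /// Roll the KV cache back by `n` tokens after rejected draft tokens.
    fn rollback_cache(&mut self, n: usize);
}
```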

Restriction

  • Requires the draft and target models to share the same vocab

Algorithm

Given a draft model $q$ and a target model $p$, with probability distributions $q_i(x)$ and $p_i(x)$ at each token position $i$:

  • Keep the sample for token $i$ if $q_i(x) \le p_i(x)$
    • This means the target model agrees with the draft
  • Else (if $q_i(x) > p_i(x)$), accept that token with probability $\frac{p_i(x)}{q_i(x)}$
    • If rejected, sample a token from $p'(x) = \mathrm{norm}(\max(0, p(x) - q(x)))$ and do not accept any further draft tokens
    • Note that this is really $p'(x) = \mathrm{norm}(\mathrm{ReLU}(p(x) - q(x)))$
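
A minimal Rust sketch of this accept/resample step, assuming `f32` probability vectors over the shared vocab and the `rand` crate; the function names are illustrative, not from the mistral.rs codebase:

```rust
use rand::Rng;

/// Accept/reject step: keep the draft token when q(x) <= p(x),
/// otherwise accept it with probability p(x)/q(x).
fn accept_draft_token(q_x: f32, p_x: f32, rng: &mut impl Rng) -> bool {
    q_x <= p_x || rng.gen::<f32>() < p_x / q_x
}

/// On rejection, resample from p'(x) = norm(max(0, p(x) - q(x)))
/// over the shared vocabulary, via inverse-CDF sampling.
fn resample_on_rejection(p: &[f32], q: &[f32], rng: &mut impl Rng) -> usize {
    // Unnormalized residual distribution max(0, p(x) - q(x)).
    let residual: Vec<f32> = p
        .iter()
        .zip(q)
        .map(|(&pi, &qi)| (pi - qi).max(0.0))
        .collect();
    let total: f32 = residual.iter().sum();
    // Inverse-CDF sample: draw u in [0, total) and walk the residual mass.
    let mut u = rng.gen::<f32>() * total;
    for (i, r) in residual.iter().enumerate() {
        u -= r;
        if u < 0.0 {
            return i;
        }
    }
    residual.len() - 1 // fallback for floating-point rounding
}
```

In the full algorithm these two functions would run once per draft token inside the verification loop, rolling the KV cache back to the first rejected position.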

EricLBuehler avatar Apr 28 '24 22:04 EricLBuehler

Code Metrics Report
  ───────────────────────────────────────────────────────────────────────────────
Language                 Files     Lines   Blanks  Comments     Code Complexity
───────────────────────────────────────────────────────────────────────────────
Rust                        72     23863     1572       530    21761       1325
───────────────────────────────────────────────────────────────────────────────
Total                       72     23863     1572       530    21761       1325
───────────────────────────────────────────────────────────────────────────────
Estimated Cost to Develop $85,737
Estimated Schedule Effort 11.916649 months
Estimated People Required 5.112342
───────────────────────────────────────────────────────────────────────────────
Processed 793364 bytes, 0.793 megabytes (SI)
───────────────────────────────────────────────────────────────────────────────
  

github-actions[bot] avatar Apr 28 '24 22:04 github-actions[bot]

It would be very useful to relax the requirement of the exact same tokenizer for the main and draft models, as done here: https://github.com/vllm-project/vllm/pull/2188

kir-gadjello avatar Apr 29 '24 02:04 kir-gadjello

It would be very useful to relax the requirement of the exact same tokenizer for the main and draft models, as done here: vllm-project/vllm#2188

Yes, this implementation only checks if the vocabs are the same: see this check.
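
For reference, a hedged sketch of what such a vocab check could look like using the `tokenizers` crate; the function name and error message are illustrative, not the linked code:

```rust
use tokenizers::Tokenizer;

/// Illustrative sketch: require identical vocab maps (including added
/// tokens) between the draft and target tokenizers.
fn check_same_vocab(draft: &Tokenizer, target: &Tokenizer) -> anyhow::Result<()> {
    if draft.get_vocab(true) != target.get_vocab(true) {
        anyhow::bail!("speculative decoding requires the draft and target models to share a vocabulary");
    }
    Ok(())
}
```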

EricLBuehler avatar Apr 29 '24 02:04 EricLBuehler

It would be very useful to relax the requirement of the exact same tokenizer for the main and draft models, as done here: vllm-project/vllm#2188

Yes, this implementation only checks if the vocabs are the same: see this check.

I understand that the same-vocab case is much easier to code, but if this requirement is relaxed, people can use a ready-made small draft model even if their LLM is incompatible with it (which will often be the case).

kir-gadjello avatar May 01 '24 21:05 kir-gadjello

I understand that the same-vocab case is much easier to code, but if this requirement is relaxed, people can use a ready-made small draft model even if their LLM is incompatible with it (which will often be the case).

That sounds great! Can you please give an example of how I should relax the requirement?

EricLBuehler avatar May 01 '24 21:05 EricLBuehler

This PR adds the base framework for speculative decoding. Further speed improvements will be added, along with self-speculative decoding.

EricLBuehler avatar May 11 '24 02:05 EricLBuehler