
Implement Speculative Decoding

Open EricLBuehler opened this issue 9 months ago • 5 comments

Speculative decoding: https://arxiv.org/pdf/2211.17192

This will refactor the pipeline structure to abstract the sampling process; it will also abstract scheduling and KV cache management.
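
For a rough, hypothetical illustration of the kind of abstraction this implies (none of these trait or method names are from the actual mistral.rs codebase):

```rust
use std::error::Error;

/// Hypothetical sketch: a pipeline interface abstract enough that a
/// speculative-decoding driver can treat the draft and target models
/// uniformly, including their sampling and KV-cache handling.
trait SamplingPipeline {
    /// Run a forward pass and return next-token logits.
    fn forward(&mut self, tokens: &[u32]) -> Result<Vec<f32>, Box<dyn Error>>;
    /// Sample a token id from the logits.
    fn sample(&self, logits: &[f32]) -> u32;
    /// Roll the KV cache back by `n` tokens after rejected draft tokens.
    fn rollback_cache(&mut self, n: usize);
}
```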

Restriction

  • Requires the draft and target models to share the same vocab

Algorithm

Given a draft model $q$ and a target model $p$, with probability distributions $q_i(x)$ and $p_i(x)$ at each token position $i$:

  • Keep the sample for token $i$ if $q_i(x) \le p_i(x)$
    • This means the target model agrees with the draft
  • Else (if $q_i(x) > p_i(x)$), accept that token with probability $\frac{p_i(x)}{q_i(x)}$
    • If rejected, sample a token from $p'(x) = \mathrm{norm}(\max(0, p(x) - q(x)))$ and do not accept any further draft tokens
    • Note that this is really $p'(x) = \mathrm{norm}(\mathrm{ReLU}(p(x) - q(x)))$
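
A minimal Rust sketch of this accept/resample step, assuming `f32` probability vectors over the shared vocab and the `rand` crate; the function names are illustrative, not from the mistral.rs codebase:

```rust
use rand::Rng;

/// Accept/reject step: keep the draft token when q(x) <= p(x),
/// otherwise accept it with probability p(x)/q(x).
fn accept_draft_token(q_x: f32, p_x: f32, rng: &mut impl Rng) -> bool {
    q_x <= p_x || rng.gen::<f32>() < p_x / q_x
}

/// On rejection, resample from p'(x) = norm(max(0, p(x) - q(x)))
/// over the shared vocabulary, via inverse-CDF sampling.
fn resample_on_rejection(p: &[f32], q: &[f32], rng: &mut impl Rng) -> usize {
    // Unnormalized residual distribution max(0, p(x) - q(x)).
    let residual: Vec<f32> = p
        .iter()
        .zip(q)
        .map(|(&pi, &qi)| (pi - qi).max(0.0))
        .collect();
    let total: f32 = residual.iter().sum();
    // Inverse-CDF sample: draw u in [0, total) and walk the residual mass.
    let mut u = rng.gen::<f32>() * total;
    for (i, r) in residual.iter().enumerate() {
        u -= r;
        if u < 0.0 {
            return i;
        }
    }
    residual.len() - 1 // fallback for floating-point rounding
}
```

In the full algorithm these two functions would run once per draft token inside the verification loop, rolling the KV cache back to the first rejected position.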

EricLBuehler avatar Apr 28 '24 22:04 EricLBuehler

Code Metrics Report
  ───────────────────────────────────────────────────────────────────────────────
Language                 Files     Lines   Blanks  Comments     Code Complexity
───────────────────────────────────────────────────────────────────────────────
Rust                        72     23863     1572       530    21761       1325
───────────────────────────────────────────────────────────────────────────────
Total                       72     23863     1572       530    21761       1325
───────────────────────────────────────────────────────────────────────────────
Estimated Cost to Develop $85,737
Estimated Schedule Effort 11.916649 months
Estimated People Required 5.112342
───────────────────────────────────────────────────────────────────────────────
Processed 793364 bytes, 0.793 megabytes (SI)
───────────────────────────────────────────────────────────────────────────────
  

github-actions[bot] avatar Apr 28 '24 22:04 github-actions[bot]

It would be very useful to relax the requirement of the exact same tokenizer for the main and draft models, as done here: https://github.com/vllm-project/vllm/pull/2188

kir-gadjello avatar Apr 29 '24 02:04 kir-gadjello

It would be very useful to relax the requirement of the exact same tokenizer for the main and draft models, as done here: vllm-project/vllm#2188

Yes, this implementation only checks if the vocabs are the same: see this check.
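
For reference, a hedged sketch of what such a vocab check could look like using the `tokenizers` crate; the function name and error message are illustrative, not the linked code:

```rust
use tokenizers::Tokenizer;

/// Illustrative sketch: require identical vocab maps (including added
/// tokens) between the draft and target tokenizers.
fn check_same_vocab(draft: &Tokenizer, target: &Tokenizer) -> anyhow::Result<()> {
    if draft.get_vocab(true) != target.get_vocab(true) {
        anyhow::bail!("speculative decoding requires the draft and target models to share a vocabulary");
    }
    Ok(())
}
```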

EricLBuehler avatar Apr 29 '24 02:04 EricLBuehler

It would be very useful to relax the requirement of the exact same tokenizer for the main and draft models, as done here: vllm-project/vllm#2188

Yes, this implementation only checks if the vocabs are the same: see this check.

I understand that the same-vocab case is much easier to code, but if this requirement is relaxed, people can use a ready-made small draft model even if their LLM is incompatible with it (which will often be the case).

kir-gadjello avatar May 01 '24 21:05 kir-gadjello

I understand that the same-vocab case is much easier to code, but if this requirement is relaxed, people can use a ready-made small draft model even if their LLM is incompatible with it (which will often be the case).

That sounds great! Can you please give an example of how I should relax the requirement?

EricLBuehler avatar May 01 '24 21:05 EricLBuehler

This PR adds the base framework for speculative decoding. Further speed improvements will be added, along with self-speculative decoding.

EricLBuehler avatar May 11 '24 02:05 EricLBuehler