Support speculative decoding in `server` example
Prerequisites
Please answer the following questions for yourself before submitting an issue.
- [x] I am running the latest code. Development is very rapid so there are no tagged versions as of now.
- [x] I carefully followed the README.md.
- [x] I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- [x] I reviewed the Discussions, and have a new bug or useful enhancement to share.
Feature Description
Provide speculative decoding through the `server` example.
Motivation
Noticed this topic popped up in several comments (1, 2, 3) but it seems we haven't officially opened an issue for it. I'm creating this to provide a space for focused discussion on how we can implement this feature and actually get this started.
Possible Implementation
Perhaps move the speculative sampling implementation to `common` or `sampling`?
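To make the idea concrete, below is a minimal, self-contained sketch of greedy draft-and-verify speculative decoding. Everything here (the Model interface, speculative_step, and the two stub models) is purely illustrative and not the llama.cpp API; a real server integration would drive two llama.cpp contexts and batch the verification pass on the target model.

#include <cstdio>
#include <vector>

using Token = int;

// Hypothetical stand-in for a model context; a real integration would call
// into the llama.cpp API instead of this interface.
struct Model {
    virtual Token predict(const std::vector<Token> & ctx) const = 0; // greedy next token
    virtual ~Model() = default;
};

// One speculative step: the small draft model proposes n_draft tokens, the
// large target model verifies them and keeps the longest matching prefix,
// then contributes one token of its own. With greedy sampling the result is
// identical to decoding with the target alone, but the target needs far
// fewer sequential evaluations when the acceptance rate is high.
static std::vector<Token> speculative_step(const Model & target, const Model & draft,
                                           std::vector<Token> ctx, int n_draft) {
    // 1. Draft phase: cheap autoregressive proposals from the small model.
    std::vector<Token> proposed;
    std::vector<Token> draft_ctx = ctx;
    for (int i = 0; i < n_draft; ++i) {
        const Token t = draft.predict(draft_ctx);
        proposed.push_back(t);
        draft_ctx.push_back(t);
    }

    // 2. Verify phase: in a real implementation the target scores all proposed
    //    positions in a single batch; here we simply compare greedy choices.
    std::vector<Token> accepted;
    for (const Token t : proposed) {
        const Token expected = target.predict(ctx);
        if (expected != t) {
            // First mismatch: drop the rest of the draft and keep the target's
            // own token, so every step makes at least one token of progress.
            accepted.push_back(expected);
            return accepted;
        }
        accepted.push_back(t);
        ctx.push_back(t);
    }

    // Every drafted token was accepted: add one bonus token from the target.
    accepted.push_back(target.predict(ctx));
    return accepted;
}

// Toy stand-ins so the sketch runs: the "target" emits a fixed pattern and the
// "draft" reproduces it except at every 4th position.
struct TargetStub : Model {
    Token predict(const std::vector<Token> & ctx) const override {
        return (Token) (ctx.size() % 100);
    }
};

struct DraftStub : Model {
    Token predict(const std::vector<Token> & ctx) const override {
        const Token t = (Token) (ctx.size() % 100);
        return ctx.size() % 4 == 3 ? t + 1 : t;
    }
};

int main() {
    TargetStub target;
    DraftStub  draft;
    std::vector<Token> ctx = {1, 2, 3}; // "prompt"
    while (ctx.size() < 64) {
        const std::vector<Token> out = speculative_step(target, draft, ctx, /*n_draft =*/ 8);
        std::printf("step accepted %zu token(s)\n", out.size());
        ctx.insert(ctx.end(), out.begin(), out.end());
    }
    return 0;
}

With greedy sampling (as in the --top_k 1 / --temp 0 benchmarks later in this thread), the accepted output is identical to what the target model alone would produce; the gain comes from the target verifying a whole drafted chunk per pass instead of emitting one token per pass.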
Any updates on this?
@vietanh125 Not yet, but contributions are welcome 😃
There is ongoing related work in https://github.com/ggerganov/llama.cpp/pull/6828, though I haven't had time to look into the details yet.
Sorry, does that mean the server doesn't support speculative decoding? However, I can run it with arguments like the ones below in Kubernetes.
Just a sample:
spec:
  containers:
  - args:
    - -m
    - /workspace/models/llama-2-7b.Q8_0.gguf
    - -md
    - /workspace/models/llama-2-7b.Q2_K.gguf
    - --port
    - "8080"
    - --host
    - 0.0.0.0
    - -fa
    command:
    - ./llama-server
Not yet supported
Ok so the -md doesn't work here 😀
Also interested in this PR. Thank you to everyone contributing to a solution here.
The #6828 PR is a distinct technique that uses a lookup file to speculate tokens instead of a draft model; it seems to yield less speedup than draft-based speculative decoding.
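For contrast, here is a rough, hypothetical sketch of the lookup idea (it is not the ngram-cache code from #6828): instead of running a second model, the drafter searches the tokens generated so far for the longest suffix of the current context that occurred earlier, and proposes whatever followed it back then.

#include <algorithm>
#include <vector>

using Token = int;

// Propose up to n_draft tokens by finding the longest recent n-gram (a suffix
// of `history`) that also occurs earlier, and reusing whatever followed it.
// Returns an empty vector when nothing matches, in which case the caller just
// decodes normally for this step.
std::vector<Token> lookup_draft(const std::vector<Token> & history, int max_ngram, int n_draft) {
    const int n = (int) history.size();
    for (int len = std::min(max_ngram, n - 1); len >= 1; --len) {
        const Token * pattern = history.data() + n - len; // the last `len` tokens
        // Scan backwards so the most recent earlier occurrence wins.
        for (int start = n - len - 1; start >= 0; --start) {
            if (std::equal(pattern, pattern + len, history.data() + start)) {
                std::vector<Token> draft;
                for (int i = start + len; i < n && (int) draft.size() < n_draft; ++i) {
                    draft.push_back(history[i]); // tokens that followed the match
                }
                if (!draft.empty()) {
                    return draft;
                }
            }
        }
    }
    return {};
}

Drafting this way is nearly free per step, but the acceptance rate depends on how repetitive the text is, which fits the observation above that it tends to give less speedup than a proper draft model.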
Support would be really nice to have, because now there is the official Llama 3.2 in 1B and 3B, which should be suitable drafts for the 8B/70B 3.1 models, at least according to the official HF notebook: https://github.com/huggingface/huggingface-llama-recipes/blob/main/assisted_decoding_8B_1B.ipynb
Yeah, definitely. With Llama-3.2-3B Q8 as the draft and Llama-3.1-70B-Instruct as the model (Q5_K, to fit on two 32 GB Tesla V100s),
we go from 10 t/s to 30 t/s. Very impressive, I'd say.
CUDA_VISIBLE_DEVICES=0,1 ./llama-speculative \
-m Meta-Llama-3.1-70B-Instruct-Q5_K_M-00001-of-00002.gguf \
-md Llama-3.2-3B-Instruct-Q8_0.gguf \
-p "// Quick-sort implementation in C (4 spaces indentation + detailed comments) and sample usage" \
-t 4 -n 512 -c 8192 -s 8 --top_k 1 \
--draft 16 -ngl 88 -ngld 30 --temp 0
encoded 18 tokens in 0.286 seconds, speed: 62.924 t/s
decoded 514 tokens in 15.774 seconds, speed: 32.586 t/s
n_draft = 16
n_predict = 514
n_drafted = 688
n_accept = 470
accept = 68.314%
draft:
llama_perf_context_print: load time = 2505.46 ms
llama_perf_context_print: prompt eval time = 9569.42 ms / 103 tokens ( 92.91 ms per token, 10.76 tokens per second)
llama_perf_context_print: eval time = 5622.49 ms / 645 runs ( 8.72 ms per token, 114.72 tokens per second)
llama_perf_context_print: total time = 16079.15 ms / 748 tokens
target:
llama_perf_sampler_print: sampling time = 112.92 ms / 514 runs ( 0.22 ms per token, 4551.81 tokens per second)
llama_perf_context_print: load time = 23527.56 ms
llama_perf_context_print: prompt eval time = 9077.52 ms / 749 tokens ( 12.12 ms per token, 82.51 tokens per second)
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: total time = 18584.66 ms / 750 tokens
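(For reference, the accept figure is n_accept / n_drafted: 470 / 688 ≈ 68.3%, i.e. roughly two out of every three drafted tokens survive verification by the target model.)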
EDIT: with Llama-3.2-1B Q8 as the draft, that can go up to 40 t/s.
Wait, what happened? I used to run llama-server with speculative decoding via -md. I just "upgraded" and -md went away. Now there's a separate program called llama-speculative, but it doesn't appear to be a server. Sigh :( Guess I have to downgrade and find the version where it went away...
@enn-nafnlaus Did you find the version where it went away? Would appreciate any leads.
The last commit with -md in llama-server was https://github.com/ggerganov/llama.cpp/commit/554c247caffed64465f372661f2826640cb10430 but it never worked anyway. The speculative decoding flags were silently discarded and no speculator model was loaded.
Came to ask the same as other folks have stated here - looks like -md is no longer an option for the server. @ggerganov do you have any plans to implement speculative decoding for the server component?
Is anyone working on this issue? Or is this possibly blocked by something?
I am already preparing for this feature to be implemented in Ollama, but that depends on it being implemented in llama-server here.
I don't mind giving this issue a shot; it is labeled as a good first issue, and if that's accurate it would be a suitable candidate for my first commit.
I had a quick look, and from what I can see there is already an example implementation in speculative. I assume I can use that as a starting point for implementing it at the server level.
Are there any additional pointers or specific considerations for the implementation I should be aware of?
At the very least, the llama-speculative example has to be fixed first (https://github.com/ggerganov/llama.cpp/issues/10176#issuecomment-2459450448), and it then has to demonstrate some meaningful gains to justify implementing this feature in the server.
FWIW, I ran a test this morning before I went hunting and stumbled into this thread:
encoded 25 tokens in 1.049 seconds, speed: 23.830 t/s
decoded 922 tokens in 60.516 seconds, speed: 15.236 t/s
n_draft = 8
n_predict = 922
n_drafted = 1024
n_accept = 793
accept = 77.441%
draft:
llama_perf_context_print: load time = 1968.76 ms
llama_perf_context_print: prompt eval time = 44285.19 ms / 280 tokens ( 158.16 ms per token, 6.32 tokens per second)
llama_perf_context_print: eval time = 15870.80 ms / 896 runs ( 17.71 ms per token, 56.46 tokens per second)
llama_perf_context_print: total time = 61568.07 ms / 1176 tokens
target:
llama_perf_sampler_print: sampling time = 48.98 ms / 922 runs ( 0.05 ms per token, 18822.47 tokens per second)
llama_perf_context_print: load time = 2290.46 ms
llama_perf_context_print: prompt eval time = 40887.53 ms / 1177 tokens ( 34.74 ms per token, 28.79 tokens per second)
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: total time = 63536.87 ms / 1178 tokens
ggml_metal_free: deallocating
ggml_metal_free: deallocating
real 1m6.106s
user 0m1.702s
sys 0m3.050s
(venv) bash-3.2$
That run used a Q4_K_L Qwen2.5-Coder-7B-Instruct draft with a Q4_K_L Qwen2.5-Coder-32B-Instruct main model (bartowski GGUF quants from HF).
llama_perf_sampler_print: sampling time = 57.81 ms / 1049 runs ( 0.06 ms per token, 18145.02 tokens per second)
llama_perf_context_print: load time = 1841.75 ms
llama_perf_context_print: prompt eval time = 311.89 ms / 25 tokens ( 12.48 ms per token, 80.16 tokens per second)
llama_perf_context_print: eval time = 99573.78 ms / 1023 runs ( 97.34 ms per token, 10.27 tokens per second)
llama_perf_context_print: total time = 100001.10 ms / 1048 tokens
ggml_metal_free: deallocating
real 1m41.974s
user 0m1.666s
sys 0m1.412s
(venv) bash-3.2$
The above was the performance without the draft model.
This is on an M3 Max MacBook Pro with 128 GB.
~53% performance increase when using the draft model, based on wall-clock time (1m6.106s vs 1m41.974s) including the double warmup for the speculative run.
I went immediately to see if I could add it to the server, since I remembered abetlen merging draft-model support way back when (although that required the Python bindings), and found this thread.
In case I was doing something wrong, here's my CLI:
./llama-speculative -m /var/tmp/models/bartowski/Qwen2.5-Coder-32B-Instruct-GGUF/Qwen2.5-Coder-32B-Instruct-Q4_K_L.gguf -md /var/tmp/models/bartowski/Qwen2.5-Coder-7B-Instruct-GGUF/Qwen2.5-Coder-7B-Instruct-Q4_K_L.gguf -p "# FastAPI app for managing notes. Filenames are annotated as # relative/path/to/file.py\n\n#server/app.py\n" -e -ngl 999 -ngld 999 -c 0 -t 4 -n 1024 --draft 8
and
time ./llama-cli -m /var/tmp/models/bartowski/Qwen2.5-Coder-32B-Instruct-GGUF/Qwen2.5-Coder-32B-Instruct-Q4_K_L.gguf -p "# FastAPI app for managing notes. Filenames are annotated as # relative/path/to/file.py\n\n#server/app.py\n" -e -ngl 999 -c 0 -t 4 -n 1024
~53% performance increase when using the draft model
Glad to hear this; it is pretty similar to ExLlamaV2.
The Qwen 2.5 model family is a good example for this as well: you can basically use the small 1.5B or even 0.5B model as the draft with the big 72B model and get an excellent boost.
I also ran some smaller-scale tests, which I wanted to share to bring some additional perspective (this is on an RTX 3060 with 12 GB of VRAM, so I can't fit models as large as those above):
./llama-cli -m llama3.1\:8b-instruct-q8_0 -p "I believe, in one sentence, that the meaning of life is" -ngl 33
llama_perf_sampler_print: sampling time = 601.84 ms / 913 runs ( 0.66 ms per token, 1517.01 tokens per second)
llama_perf_context_print: load time = 3575.03 ms
llama_perf_context_print: prompt eval time = 56.80 ms / 14 tokens ( 4.06 ms per token, 246.46 tokens per second)
llama_perf_context_print: eval time = 27636.04 ms / 898 runs ( 30.78 ms per token, 32.49 tokens per second)
llama_perf_context_print: total time = 29298.72 ms / 912 tokens
./llama-speculative -m llama3.1\:8b-instruct-q8_0 -ngl 33 -md llama3.2\:1b-instruct-q8_0 -ngld 17 -p "I believe, in one sentence, that the meaning of life is"
encoded 14 tokens in 0.054 seconds, speed: 258.250 t/s
decoded 673 tokens in 12.132 seconds, speed: 55.473 t/s
n_draft = 5
n_predict = 673
n_drafted = 730
n_accept = 526
accept = 72.055%
draft:
llama_perf_context_print: load time = 2130.39 ms
llama_perf_context_print: prompt eval time = 7303.87 ms / 305 tokens ( 23.95 ms per token, 41.76 tokens per second)
llama_perf_context_print: eval time = 3685.16 ms / 584 runs ( 6.31 ms per token, 158.47 tokens per second)
llama_perf_context_print: total time = 12191.61 ms / 889 tokens
target:
llama_perf_sampler_print: sampling time = 466.15 ms / 673 runs ( 0.69 ms per token, 1443.75 tokens per second)
llama_perf_context_print: load time = 3573.26 ms
llama_perf_context_print: prompt eval time = 4898.77 ms / 890 tokens ( 5.50 ms per token, 181.68 tokens per second)
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: total time = 14322.06 ms / 891 tokens
Which, if I read that correctly, bumps the speed from 32.49 t/s to 55.473 t/s (55.473 / 32.49 ≈ 1.71), hence a speedup of roughly 70% (excluding model loading times).
Edit: this was on 9fe0fb0.
Any progress on allowing speculative decoding in the server?
It's already supported; this issue just hasn't been closed yet:
https://github.com/ggerganov/llama.cpp/pull/10455