Support speculative decoding in `server` example
Prerequisites
Please answer the following questions for yourself before submitting an issue.
- [x] I am running the latest code. Development is very rapid so there are no tagged versions as of now.
- [x] I carefully followed the README.md.
- [x] I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- [x] I reviewed the Discussions, and have a new bug or useful enhancement to share.
Feature Description
Provide speculative decoding through the `server` example.
Motivation
Noticed this topic popped up in several comments (1, 2, 3) but it seems we haven't officially opened an issue for it. I'm creating this to provide a space for focused discussion on how we can implement this feature and actually get this started.
Possible Implementation
Perhaps move the speculative sampling implementation to `common` or `sampling`?
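To make the idea concrete, below is a minimal, self-contained sketch of greedy draft-and-verify speculative decoding. Everything here (the Model interface, speculative_step, and the two stub models) is purely illustrative and not the llama.cpp API; a real server integration would drive two llama.cpp contexts and batch the verification pass on the target model.

#include <cstdio>
#include <vector>

using Token = int;

// Hypothetical stand-in for a model context; a real integration would call
// into the llama.cpp API instead of this interface.
struct Model {
    virtual Token predict(const std::vector<Token> & ctx) const = 0; // greedy next token
    virtual ~Model() = default;
};

// One speculative step: the small draft model proposes n_draft tokens, the
// large target model verifies them and keeps the longest matching prefix,
// then contributes one token of its own. With greedy sampling the result is
// identical to decoding with the target alone, but the target needs far
// fewer sequential evaluations when the acceptance rate is high.
static std::vector<Token> speculative_step(const Model & target, const Model & draft,
                                           std::vector<Token> ctx, int n_draft) {
    // 1. Draft phase: cheap autoregressive proposals from the small model.
    std::vector<Token> proposed;
    std::vector<Token> draft_ctx = ctx;
    for (int i = 0; i < n_draft; ++i) {
        const Token t = draft.predict(draft_ctx);
        proposed.push_back(t);
        draft_ctx.push_back(t);
    }

    // 2. Verify phase: in a real implementation the target scores all proposed
    //    positions in a single batch; here we simply compare greedy choices.
    std::vector<Token> accepted;
    for (const Token t : proposed) {
        const Token expected = target.predict(ctx);
        if (expected != t) {
            // First mismatch: drop the rest of the draft and keep the target's
            // own token, so every step makes at least one token of progress.
            accepted.push_back(expected);
            return accepted;
        }
        accepted.push_back(t);
        ctx.push_back(t);
    }

    // Every drafted token was accepted: add one bonus token from the target.
    accepted.push_back(target.predict(ctx));
    return accepted;
}

// Toy stand-ins so the sketch runs: the "target" emits a fixed pattern and the
// "draft" reproduces it except at every 4th position.
struct TargetStub : Model {
    Token predict(const std::vector<Token> & ctx) const override {
        return (Token) (ctx.size() % 100);
    }
};

struct DraftStub : Model {
    Token predict(const std::vector<Token> & ctx) const override {
        const Token t = (Token) (ctx.size() % 100);
        return ctx.size() % 4 == 3 ? t + 1 : t;
    }
};

int main() {
    TargetStub target;
    DraftStub  draft;
    std::vector<Token> ctx = {1, 2, 3}; // "prompt"
    while (ctx.size() < 64) {
        const std::vector<Token> out = speculative_step(target, draft, ctx, /*n_draft =*/ 8);
        std::printf("step accepted %zu token(s)\n", out.size());
        ctx.insert(ctx.end(), out.begin(), out.end());
    }
    return 0;
}

With greedy sampling (as in the --top_k 1 / --temp 0 benchmarks later in this thread), the accepted output is identical to what the target model alone would produce; the gain comes from the target verifying a whole drafted chunk per pass instead of emitting one token per pass.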
Any updates on this?
@vietanh125 Not yet, but contributions are welcome 😃
There is ongoing related work in https://github.com/ggerganov/llama.cpp/pull/6828, though I haven't had time to look into the details yet.
Sorry, does that mean the server doesn't support speculative decoding? However, I can run it with arguments like the ones below in Kubernetes.
Just a sample:
spec:
  containers:
  - args:
    - -m
    - /workspace/models/llama-2-7b.Q8_0.gguf
    - -md
    - /workspace/models/llama-2-7b.Q2_K.gguf
    - --port
    - "8080"
    - --host
    - 0.0.0.0
    - -fa
    command:
    - ./llama-server
Not yet supported
Ok so the -md doesn't work here 😀
Also interested in this PR. Thank you to everyone contributing to a solution here.
The #6828 PR is a distinct technique that uses a lookup file to speculate tokens instead of a draft model; it seems to yield less speedup than draft-based speculative decoding.
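For contrast, here is a rough, hypothetical sketch of the lookup idea (it is not the ngram-cache code from #6828): instead of running a second model, the drafter searches the tokens generated so far for the longest suffix of the current context that occurred earlier, and proposes whatever followed it back then.

#include <algorithm>
#include <vector>

using Token = int;

// Propose up to n_draft tokens by finding the longest recent n-gram (a suffix
// of `history`) that also occurs earlier, and reusing whatever followed it.
// Returns an empty vector when nothing matches, in which case the caller just
// decodes normally for this step.
std::vector<Token> lookup_draft(const std::vector<Token> & history, int max_ngram, int n_draft) {
    const int n = (int) history.size();
    for (int len = std::min(max_ngram, n - 1); len >= 1; --len) {
        const Token * pattern = history.data() + n - len; // the last `len` tokens
        // Scan backwards so the most recent earlier occurrence wins.
        for (int start = n - len - 1; start >= 0; --start) {
            if (std::equal(pattern, pattern + len, history.data() + start)) {
                std::vector<Token> draft;
                for (int i = start + len; i < n && (int) draft.size() < n_draft; ++i) {
                    draft.push_back(history[i]); // tokens that followed the match
                }
                if (!draft.empty()) {
                    return draft;
                }
            }
        }
    }
    return {};
}

Drafting this way is nearly free per step, but the acceptance rate depends on how repetitive the text is, which fits the observation above that it tends to give less speedup than a proper draft model.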
Support would be really nice to have, because now there is the official Llama 3.2 in 1B and 3B, which should be suitable drafts for the 8B/70B 3.1 models, at least according to the official HF notebook: https://github.com/huggingface/huggingface-llama-recipes/blob/main/assisted_decoding_8B_1B.ipynb
Yeah, definitely. With Llama-3.2-3B Q8 as the draft and Llama-3.1-70B-Instruct as the model (Q5_K, to fit on two 32 GB Tesla V100s),
we go from 10 t/s to 30 t/s. Very impressive, I'd say.
CUDA_VISIBLE_DEVICES=0,1 ./llama-speculative \
-m Meta-Llama-3.1-70B-Instruct-Q5_K_M-00001-of-00002.gguf \
-md Llama-3.2-3B-Instruct-Q8_0.gguf \
-p "// Quick-sort implementation in C (4 spaces indentation + detailed comments) and sample usage" \
-t 4 -n 512 -c 8192 -s 8 --top_k 1 \
--draft 16 -ngl 88 -ngld 30 --temp 0
encoded 18 tokens in 0.286 seconds, speed: 62.924 t/s
decoded 514 tokens in 15.774 seconds, speed: 32.586 t/s
n_draft = 16
n_predict = 514
n_drafted = 688
n_accept = 470
accept = 68.314%
draft:
llama_perf_context_print: load time = 2505.46 ms
llama_perf_context_print: prompt eval time = 9569.42 ms / 103 tokens ( 92.91 ms per token, 10.76 tokens per second)
llama_perf_context_print: eval time = 5622.49 ms / 645 runs ( 8.72 ms per token, 114.72 tokens per second)
llama_perf_context_print: total time = 16079.15 ms / 748 tokens
target:
llama_perf_sampler_print: sampling time = 112.92 ms / 514 runs ( 0.22 ms per token, 4551.81 tokens per second)
llama_perf_context_print: load time = 23527.56 ms
llama_perf_context_print: prompt eval time = 9077.52 ms / 749 tokens ( 12.12 ms per token, 82.51 tokens per second)
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: total time = 18584.66 ms / 750 tokens
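(For reference, the accept figure is n_accept / n_drafted: 470 / 688 ≈ 68.3%, i.e. roughly two out of every three drafted tokens survive verification by the target model.)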
EDIT: with Llama-3.2-1B Q8 as the draft, that can go up to 40 t/s.
Wait, what happened? I used to run llama-server with speculative decoding via -md. I just "upgraded" and -md went away. Now there's a separate program called llama-speculative, but it doesn't appear to be a server. Sigh :( Guess I have to downgrade and find the version where it went away...
@enn-nafnlaus Did you find the version where it went away? Would appreciate any leads.
The last commit with -md in llama-server was https://github.com/ggerganov/llama.cpp/commit/554c247caffed64465f372661f2826640cb10430 but it never worked anyway. The speculative decoding flags were silently discarded and no speculator model was loaded.
Came to ask the same as other folks have stated here - looks like -md is no longer an option for the server. @ggerganov do you have any plans to implement speculative decoding for the server component?
Is anyone working on this issue? Or is this possibly blocked by something?
I am already preparing for this feature to be implemented in Ollama, but that depends on it being implemented in llama-server here.
I don't mind giving this issue a shot; it is labeled as a good first issue, and if that's accurate it would be a suitable candidate for my first commit.
I had a quick look, and from what I can see there is already an example implementation in speculative. I assume I can use that as a starting point for implementing it at the server level.
Are there any additional pointers or specific considerations for the implementation I should be aware of?
At the very least, the llama-speculative example has to be fixed first (https://github.com/ggerganov/llama.cpp/issues/10176#issuecomment-2459450448), and it then has to demonstrate some meaningful gains to justify implementing this feature in the server.
FWIW, I ran a test this morning before I went hunting and stumbled into this thread:
encoded 25 tokens in 1.049 seconds, speed: 23.830 t/s
decoded 922 tokens in 60.516 seconds, speed: 15.236 t/s
n_draft = 8
n_predict = 922
n_drafted = 1024
n_accept = 793
accept = 77.441%
draft:
llama_perf_context_print: load time = 1968.76 ms
llama_perf_context_print: prompt eval time = 44285.19 ms / 280 tokens ( 158.16 ms per token, 6.32 tokens per second)
llama_perf_context_print: eval time = 15870.80 ms / 896 runs ( 17.71 ms per token, 56.46 tokens per second)
llama_perf_context_print: total time = 61568.07 ms / 1176 tokens
target:
llama_perf_sampler_print: sampling time = 48.98 ms / 922 runs ( 0.05 ms per token, 18822.47 tokens per second)
llama_perf_context_print: load time = 2290.46 ms
llama_perf_context_print: prompt eval time = 40887.53 ms / 1177 tokens ( 34.74 ms per token, 28.79 tokens per second)
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: total time = 63536.87 ms / 1178 tokens
ggml_metal_free: deallocating
ggml_metal_free: deallocating
real 1m6.106s
user 0m1.702s
sys 0m3.050s
(venv) bash-3.2$
That run used a Q4_K_L Qwen2.5-Coder-7B-Instruct draft with a Q4_K_L Qwen2.5-Coder-32B-Instruct main model (bartowski GGUF quants from HF).
llama_perf_sampler_print: sampling time = 57.81 ms / 1049 runs ( 0.06 ms per token, 18145.02 tokens per second)
llama_perf_context_print: load time = 1841.75 ms
llama_perf_context_print: prompt eval time = 311.89 ms / 25 tokens ( 12.48 ms per token, 80.16 tokens per second)
llama_perf_context_print: eval time = 99573.78 ms / 1023 runs ( 97.34 ms per token, 10.27 tokens per second)
llama_perf_context_print: total time = 100001.10 ms / 1048 tokens
ggml_metal_free: deallocating
real 1m41.974s
user 0m1.666s
sys 0m1.412s
(venv) bash-3.2$
The above was the performance without the draft model.
This is on an M3 Max MacBook Pro with 128 GB.
~53% performance increase when using the draft model, based on wall-clock time (1m6.106s vs 1m41.974s) including the double warmup for the speculative run.
I went immediately to see if I could add it to the server, since I remembered abetlen merging draft-model support way back when (although that required the Python bindings), and found this thread.
In case I was doing something wrong, here's my CLI:
./llama-speculative -m /var/tmp/models/bartowski/Qwen2.5-Coder-32B-Instruct-GGUF/Qwen2.5-Coder-32B-Instruct-Q4_K_L.gguf -md /var/tmp/models/bartowski/Qwen2.5-Coder-7B-Instruct-GGUF/Qwen2.5-Coder-7B-Instruct-Q4_K_L.gguf -p "# FastAPI app for managing notes. Filenames are annotated as # relative/path/to/file.py\n\n#server/app.py\n" -e -ngl 999 -ngld 999 -c 0 -t 4 -n 1024 --draft 8
and
time ./llama-cli -m /var/tmp/models/bartowski/Qwen2.5-Coder-32B-Instruct-GGUF/Qwen2.5-Coder-32B-Instruct-Q4_K_L.gguf -p "# FastAPI app for managing notes. Filenames are annotated as # relative/path/to/file.py\n\n#server/app.py\n" -e -ngl 999 -c 0 -t 4 -n 1024
~53% performance increase when using the draft model
Glad to hear this; it is pretty similar to ExLlamaV2.
The Qwen 2.5 model family is a good example for this as well: you can basically use the small 1.5B or even 0.5B model as the draft with the big 72B model and get an excellent boost.
I also ran some smaller-scale tests, which I wanted to share to bring some additional perspective (this is on an RTX 3060 with 12 GB of VRAM, so I can't fit models as large as those above):
./llama-cli -m llama3.1\:8b-instruct-q8_0 -p "I believe, in one sentence, that the meaning of life is" -ngl 33
llama_perf_sampler_print: sampling time = 601.84 ms / 913 runs ( 0.66 ms per token, 1517.01 tokens per second)
llama_perf_context_print: load time = 3575.03 ms
llama_perf_context_print: prompt eval time = 56.80 ms / 14 tokens ( 4.06 ms per token, 246.46 tokens per second)
llama_perf_context_print: eval time = 27636.04 ms / 898 runs ( 30.78 ms per token, 32.49 tokens per second)
llama_perf_context_print: total time = 29298.72 ms / 912 tokens
./llama-speculative -m llama3.1\:8b-instruct-q8_0 -ngl 33 -md llama3.2\:1b-instruct-q8_0 -ngld 17 -p "I believe, in one sentence, that the meaning of life is"
encoded 14 tokens in 0.054 seconds, speed: 258.250 t/s
decoded 673 tokens in 12.132 seconds, speed: 55.473 t/s
n_draft = 5
n_predict = 673
n_drafted = 730
n_accept = 526
accept = 72.055%
draft:
llama_perf_context_print: load time = 2130.39 ms
llama_perf_context_print: prompt eval time = 7303.87 ms / 305 tokens ( 23.95 ms per token, 41.76 tokens per second)
llama_perf_context_print: eval time = 3685.16 ms / 584 runs ( 6.31 ms per token, 158.47 tokens per second)
llama_perf_context_print: total time = 12191.61 ms / 889 tokens
target:
llama_perf_sampler_print: sampling time = 466.15 ms / 673 runs ( 0.69 ms per token, 1443.75 tokens per second)
llama_perf_context_print: load time = 3573.26 ms
llama_perf_context_print: prompt eval time = 4898.77 ms / 890 tokens ( 5.50 ms per token, 181.68 tokens per second)
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: total time = 14322.06 ms / 891 tokens
Which, if I read that correctly, bumps the speed from 32.49 t/s to 55.473 t/s (55.473 / 32.49 ≈ 1.71), hence a speedup of roughly 70% (excluding model loading times).
Edit: this was on 9fe0fb0.
Any progress on allowing speculative decoding in the server?
It's already supported; this issue just hasn't been closed yet:
https://github.com/ggerganov/llama.cpp/pull/10455