Misc. bug: [SERVER] Multiple slots, generation speed is degraded after each generation/slot used
Name and Version
./build/bin/llama-server --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 3 CUDA devices:
Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
Device 1: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
Device 2: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
version: 4338 (7b1ec53f)
built with cc (Ubuntu 12.3.0-1ubuntu1~22.04) 12.3.0 for x86_64-linux-gnu
Operating systems
Linux
Which llama.cpp modules do you know to be affected?
llama-server
Problem description & steps to reproduce
Hello,
Short version:
When using llama-server with only one slot (--threads-http 1 -np 1), you can sequentially send prompts to process and there is no speed degradation.
When you use multiple slots (the issue starts showing up at 3 slots; it doesn't show up with 2), generation gets slower and slower after each finished generation.
CLI used:
./build/bin/llama-server --host 0.0.0.0 --port 8080 --model /opt/IdExtend/models/llm/Qwen2.5-7B-Instruct-Q4_K_M.gguf --ctx-size 122880 --threads-http 15 -np 15 --tensor-split 1.0,0.0,0.0 -ngl 99999
I also gave it a try with:
--cache-reuse 50000 : INEFFECTIVE
--defrag-thold 0.0 or --defrag-thold 0.99 : INEFFECTIVE
--model /opt/IdExtend/models/llm/Mistral-7B-Instruct-v0.3.Q8_0.gguf : INEFFECTIVE
-sm none : INEFFECTIVE
--flash-attn --cache-type-k q8_0 --cache-type-v q8_0 : INEFFECTIVE (was using these from the start but decided to reduce to as few args as possible)
Yes, I understand that having multiple slots and using them in sequence is dumb. The issue is that I tried moving my backend from sequential use to parallel (so I had to create slots), but it doesn't go any faster. That's why I tried tracking down the cause of the issue, and here I am.
Final run:
./build/bin/llama-server --host 0.0.0.0 --port 8080 --model /opt/IdExtend/models/llm/Qwen2.5-7B-Instruct-Q4_K_M.gguf --flash-attn --cache-type-k q8_0 --cache-type-v q8_0 --ctx-size 122880 --threads-http 15 -np 15 --tensor-split 1.0,0.0,0.0 -sm none -ngl 99999
Python script logs:
[0]Time taken: 1.1400415897369385
[1]Time taken: 0.9648196697235107
[2]Time taken: 1.002309799194336
[3]Time taken: 1.353079080581665
[4]Time taken: 0.8274390697479248
[5]Time taken: 1.4006707668304443
[6]Time taken: 1.5088953971862793
[7]Time taken: 2.5358529090881348
[8]Time taken: 1.6904234886169434
[9]Time taken: 2.6186017990112305
[10]Time taken: 2.290717601776123
[11]Time taken: 2.0220725536346436
[12]Time taken: 1.9455785751342773
[13]Time taken: 3.2140021324157715
[14]Time taken: 2.404296636581421
[15]Time taken: 2.5479960441589355
[16]Time taken: 3.0076818466186523
[17]Time taken: 6.665952205657959
TOTAL Time taken for sequential: 39.140857458114624
You can find a zip with the Python script to reproduce it attached.
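In case the attached zip is not handy, here is a minimal sketch of what the sequential part of the script does; the endpoint, prompt contents and token limits below are simplified assumptions, the real payloads are in the archive.

```python
# Minimal sketch of the sequential benchmark loop, assuming the
# OpenAI-compatible /v1/chat/completions route of llama-server.
# The real prompts/payloads are in the attached zip.
import time
import requests

URL = "http://localhost:8080/v1/chat/completions"  # assumed host/port
PROMPTS = [("Question %d: " % i) + "some long context ... " * 200 for i in range(18)]

total_start = time.time()
for i, prompt in enumerate(PROMPTS):
    start = time.time()
    requests.post(URL, json={
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 128,
    }, timeout=600)
    print(f"[{i}]Time taken: {time.time() - start}")
print(f"TOTAL Time taken for sequential: {time.time() - total_start}")
```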
Full server logs: server-logs.txt
Cleaned server logs:
prompt eval time = 254.31 ms / 2310 tokens ( 0.11 ms per token, 9083.40 tokens per second)
eval time = 879.65 ms / 97 tokens ( 9.07 ms per token, 110.27 tokens per second)
total time = 1133.96 ms / 2407 tokens
prompt eval time = 261.95 ms / 2343 tokens ( 0.11 ms per token, 8944.49 tokens per second)
eval time = 694.21 ms / 85 tokens ( 8.17 ms per token, 122.44 tokens per second)
total time = 956.16 ms / 2428 tokens
prompt eval time = 284.46 ms / 2285 tokens ( 0.12 ms per token, 8032.76 tokens per second)
eval time = 707.39 ms / 80 tokens ( 8.84 ms per token, 113.09 tokens per second)
total time = 991.85 ms / 2365 tokens
prompt eval time = 409.38 ms / 2924 tokens ( 0.14 ms per token, 7142.46 tokens per second)
eval time = 930.37 ms / 95 tokens ( 9.79 ms per token, 102.11 tokens per second)
total time = 1339.75 ms / 3019 tokens
prompt eval time = 357.83 ms / 2282 tokens ( 0.16 ms per token, 6377.29 tokens per second)
eval time = 454.73 ms / 44 tokens ( 10.33 ms per token, 96.76 tokens per second)
total time = 812.57 ms / 2326 tokens
prompt eval time = 388.00 ms / 2277 tokens ( 0.17 ms per token, 5868.57 tokens per second)
eval time = 996.40 ms / 89 tokens ( 11.20 ms per token, 89.32 tokens per second)
total time = 1384.39 ms / 2366 tokens
prompt eval time = 556.35 ms / 3011 tokens ( 0.18 ms per token, 5412.09 tokens per second)
eval time = 930.15 ms / 76 tokens ( 12.24 ms per token, 81.71 tokens per second)
total time = 1486.50 ms / 3087 tokens
prompt eval time = 618.16 ms / 3027 tokens ( 0.20 ms per token, 4896.82 tokens per second)
eval time = 1890.54 ms / 144 tokens ( 13.13 ms per token, 76.17 tokens per second)
total time = 2508.70 ms / 3171 tokens
prompt eval time = 651.99 ms / 2935 tokens ( 0.22 ms per token, 4501.60 tokens per second)
eval time = 1008.49 ms / 72 tokens ( 14.01 ms per token, 71.39 tokens per second)
total time = 1660.48 ms / 3007 tokens
prompt eval time = 903.68 ms / 2957 tokens ( 0.31 ms per token, 3272.17 tokens per second)
eval time = 1681.54 ms / 112 tokens ( 15.01 ms per token, 66.61 tokens per second)
total time = 2585.22 ms / 3069 tokens
prompt eval time = 805.01 ms / 2965 tokens ( 0.27 ms per token, 3683.17 tokens per second)
eval time = 1447.53 ms / 91 tokens ( 15.91 ms per token, 62.87 tokens per second)
total time = 2252.55 ms / 3056 tokens
prompt eval time = 831.70 ms / 2965 tokens ( 0.28 ms per token, 3564.97 tokens per second)
eval time = 1149.78 ms / 69 tokens ( 16.66 ms per token, 60.01 tokens per second)
total time = 1981.48 ms / 3034 tokens
prompt eval time = 996.94 ms / 2940 tokens ( 0.34 ms per token, 2949.01 tokens per second)
eval time = 905.74 ms / 52 tokens ( 17.42 ms per token, 57.41 tokens per second)
total time = 1902.69 ms / 2992 tokens
prompt eval time = 960.80 ms / 3074 tokens ( 0.31 ms per token, 3199.42 tokens per second)
eval time = 2201.62 ms / 118 tokens ( 18.66 ms per token, 53.60 tokens per second)
total time = 3162.42 ms / 3192 tokens
prompt eval time = 1161.53 ms / 2977 tokens ( 0.39 ms per token, 2562.99 tokens per second)
eval time = 1189.15 ms / 62 tokens ( 19.18 ms per token, 52.14 tokens per second)
total time = 2350.68 ms / 3039 tokens
prompt eval time = 1017.35 ms / 2934 tokens ( 0.35 ms per token, 2883.97 tokens per second)
eval time = 1481.01 ms / 76 tokens ( 19.49 ms per token, 51.32 tokens per second)
total time = 2498.35 ms / 3010 tokens
prompt eval time = 1035.18 ms / 2966 tokens ( 0.35 ms per token, 2865.20 tokens per second)
eval time = 1915.50 ms / 97 tokens ( 19.75 ms per token, 50.64 tokens per second)
total time = 2950.68 ms / 3063 tokens
prompt eval time = 638.59 ms / 1778 tokens ( 0.36 ms per token, 2784.25 tokens per second)
eval time = 5996.03 ms / 303 tokens ( 19.79 ms per token, 50.53 tokens per second)
total time = 6634.62 ms / 2081 tokens
First Bad Commit
No response
Relevant log output
No response
Edit:
I gave it a try on another machine with this build:
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 3 CUDA devices:
Device 0: GRID A100-40C, compute capability 8.0, VMM: no
Device 1: GRID A100-40C, compute capability 8.0, VMM: no
Device 2: GRID A100-40C, compute capability 8.0, VMM: no
version: 4149 (1bb30bf2)
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
Issue persists
Edit: I'm performing a binary search
-> version: 4149 (1bb30bf2) fail ❌
-> version: 4063 (505f3327) fail ❌
-> version: 4024 (329ed914) fail ❌
-> version: 4016 (42cadc74) fail ❌
-> version: 4015 (45950415) no issue ✔️
-> version: 4012 (7554aa46) no issue ✔️
Related PR ~~causing~~ introducing the issue: https://github.com/ggerganov/llama.cpp/pull/10126
I doubt it CREATED the bug, I think it just revealed the existing bug.
The more slots are used, the slower it gets.
Does the total throughput still increase when you use more slots?
Hello,
The Python code was edited like this (the Python code is in the archive posted in the first message):
runParallel()
runSequential()
runParallel()
runSequential()
(ran twice to fill/glitch/whatever all slots)
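Roughly, the two helpers look like this; a simplified sketch, the real prompts, thread count and payloads are in the archived script:

```python
# Simplified sketch of the two helpers; assumes the same llama-server
# /v1/chat/completions endpoint as in the earlier sketch.
import time
import requests
from concurrent.futures import ThreadPoolExecutor

URL = "http://localhost:8080/v1/chat/completions"
PROMPTS = [("Question %d: " % i) + "some long context ... " * 200 for i in range(15)]

def send(prompt):
    requests.post(URL, json={
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 128,
    }, timeout=600)

def runSequential():
    start = time.time()
    for p in PROMPTS:
        send(p)
    print("Time taken for sequential:", time.time() - start)

def runParallel():
    start = time.time()
    with ThreadPoolExecutor(max_workers=len(PROMPTS)) as pool:
        list(pool.map(send, PROMPTS))
    print("Time taken for parallel:", time.time() - start)

runParallel()
runSequential()
runParallel()
runSequential()
```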
When the server is started with 15 slots:
Time taken for parallel: 19.48640537261963
Time taken for sequential: 47.198315382003784
Time taken for parallel: 24.64862847328186
Time taken for sequential: 46.10104751586914
When the server is run with only 1 slot:
Time taken for parallel: 27.321959018707275
Time taken for sequential: 18.821592569351196
Time taken for parallel: 17.11741614341736
Time taken for sequential: 17.89683699607849
To answer your question:
No, a parallel run on a multi-slot server is not faster than a sequential run on a single-slot server. Yes, a parallel run on a multi-slot server is faster than a sequential run on a multi-slot server, because it uses different slots instead of always reusing slot 0.
Sequential speeds are expected to be the same in single- and multi-slot configurations, but a glitch prevents that.
Parallel speed on a multi-slot server is expected to be faster than sequential speed on a single-slot server.
A 1-slot server is always faster than any other slot configuration.
Bonus run, as I didn't see the bug on a 2-slot server:
Time taken for parallel: 15.20603609085083
Time taken for sequential: 20.369807481765747
Time taken for parallel: 15.667339563369751
Time taken for sequential: 19.47208571434021
Context: I did a lot of work on CUDA performance with a focus on a single user/slot. So far I did not prioritize throughput for multiple users/slots. I'm currently working on llama.cpp training/finetuning though and will eventually require more throughput for evaluating model quality post finetuning. So I will likely look into better server throughput in a few months time. I cannot speak for the priorities of other devs though.
But since you nailed down the problem to a specific commit there is a good chance that it can be fixed. I just meant to say more generally that in the future there will likely be more dev attention on server throughput.
There is some confusion: it's not this commit that created the bug, it's this commit that made it easy to reveal, because before it only the first slot was being used. As soon as you use more slots (even before that commit), performance was going down.
> Context: I did a lot of work on CUDA performance with a focus on a single user/slot. So far I did not prioritize throughput for multiple users/slots. I'm currently working on llama.cpp training/finetuning though and will eventually require more throughput for evaluating model quality post finetuning. So I will likely look into better server throughput in a few months time. I cannot speak for the priorities of other devs though.
I could be terribly wrong, but isn't batch processing supposed to provide higher total tokens/s throughput than a single slot?
For example, with one prompt you will get 150 t/s, but with 10 prompts you will get 100 t/s per prompt, which implies 1000 t/s in total, which is faster at the end of the day, right?
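Back-of-the-envelope sketch, using the made-up per-slot numbers from the example above:

```python
# Made-up numbers from the example above, just to spell out the expectation.
single_slot_tps = 150        # t/s with one prompt in flight
per_slot_tps = 100           # t/s per prompt with 10 prompts in flight
aggregate = 10 * per_slot_tps
print(aggregate, ">", single_slot_tps)  # 1000 > 150: higher total throughput
```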
This is expected - it's a side effect of the unified KV cache. Effectively, all slots keep their context in the common context, so with each request the KV cache grows. This will be fixed after we refactor the implementation to support a parallel-friendly KV cache. For now, don't use more than 4 slots and use -dt 0.1.
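A launch line following that suggestion might look roughly like this (a trimmed version of the final run from the first message, with the slot count reduced to 4 and -dt 0.1 added; not a verified configuration):
./build/bin/llama-server --host 0.0.0.0 --port 8080 --model /opt/IdExtend/models/llm/Qwen2.5-7B-Instruct-Q4_K_M.gguf --flash-attn --ctx-size 122880 --threads-http 4 -np 4 -dt 0.1 --tensor-split 1.0,0.0,0.0 -sm none -ngl 99999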
Hello, and thanks for the answer, Ggerganov.
For later reference, I ran more tests:
A 4-slot server + short prompt + low n_predict leads to no issue and everything works well.
As soon as I use a longer prompt, more than 4 slots, or a higher n_predict, performance goes down.
I also ran into the problem of speed degradation when using multiple slots. Is it possible, as a temporary solution, to clear the KV cache after the slot finishes generating? This could be added as an additional parameter.
The problem is we don't know when the slot has finished.
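For anyone who wants to experiment anyway: if I read the server README correctly there is a per-slot erase action on the /slots endpoint; treat the exact route and whether it needs to be enabled as assumptions. This is an untested sketch that simply erases every slot between batches rather than guessing which one just finished.

```python
# Untested sketch: erase every slot's KV cache between batches, since we
# cannot reliably tell which slot served the last request.
# Assumes the server exposes GET /slots and POST /slots/{id}?action=erase
# (check the server README / flags for your build).
import requests

BASE = "http://localhost:8080"

def erase_all_slots():
    for slot in requests.get(f"{BASE}/slots").json():
        requests.post(f"{BASE}/slots/{slot['id']}", params={"action": "erase"})

# e.g. call erase_all_slots() after a batch of requests has completed
```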
Stale but not fixed, AFAIK! (Just in case the bot auto-closes the issue.)
#11213 will be a major step towards resolving this. It's one of the higher priorities now.
This issue was closed because it has been inactive for 14 days since being marked as stale.
Bad bot!
I didn't take the time to re-run the tests, but I doubt it was fixed in #12181 as it was only a refactor. Is it possible to reopen?
Thanks! Great job! Is there any progress toward a solution to this problem? I have the same problem: generation speed is degraded after each generation, even when I run inference like this:
./build/bin/llama-server -m ../models/models--bartowski--Qwen2.5-14B-Instruct-1M-GGUF/snapshots/f9f82825ed669910c9083619190a2931eec1c980/Qwen2.5-14B-Instruct-1M-Q4_K_M.gguf --host 0.0.0.0 --port 10050 --flash-attn --chat-template chatml --parallel 4 --seed 100 --ctx-size 65536 -ngl 9999 --cont-batching --threads-http 16 --split-mode layer
It would be kind if anyone could tell me how to solve this problem.
Or has anybody found a temporary solution? I feel really troubled because of this issue.
No fix on my end. As far as I know, vLLM doesn't suffer from this specific issue, but it lacks proper Blackwell support and proper GGUF support, and its speed is much lower in single-user / split-GPU mode.
https://github.com/ggml-org/llama.cpp/pull/12799 could be another step toward fixing it, but not going to lie, I don't really understand anything about that.
Beep boop, not stale!
Should be fixed in #14363
HOORAYYY