Misc. bug: [SERVER] Multiple slots, generation speed is degraded after each generation/slot used
Name and Version
./build/bin/llama-server --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 3 CUDA devices:
Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
Device 1: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
Device 2: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
version: 4338 (7b1ec53f)
built with cc (Ubuntu 12.3.0-1ubuntu1~22.04) 12.3.0 for x86_64-linux-gnu
Operating systems
Linux
Which llama.cpp modules do you know to be affected?
llama-server
Problem description & steps to reproduce
Hello,
Short version:
When using llama-server with only one slot (--threads-http 1 -np 1), you can sequentially send prompts to process and there is no speed degradation.
When you use multiple slots (the issue starts showing up at 3 slots; it doesn't show up with 2), generation gets slower and slower after each finished generation.
CLI used:
./build/bin/llama-server --host 0.0.0.0 --port 8080 --model /opt/IdExtend/models/llm/Qwen2.5-7B-Instruct-Q4_K_M.gguf --ctx-size 122880 --threads-http 15 -np 15 --tensor-split 1.0,0.0,0.0 -ngl 99999
I also gave it a try with:
--cache-reuse 50000 : INEFFECTIVE
--defrag-thold 0.0 or --defrag-thold 0.99 : INEFFECTIVE
--model /opt/IdExtend/models/llm/Mistral-7B-Instruct-v0.3.Q8_0.gguf : INEFFECTIVE
-sm none : INEFFECTIVE
--flash-attn --cache-type-k q8_0 --cache-type-v q8_0 : INEFFECTIVE (was using these from the start but decided to reduce to as few args as possible)
Yes, I understand that having multiple slots and using them in sequence is dumb. The issue is that I tried moving my backend from sequential use to parallel (so I had to create slots), but it doesn't go any faster. That's why I tried tracking down the cause of the issue, and here I am.
Final run:
./build/bin/llama-server --host 0.0.0.0 --port 8080 --model /opt/IdExtend/models/llm/Qwen2.5-7B-Instruct-Q4_K_M.gguf --flash-attn --cache-type-k q8_0 --cache-type-v q8_0 --ctx-size 122880 --threads-http 15 -np 15 --tensor-split 1.0,0.0,0.0 -sm none -ngl 99999
Python script logs:
[0]Time taken: 1.1400415897369385
[1]Time taken: 0.9648196697235107
[2]Time taken: 1.002309799194336
[3]Time taken: 1.353079080581665
[4]Time taken: 0.8274390697479248
[5]Time taken: 1.4006707668304443
[6]Time taken: 1.5088953971862793
[7]Time taken: 2.5358529090881348
[8]Time taken: 1.6904234886169434
[9]Time taken: 2.6186017990112305
[10]Time taken: 2.290717601776123
[11]Time taken: 2.0220725536346436
[12]Time taken: 1.9455785751342773
[13]Time taken: 3.2140021324157715
[14]Time taken: 2.404296636581421
[15]Time taken: 2.5479960441589355
[16]Time taken: 3.0076818466186523
[17]Time taken: 6.665952205657959
TOTAL Time taken for sequential: 39.140857458114624
You can find a zip with the Python script to reproduce it attached.
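In case the attached zip is not handy, here is a minimal sketch of what the sequential part of the script does; the endpoint, prompt contents and token limits below are simplified assumptions, the real payloads are in the archive.

```python
# Minimal sketch of the sequential benchmark loop, assuming the
# OpenAI-compatible /v1/chat/completions route of llama-server.
# The real prompts/payloads are in the attached zip.
import time
import requests

URL = "http://localhost:8080/v1/chat/completions"  # assumed host/port
PROMPTS = [("Question %d: " % i) + "some long context ... " * 200 for i in range(18)]

total_start = time.time()
for i, prompt in enumerate(PROMPTS):
    start = time.time()
    requests.post(URL, json={
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 128,
    }, timeout=600)
    print(f"[{i}]Time taken: {time.time() - start}")
print(f"TOTAL Time taken for sequential: {time.time() - total_start}")
```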
Full server logs: server-logs.txt
Cleaned server logs:
prompt eval time = 254.31 ms / 2310 tokens ( 0.11 ms per token, 9083.40 tokens per second)
eval time = 879.65 ms / 97 tokens ( 9.07 ms per token, 110.27 tokens per second)
total time = 1133.96 ms / 2407 tokens
prompt eval time = 261.95 ms / 2343 tokens ( 0.11 ms per token, 8944.49 tokens per second)
eval time = 694.21 ms / 85 tokens ( 8.17 ms per token, 122.44 tokens per second)
total time = 956.16 ms / 2428 tokens
prompt eval time = 284.46 ms / 2285 tokens ( 0.12 ms per token, 8032.76 tokens per second)
eval time = 707.39 ms / 80 tokens ( 8.84 ms per token, 113.09 tokens per second)
total time = 991.85 ms / 2365 tokens
prompt eval time = 409.38 ms / 2924 tokens ( 0.14 ms per token, 7142.46 tokens per second)
eval time = 930.37 ms / 95 tokens ( 9.79 ms per token, 102.11 tokens per second)
total time = 1339.75 ms / 3019 tokens
prompt eval time = 357.83 ms / 2282 tokens ( 0.16 ms per token, 6377.29 tokens per second)
eval time = 454.73 ms / 44 tokens ( 10.33 ms per token, 96.76 tokens per second)
total time = 812.57 ms / 2326 tokens
prompt eval time = 388.00 ms / 2277 tokens ( 0.17 ms per token, 5868.57 tokens per second)
eval time = 996.40 ms / 89 tokens ( 11.20 ms per token, 89.32 tokens per second)
total time = 1384.39 ms / 2366 tokens
prompt eval time = 556.35 ms / 3011 tokens ( 0.18 ms per token, 5412.09 tokens per second)
eval time = 930.15 ms / 76 tokens ( 12.24 ms per token, 81.71 tokens per second)
total time = 1486.50 ms / 3087 tokens
prompt eval time = 618.16 ms / 3027 tokens ( 0.20 ms per token, 4896.82 tokens per second)
eval time = 1890.54 ms / 144 tokens ( 13.13 ms per token, 76.17 tokens per second)
total time = 2508.70 ms / 3171 tokens
prompt eval time = 651.99 ms / 2935 tokens ( 0.22 ms per token, 4501.60 tokens per second)
eval time = 1008.49 ms / 72 tokens ( 14.01 ms per token, 71.39 tokens per second)
total time = 1660.48 ms / 3007 tokens
prompt eval time = 903.68 ms / 2957 tokens ( 0.31 ms per token, 3272.17 tokens per second)
eval time = 1681.54 ms / 112 tokens ( 15.01 ms per token, 66.61 tokens per second)
total time = 2585.22 ms / 3069 tokens
prompt eval time = 805.01 ms / 2965 tokens ( 0.27 ms per token, 3683.17 tokens per second)
eval time = 1447.53 ms / 91 tokens ( 15.91 ms per token, 62.87 tokens per second)
total time = 2252.55 ms / 3056 tokens
prompt eval time = 831.70 ms / 2965 tokens ( 0.28 ms per token, 3564.97 tokens per second)
eval time = 1149.78 ms / 69 tokens ( 16.66 ms per token, 60.01 tokens per second)
total time = 1981.48 ms / 3034 tokens
prompt eval time = 996.94 ms / 2940 tokens ( 0.34 ms per token, 2949.01 tokens per second)
eval time = 905.74 ms / 52 tokens ( 17.42 ms per token, 57.41 tokens per second)
total time = 1902.69 ms / 2992 tokens
prompt eval time = 960.80 ms / 3074 tokens ( 0.31 ms per token, 3199.42 tokens per second)
eval time = 2201.62 ms / 118 tokens ( 18.66 ms per token, 53.60 tokens per second)
total time = 3162.42 ms / 3192 tokens
prompt eval time = 1161.53 ms / 2977 tokens ( 0.39 ms per token, 2562.99 tokens per second)
eval time = 1189.15 ms / 62 tokens ( 19.18 ms per token, 52.14 tokens per second)
total time = 2350.68 ms / 3039 tokens
prompt eval time = 1017.35 ms / 2934 tokens ( 0.35 ms per token, 2883.97 tokens per second)
eval time = 1481.01 ms / 76 tokens ( 19.49 ms per token, 51.32 tokens per second)
total time = 2498.35 ms / 3010 tokens
prompt eval time = 1035.18 ms / 2966 tokens ( 0.35 ms per token, 2865.20 tokens per second)
eval time = 1915.50 ms / 97 tokens ( 19.75 ms per token, 50.64 tokens per second)
total time = 2950.68 ms / 3063 tokens
prompt eval time = 638.59 ms / 1778 tokens ( 0.36 ms per token, 2784.25 tokens per second)
eval time = 5996.03 ms / 303 tokens ( 19.79 ms per token, 50.53 tokens per second)
total time = 6634.62 ms / 2081 tokens
First Bad Commit
No response
Relevant log output
No response
Edit:
I gave it a try on another machine with this build:
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 3 CUDA devices:
Device 0: GRID A100-40C, compute capability 8.0, VMM: no
Device 1: GRID A100-40C, compute capability 8.0, VMM: no
Device 2: GRID A100-40C, compute capability 8.0, VMM: no
version: 4149 (1bb30bf2)
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
Issue persists
Edit: I'm performing a binary search
-> version: 4149 (1bb30bf2) fail ❌
-> version: 4063 (505f3327) fail ❌
-> version: 4024 (329ed914) fail ❌
-> version: 4016 (42cadc74) fail ❌
-> version: 4015 (45950415) no issue ✔️
-> version: 4012 (7554aa46) no issue ✔️
Related PR ~~causing~~ introducing the issue: https://github.com/ggerganov/llama.cpp/pull/10126
I doubt it CREATED the bug, I think it just revealed the existing bug.
The more slots are used, the slower it gets.
Does the total throughput still increase when you use more slots?
Hello,
The Python code was edited like this (the Python code is in the archive posted in the first message):
runParallel()
runSequential()
runParallel()
runSequential()
(ran twice to fill/glitch/whatever all slots)
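Roughly, the two helpers look like this; a simplified sketch, the real prompts, thread count and payloads are in the archived script:

```python
# Simplified sketch of the two helpers; assumes the same llama-server
# /v1/chat/completions endpoint as in the earlier sketch.
import time
import requests
from concurrent.futures import ThreadPoolExecutor

URL = "http://localhost:8080/v1/chat/completions"
PROMPTS = [("Question %d: " % i) + "some long context ... " * 200 for i in range(15)]

def send(prompt):
    requests.post(URL, json={
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 128,
    }, timeout=600)

def runSequential():
    start = time.time()
    for p in PROMPTS:
        send(p)
    print("Time taken for sequential:", time.time() - start)

def runParallel():
    start = time.time()
    with ThreadPoolExecutor(max_workers=len(PROMPTS)) as pool:
        list(pool.map(send, PROMPTS))
    print("Time taken for parallel:", time.time() - start)

runParallel()
runSequential()
runParallel()
runSequential()
```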
When the server is started with 15 slots:
Time taken for parallel: 19.48640537261963
Time taken for sequential: 47.198315382003784
Time taken for parallel: 24.64862847328186
Time taken for sequential: 46.10104751586914
When the server is run with only 1 slot:
Time taken for parallel: 27.321959018707275
Time taken for sequential: 18.821592569351196
Time taken for parallel: 17.11741614341736
Time taken for sequential: 17.89683699607849
To answer your question:
No, a parallel run on a multi-slot server is not faster than a sequential run on a single-slot server. Yes, a parallel run on a multi-slot server is faster than a sequential run on a multi-slot server, because it uses different slots instead of always reusing slot 0.
Sequential speeds are expected to be the same in single- and multi-slot configurations, but a glitch prevents that.
Parallel speed on a multi-slot server is expected to be faster than sequential speed on a single-slot server.
A 1-slot server is always faster than any other slot configuration.
Bonus run, as I didn't see the bug on a 2-slot server:
Time taken for parallel: 15.20603609085083
Time taken for sequential: 20.369807481765747
Time taken for parallel: 15.667339563369751
Time taken for sequential: 19.47208571434021
Context: I did a lot of work on CUDA performance with a focus on a single user/slot. So far I did not prioritize throughput for multiple users/slots. I'm currently working on llama.cpp training/finetuning though and will eventually require more throughput for evaluating model quality post finetuning. So I will likely look into better server throughput in a few months time. I cannot speak for the priorities of other devs though.
But since you nailed down the problem to a specific commit there is a good chance that it can be fixed. I just meant to say more generally that in the future there will likely be more dev attention on server throughput.
There is some confusion: it's not this commit that created the bug, it's this commit that made it easy to reveal, because before it only the first slot was being used. As soon as you use more slots (even before that commit), performance was going down.
> Context: I did a lot of work on CUDA performance with a focus on a single user/slot. So far I did not prioritize throughput for multiple users/slots. I'm currently working on llama.cpp training/finetuning though and will eventually require more throughput for evaluating model quality post finetuning. So I will likely look into better server throughput in a few months time. I cannot speak for the priorities of other devs though.
I could be terribly wrong, but isn't batch processing supposed to provide higher total tokens/s throughput than a single slot?
For example, with one prompt you will get 150 t/s, but with 10 prompts you will get 100 t/s per prompt, which implies 1000 t/s in total, which is faster at the end of the day, right?
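Back-of-the-envelope sketch, using the made-up per-slot numbers from the example above:

```python
# Made-up numbers from the example above, just to spell out the expectation.
single_slot_tps = 150        # t/s with one prompt in flight
per_slot_tps = 100           # t/s per prompt with 10 prompts in flight
aggregate = 10 * per_slot_tps
print(aggregate, ">", single_slot_tps)  # 1000 > 150: higher total throughput
```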
This is expected - it's a side effect of the unified KV cache. Effectively, all slots keep their context in the common context, so with each request the KV cache grows. This will be fixed after we refactor the implementation to support a parallel-friendly KV cache. For now, don't use more than 4 slots and use -dt 0.1.
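A launch line following that suggestion might look roughly like this (a trimmed version of the final run from the first message, with the slot count reduced to 4 and -dt 0.1 added; not a verified configuration):
./build/bin/llama-server --host 0.0.0.0 --port 8080 --model /opt/IdExtend/models/llm/Qwen2.5-7B-Instruct-Q4_K_M.gguf --flash-attn --ctx-size 122880 --threads-http 4 -np 4 -dt 0.1 --tensor-split 1.0,0.0,0.0 -sm none -ngl 99999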
Hello, and thanks for the answer, Ggerganov.
For later reference, I ran more tests:
A 4-slot server + short prompt + low n_predict leads to no issue and everything works well.
As soon as I use a longer prompt, more than 4 slots, or a higher n_predict, performance goes down.
I also ran into the problem of speed degradation when using multiple slots. Is it possible, as a temporary solution, to clear the KV cache after the slot finishes generating? This could be added as an additional parameter.
The problem is we don't know when the slot has finished.
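For anyone who wants to experiment anyway: if I read the server README correctly there is a per-slot erase action on the /slots endpoint; treat the exact route and whether it needs to be enabled as assumptions. This is an untested sketch that simply erases every slot between batches rather than guessing which one just finished.

```python
# Untested sketch: erase every slot's KV cache between batches, since we
# cannot reliably tell which slot served the last request.
# Assumes the server exposes GET /slots and POST /slots/{id}?action=erase
# (check the server README / flags for your build).
import requests

BASE = "http://localhost:8080"

def erase_all_slots():
    for slot in requests.get(f"{BASE}/slots").json():
        requests.post(f"{BASE}/slots/{slot['id']}", params={"action": "erase"})

# e.g. call erase_all_slots() after a batch of requests has completed
```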
Stale but not fixed, AFAIK! (Just in case the bot auto-closes the issue.)
#11213 will be a major step towards resolving this. It's one of the higher priorities now.
This issue was closed because it has been inactive for 14 days since being marked as stale.
Bad bot!
I didn't take the time to re-run the tests, but I doubt it was fixed in #12181 as it was only a refactor. Is it possible to reopen?
Thanks! Great job! Is there any progress toward a solution to this problem? I have the same problem: generation speed is degraded after each generation, even when I run inference like this:
./build/bin/llama-server -m ../models/models--bartowski--Qwen2.5-14B-Instruct-1M-GGUF/snapshots/f9f82825ed669910c9083619190a2931eec1c980/Qwen2.5-14B-Instruct-1M-Q4_K_M.gguf --host 0.0.0.0 --port 10050 --flash-attn --chat-template chatml --parallel 4 --seed 100 --ctx-size 65536 -ngl 9999 --cont-batching --threads-http 16 --split-mode layer
It would be kind if anyone could tell me how to solve this problem.
Or has anybody found a temporary solution? I feel really troubled because of this issue.
No fix on my end. As far as I know, vLLM doesn't suffer from this specific issue, but it lacks proper Blackwell support and proper GGUF support, and its speed is much lower in single-user / split-GPU mode.
https://github.com/ggml-org/llama.cpp/pull/12799 could be another step toward fixing it, but not going to lie, I don't really understand anything about that.
Beep boop, not stale!
Should be fixed in #14363
HOORAYYY