llama.cpp
server: stop generation at `n_ctx_train` if `n_predict` is not set
Context
If the model hallucinates (EOS-less generation), the server enters an infinite loop when n_predict is not set.
This is a misuse of the server or the model:
- https://github.com/ggerganov/llama.cpp/blob/master/examples/server/tests/features/wrong_usages.feature
- #3969
- https://github.com/ggerganov/llama.cpp/issues/6617#issuecomment-2051571015
But since it causes confusion, I propose to stop generation at the context size the model was trained with (n_ctx_train) when self-extend context is disabled.
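A minimal sketch of the proposed check, using simplified stand-ins for the server structures (the types below are illustrative only, not the actual server code; the real check would live in update_slots):

```cpp
// Sketch only: server_slot / server_context here are simplified stand-ins.
#include <cstdio>

struct server_slot {
    int n_predict = -1; // -1 means the client did not set n_predict
    int n_decoded = 0;  // tokens generated so far for this slot
};

struct server_context {
    int n_ctx_train = 2048; // context size the model was trained with
    int ga_n        = 1;    // self-extend factor; 1 means self-extend is disabled

    // Checked on every decode step for each slot.
    bool should_stop(const server_slot & slot) const {
        if (slot.n_predict < 0 && ga_n == 1 && slot.n_decoded >= n_ctx_train) {
            fprintf(stderr,
                    "warn: n_predict is not set and self-extend is disabled, "
                    "stopping at n_ctx_train=%d to avoid an EOS-less infinite loop\n",
                    n_ctx_train);
            return true;
        }
        return false;
    }
};

int main() {
    server_context ctx;
    server_slot    slot;
    slot.n_decoded = 2048;
    printf("stop generation: %s\n", ctx.should_stop(slot) ? "yes" : "no");
    return 0;
}
```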
Tests
server --hf-repo ggml-org/models --hf-file phi-2/ggml-model-q4_0.gguf --model phi-2-q4_0.gguf --parallel 1 --ctx-size 4096 --log-format text -ngl 33
curl http://localhost:8080/completion --data '{"prompt": "hallucinate"}'
WARN [ update_slots] n_predict is not set and self-context extend is disabled. Limiting generated tokens to n_ctx_train to avoid EOS-less generation infinite loop | tid="127864424759296" timestamp=1713526370 params.n_predict=-1 slot.n_predict=-1 slot.n_decoded=2048 n_slots=1 n_ctx=4096 n_ctx_train=2048 ga_n=1
Before we make the change, we should see if the update_slots error is really reproducible. If it is, it's a bug that we first need to fix. If we merge the change now, we might hide the underlying issue.
@ggerganov I have created a web application to stress-test the server and see how it handles multiple clients sending random questions and documents simultaneously. I tested it with four clients using mixtral 8x7B q8_0 on 3x RTX 3090 for one hour, and the server didn't encounter any issues.
master:
But the gg/flash-attn branch sometimes generates NaNs in the query tensor, which are not produced by the flash attention kernel (this happens before the flash attention kernel). I will keep investigating in the meantime.
Agreed. I have been running performance and capacity tests for over two months and there is no such bug. The server is stable and production-ready.
Alongside an optional cap, I think we should make the server stop generating when the connection is closed for whatever reason (clients may well have a timeout or interrupt things manually, but the server keeps going / stays busy needlessly). Maybe an interrupt check callback to call before generating each token?
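A sketch of what such a per-token interrupt check could look like; the callback type and the generation loop below are hypothetical placeholders, not the actual server API:

```cpp
#include <cstdio>
#include <functional>
#include <string>

// Hypothetical callback: returns true when generation should stop early,
// e.g. because the HTTP connection was closed or the client cancelled.
using interrupt_cb = std::function<bool()>;

std::string generate(int max_tokens, const interrupt_cb & should_stop) {
    std::string out;
    for (int i = 0; i < max_tokens; ++i) {
        if (should_stop && should_stop()) {
            break; // stop immediately and free the slot
        }
        out += "."; // stands in for decoding one real token
    }
    return out;
}

int main() {
    bool client_disconnected = false; // would be updated by the HTTP layer
    auto should_stop = [&]() { return client_disconnected; };
    printf("generated %zu tokens\n", generate(8, should_stop).size());
    return 0;
}
```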
Yeah, it is identified in:
- #6421
Before we make the change, we should see if the update_slots error is really reproducible.
We can conclude that the user was using an old version. That's it.
@ggerganov, finally, I would prefer not to go this way, but rather to stop the generation at n_ctx with a warning, instead of printing a warning each time n_predict is not set.
Ok. I'm not sure we have to do anything at this point - it seems the latest version works OK.
There should still be some limit to avoid getting into an infinite loop in the server.
@ggerganov @slaren please have a look at this proposal.
When this happens, the response of /completion has these fields:
"truncated": true,
"stopped_eos": false,
"stopped_word": false,
"stopped_limit": false,
I am not familiar with the meaning of each of these flags. Should this be different? Maybe stopped_limit should be true to indicate that a limit was hit?
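To illustrate the question, here is a sketch of how the flags could be reported when the n_ctx_train cap fires; the struct is a simplified stand-in for the server's real result object, and setting stopped_limit is only the suggested behavior, not what the PR currently does:

```cpp
#include <cstdio>

// Field names mirror the flags in the /completion response shown above.
struct completion_flags {
    bool truncated     = false; // generation or prompt was cut short
    bool stopped_eos   = false; // the model emitted its EOS token
    bool stopped_word  = false; // a stop string was matched
    bool stopped_limit = false; // a token limit was reached
};

// Possible flag values when generation is capped at n_ctx_train.
completion_flags on_train_ctx_cap() {
    completion_flags f;
    f.truncated     = true; // what the PR reports today
    f.stopped_limit = true; // suggested: a limit was indeed hit
    return f;
}

int main() {
    const completion_flags f = on_train_ctx_cap();
    printf("truncated=%d stopped_limit=%d\n", f.truncated, f.stopped_limit);
    return 0;
}
```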
Maybe it would be simpler to set n_predict to n_ctx_train by default if not set in the request.
Yeah, that was the first version, but I feel it is noisy to log this warning on each request: 6fd5ad5
That's not exactly what I mean. Basically I would just change the default to n_ctx_train (or other value) in the line that sets slot.params.n_predict = json_value(data, "n_predict", default_params.n_predict);. No need to print any warnings, just document it. The user can check stopped_limit or set a different limit if needed.
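A sketch of that default change, assuming a simplified json_value helper; the map-based json_like type below is a stand-in for the server's JSON handling, not the real code:

```cpp
#include <map>
#include <string>

using json_like = std::map<std::string, int>; // stand-in for the request JSON

// Simplified version of a json_value helper: return the value for `key`
// if present in the request, otherwise the provided default.
int json_value(const json_like & data, const std::string & key, int def) {
    const auto it = data.find(key);
    return it != data.end() ? it->second : def;
}

int main() {
    const int n_ctx_train = 2048;

    json_like request_without_limit;                        // n_predict not set
    json_like request_with_limit = { { "n_predict", 64 } }; // explicit client limit

    // Instead of defaulting to -1 (unbounded), default to n_ctx_train:
    const int a = json_value(request_without_limit, "n_predict", n_ctx_train); // 2048
    const int b = json_value(request_with_limit,    "n_predict", n_ctx_train); // 64

    return (a == n_ctx_train && b == 64) ? 0 : 1;
}
```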
I see, I am OK with both solutions, even if it will be a sort of breaking change to always set n_predict. AFAIK not all models hallucinate, and not on every completion; plus it should normally always emit the EOS token if the trained chat template is used in the chat completion endpoint.
@ggerganov up to you, but we need to address this recurring infinite-loop concern in some way.
📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 460 iterations 🚀
- Concurrent users: 8, duration: 10m
- HTTP request : avg=10248.24ms p(95)=27332.83ms fails=, finish reason: stop=410 truncated=50
- Prompt processing (pp): avg=118.34tk/s p(95)=535.67tk/s
- Token generation (tg): avg=24.58tk/s p(95)=38.21tk/s
- ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=hp/server/avoid-infinite-loop commit=6c257f4709b1848a8d7bd73daf95aec763e7a4f5
[Benchmark charts omitted: llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 460 iterations; time series of llamacpp:prompt_tokens_seconds, llamacpp:predicted_tokens_seconds, llamacpp:kv_cache_usage_ratio, and llamacpp:requests_processing.]
This would be simple if context shifting was opt-in; then there would always be a hard limit of n_ctx tokens. I am not sure that enabling context shift by default and without a way to disable it is a good idea: it can result in very poor generation quality and it is not the normal behavior for LLM providers, as far as I know.
Oh yes, and it is so slow in the current implementation, blocking the whole server.
@ggerganov up to you, but we need to address this recurring infinite-loop concern in some way.
@ggerganov I think with the removal of hard-coded stop tokens, this PR is becoming more important.
Maybe it would be simpler to set n_predict to n_ctx_train by default if not set in the request.
Yes, let's do that. Context-shift has to be refactored and become optional (in a future PR)
@ggerganov @slaren Finally, I prefer to keep checking at each token that we do not exceed n_ctx_train, because one can simply pass n_predict = -1 in the request payload and the server would still go into an infinite loop. I feel this approach is safer, with a proper warning.