llama.cpp
server: stop generation at `n_ctx_train` if `n_predict` is not set
Context
If the model hallucinates (EOS-less generation), the server enters an infinite loop when n_predict is not set.
This is a misuse of the server or the model:
- https://github.com/ggerganov/llama.cpp/blob/master/examples/server/tests/features/wrong_usages.feature
- #3969
- https://github.com/ggerganov/llama.cpp/issues/6617#issuecomment-2051571015
But since it causes confusion, I propose to stop generation at the context size the model was trained with (n_ctx_train) when self-extend context is disabled.
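A minimal sketch of the proposed check, using simplified stand-ins for the server structures (the types below are illustrative only, not the actual server code; the real check would live in update_slots):

```cpp
// Sketch only: server_slot / server_context here are simplified stand-ins.
#include <cstdio>

struct server_slot {
    int n_predict = -1; // -1 means the client did not set n_predict
    int n_decoded = 0;  // tokens generated so far for this slot
};

struct server_context {
    int n_ctx_train = 2048; // context size the model was trained with
    int ga_n        = 1;    // self-extend factor; 1 means self-extend is disabled

    // Checked on every decode step for each slot.
    bool should_stop(const server_slot & slot) const {
        if (slot.n_predict < 0 && ga_n == 1 && slot.n_decoded >= n_ctx_train) {
            fprintf(stderr,
                    "warn: n_predict is not set and self-extend is disabled, "
                    "stopping at n_ctx_train=%d to avoid an EOS-less infinite loop\n",
                    n_ctx_train);
            return true;
        }
        return false;
    }
};

int main() {
    server_context ctx;
    server_slot    slot;
    slot.n_decoded = 2048;
    printf("stop generation: %s\n", ctx.should_stop(slot) ? "yes" : "no");
    return 0;
}
```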
Tests
server --hf-repo ggml-org/models --hf-file phi-2/ggml-model-q4_0.gguf --model phi-2-q4_0.gguf --parallel 1 --ctx-size 4096 --log-format text -ngl 33
curl http://localhost:8080/completion --data '{"prompt": "hallucinate"}'
WARN [ update_slots] n_predict is not set and self-context extend is disabled. Limiting generated tokens to n_ctx_train to avoid EOS-less generation infinite loop | tid="127864424759296" timestamp=1713526370 params.n_predict=-1 slot.n_predict=-1 slot.n_decoded=2048 n_slots=1 n_ctx=4096 n_ctx_train=2048 ga_n=1
Before we make the change, we should see if the update_slots error is really reproducible. If it is, it's a bug that we first need to fix. If we merge the change now, we might hide the underlying issue.
@ggerganov I have created a web application to stress-test the server and see how it handles multiple clients sending random questions and documents simultaneously. I tested it with four clients using mixtral 8x7B q8_0 on 3x RTX 3090 for one hour, and the server didn't encounter any issues.
master:
But the gg/flash-attn branch sometimes generates NaNs in the query tensor, which are not produced by the flash attention kernel (this happens before the flash attention kernel). I will keep investigating in the meantime.
Agreed. I have been running performance and capacity tests for over two months and there is no such bug. The server is stable and production-ready.
Alongside an optional cap, I think we should make the server stop generating when the connection is closed for whatever reason (clients may well have a timeout or interrupt things manually, but the server keeps going / stays busy needlessly). Maybe an interrupt check callback to call before generating each token?
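A sketch of what such a per-token interrupt check could look like; the callback type and the generation loop below are hypothetical placeholders, not the actual server API:

```cpp
#include <cstdio>
#include <functional>
#include <string>

// Hypothetical callback: returns true when generation should stop early,
// e.g. because the HTTP connection was closed or the client cancelled.
using interrupt_cb = std::function<bool()>;

std::string generate(int max_tokens, const interrupt_cb & should_stop) {
    std::string out;
    for (int i = 0; i < max_tokens; ++i) {
        if (should_stop && should_stop()) {
            break; // stop immediately and free the slot
        }
        out += "."; // stands in for decoding one real token
    }
    return out;
}

int main() {
    bool client_disconnected = false; // would be updated by the HTTP layer
    auto should_stop = [&]() { return client_disconnected; };
    printf("generated %zu tokens\n", generate(8, should_stop).size());
    return 0;
}
```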
Yeah, it is identified in:
- #6421
Before we make the change, we should see if the update_slots error is really reproducible.
We can conclude that the user was using an old version. That's it.
@ggerganov, finally, I would prefer not to go this way, but rather to stop the generation at n_ctx with a warning, instead of printing a warning each time n_predict is not set.
Ok. I'm not sure we have to do anything at this point - it seems the latest version works OK.
There should still be some limit to avoid getting into an infinite loop in the server.
@ggerganov @slaren please have a look at this proposal.
When this happens, the response of /completion has these fields:
"truncated": true,
"stopped_eos": false,
"stopped_word": false,
"stopped_limit": false,
I am not familiar with the meaning of each of these flags. Should this be different? Maybe stopped_limit should be true to indicate that a limit was hit?
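To illustrate the question, here is a sketch of how the flags could be reported when the n_ctx_train cap fires; the struct is a simplified stand-in for the server's real result object, and setting stopped_limit is only the suggested behavior, not what the PR currently does:

```cpp
#include <cstdio>

// Field names mirror the flags in the /completion response shown above.
struct completion_flags {
    bool truncated     = false; // generation or prompt was cut short
    bool stopped_eos   = false; // the model emitted its EOS token
    bool stopped_word  = false; // a stop string was matched
    bool stopped_limit = false; // a token limit was reached
};

// Possible flag values when generation is capped at n_ctx_train.
completion_flags on_train_ctx_cap() {
    completion_flags f;
    f.truncated     = true; // what the PR reports today
    f.stopped_limit = true; // suggested: a limit was indeed hit
    return f;
}

int main() {
    const completion_flags f = on_train_ctx_cap();
    printf("truncated=%d stopped_limit=%d\n", f.truncated, f.stopped_limit);
    return 0;
}
```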
Maybe it would be simpler to set n_predict to n_ctx_train by default if not set in the request.
Yeah, that was the first version, but I feel it is noisy to log this warning on each request: 6fd5ad5
That's not exactly what I mean. Basically I would just change the default to n_ctx_train (or other value) in the line that sets slot.params.n_predict = json_value(data, "n_predict", default_params.n_predict);. No need to print any warnings, just document it. The user can check stopped_limit or set a different limit if needed.
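A sketch of that default change, assuming a simplified json_value helper; the map-based json_like type below is a stand-in for the server's JSON handling, not the real code:

```cpp
#include <map>
#include <string>

using json_like = std::map<std::string, int>; // stand-in for the request JSON

// Simplified version of a json_value helper: return the value for `key`
// if present in the request, otherwise the provided default.
int json_value(const json_like & data, const std::string & key, int def) {
    const auto it = data.find(key);
    return it != data.end() ? it->second : def;
}

int main() {
    const int n_ctx_train = 2048;

    json_like request_without_limit;                        // n_predict not set
    json_like request_with_limit = { { "n_predict", 64 } }; // explicit client limit

    // Instead of defaulting to -1 (unbounded), default to n_ctx_train:
    const int a = json_value(request_without_limit, "n_predict", n_ctx_train); // 2048
    const int b = json_value(request_with_limit,    "n_predict", n_ctx_train); // 64

    return (a == n_ctx_train && b == 64) ? 0 : 1;
}
```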
I see, I am OK with both solutions, even if it will be a sort of breaking change to always set n_predict. AFAIK not all models hallucinate, and not on every completion; plus it should normally always emit the EOS token if the trained chat template is used in the chat completion endpoint.
@ggerganov up to you, but we need to address this recurring infinite-loop concern in some way.
📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 460 iterations 🚀
- Concurrent users: 8, duration: 10m
- HTTP request : avg=10248.24ms p(95)=27332.83ms fails=, finish reason: stop=410 truncated=50
- Prompt processing (pp): avg=118.34tk/s p(95)=535.67tk/s
- Token generation (tg): avg=24.58tk/s p(95)=38.21tk/s
- ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=hp/server/avoid-infinite-loop commit=6c257f4709b1848a8d7bd73daf95aec763e7a4f5
[Benchmark charts omitted: llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 460 iterations; time series of llamacpp:prompt_tokens_seconds, llamacpp:predicted_tokens_seconds, llamacpp:kv_cache_usage_ratio, and llamacpp:requests_processing.]
This would be simple if context shifting was opt-in; then there would always be a hard limit of n_ctx tokens. I am not sure that enabling context shift by default and without a way to disable it is a good idea: it can result in very poor generation quality and it is not the normal behavior for LLM providers, as far as I know.
Oh yes, and it is so slow in the current implementation, blocking the whole server.
@ggerganov up to you, but we need to address this recurring infinite-loop concern in some way.
@ggerganov I think with the removal of hard-coded stop tokens, this PR is becoming more important.
Maybe it would be simpler to set n_predict to n_ctx_train by default if not set in the request.
Yes, let's do that. Context-shift has to be refactored and become optional (in a future PR)
@ggerganov @slaren Finally, I prefer to keep checking at each token that we do not exceed n_ctx_train, because one can simply pass n_predict = -1 in the request payload and the server would still go into an infinite loop. I feel this approach is safer, with a proper warning.