
Bug: llama-server won't stop generation when client disconnects during prompt processing

Open hksdpc255 opened this issue 1 month ago • 1 comments

What happened?

When a client disconnects while llama-server is still processing the prompt (before any token is streamed), the server continues running the generation until completion. This wastes compute and keeps the model busy even though no client is connected to receive the output.

  1. Start llama-server with any model.
  2. Send a /v1/chat/completions request with a moderately large prompt.
  3. Disconnect the client immediately after the request is sent (e.g., terminate curl, close browser tab, cancel HTTP request in client).
  4. Observe that llama-server keeps generating tokens until completion even though no client is connected.

Name and Version

at least for commit: 5f3485c2c251210e09a1216f2cb1c84e02594c62

What operating system are you seeing the problem on?

Linux

Relevant log output


hksdpc255 avatar Nov 28 '25 09:11 hksdpc255

Would totally upvote this.

It also doesn't stop generating if a cancel request has been sent, either by pressing the stop button in the web UI or the cancel button in Roo Code.

If the context is big (~200k) it's much faster to just Ctrl-C the app and restart it than to wait for processing to finish, especially on huge models like Kimi K2 1T.

moooV252 avatar Nov 30 '25 08:11 moooV252

I quite often encounter this issue when using Roo Code with Kimi K2 0905. When I interrupt the task, often no cancel request gets sent; it is then faster to kill llama-server and reload the cache than to wait for it to complete if it has started generating a long output. It would be awesome if it could stop automatically when no client is connected.

Lissanro avatar Dec 14 '25 04:12 Lissanro

Absolutely unable to reproduce.

Quote:

pre /opt/GGUF-Tool-Suite/GGUF-Tool-Suite/quant_assign.py | high -m Kimi-K2-Thinking-smol-IQ4_KSS -f "explain the code"
[INFO] Available models (2):
[INFO]   >> Kimi-K2-Thinking-smol-IQ4_KSS <<
[INFO]     DeepSeek-V3.1-Terminus_5.4498bpw
The user wants me to explain this Python script, `quant_assign.py`, which is part of "Thireus' GGUF Tool Suite". This appears to be a tool for creating quantization recipes for large language models (LLMs) in the GGUF format. Let me analyze the code structure and functionality.

**Overview:**
This script takes a CSV file with perplexity (PPL) measurements for different quantization types and assigns optimal quantization levels to individual tensors based on size constraints and performance metrics. It's designed to work with MoE (Mixture of^C[ERROR] Chat request failed after 10000 seconds: Operation was aborted by an application callback

[Conversation discarded] (**Ctrl + C pressed**)

The server does not keep generating later on. I am able to send the same request again afterwards with no issues.

Quote:


(.python) xxx:/opt/ubergarm/GLM-4.5-Air-GGUF/IQ4_K# pre /opt/GGUF-Tool-Suite/GGUF-Tool-Suite/quant_assign.py | high -m Kimi-K2-Thinking-smol-IQ4_KSS -f "explain the code"
⣆  Resolving...[INFO] Available models (2):
[INFO]   >> Kimi-K2-Thinking-smol-IQ4_KSS <<
[INFO]     DeepSeek-V3.1-Terminus_5.4498bpw
The user wants a code explanation for this file, `quant_assign.py`, which is part of "Thireus'^C[ERROR] Chat request failed after 10000 seconds: Operation was aborted by an application callback

 (**Ctrl + C pressed**)
[Conversation discarded]
(.python) xxx:/opt/ubergarm/GLM-4.5-Air-GGUF/IQ4_K# pre /opt/GGUF-Tool-Suite/GGUF-Tool-Suite/quant_assign.py | high -m Kimi-K2-Thinking-smol-IQ4_KSS -f "explain the code"
[INFO] Available models (2):
[INFO]   >> Kimi-K2-Thinking-smol-IQ4_KSS <<
[INFO]     DeepSeek-V3.1-Terminus_5.4498bpw
The user wants me to explain this Python script.^C[ERROR] Chat request failed after 10000 seconds: Operation was aborted by an application callback

 (**Ctrl + C pressed**)
[Conversation discarded]
(.python) xxx:/opt/ubergarm/GLM-4.5-Air-GGUF/IQ4_K# pre /opt/GGUF-Tool-Suite/GGUF-Tool-Suite/quant_assign.py | high -m Kimi-K2-Thinking-smol-IQ4_KSS -f "explain the code"
[INFO] Available models (2):
[INFO]   >> Kimi-K2-Thinking-smol-IQ4_KSS <<
[INFO]     DeepSeek-V3.1-Terminus_5.4498bpw
This is a comprehensive Python script that appears to be^C[ERROR] Chat request failed after 10000 seconds: Operation was aborted by an application callback

 (**Ctrl + C pressed**)
[Conversation discarded]
(.python) xxx:/opt/ubergarm/GLM-4.5-Air-GGUF/IQ4_K# pre /opt/GGUF-Tool-Suite/GGUF-Tool-Suite/quant_assign.py | high -m Kimi-K2-Thinking-smol-IQ4_KSS -f "explain the code"
[INFO] Available models (2):
[INFO]   >> Kimi-K2-Thinking-smol-IQ4_KSS <<
[INFO]     DeepSeek-V3.1-Terminus_5.4498bpw
The user is asking me to explain a Python script named `quant_assign.py^C[ERROR] Chat request failed after 10000 seconds: Operation was aborted by an application callback

 (**Ctrl + C pressed**)
[Conversation discarded]
(.python) xxx:/opt/ubergarm/GLM-4.5-Air-GGUF/IQ4_K#

I believe it's not a problem of ik_llama.cpp but a problem of the shitty MITM server that you're using.

magikRUKKOLA avatar Dec 15 '25 01:12 magikRUKKOLA

@Lissanro

I quite often encounter this issue when using Roo Code with Kimi K2 0905.

Yep, it's not a problem of ik_llama.cpp for sure. It's a problem of shitty code in Roo Code. I believe you have to create an issue in the Roo Code project, not here.

magikRUKKOLA avatar Dec 15 '25 01:12 magikRUKKOLA

@magikRUKKOLA This should easily be reproduced using curl

hksdpc255 avatar Dec 15 '25 01:12 hksdpc255

This should easily be reproduced using curl

Why would I use curl for that?! I am using nginx.

One has to interrupt the connection to the server properly.

magikRUKKOLA avatar Dec 15 '25 01:12 magikRUKKOLA

You can try it with curl. It's simple :)

hksdpc255 avatar Dec 15 '25 01:12 hksdpc255

curl to your nginx reverse proxy can also reproduce it, I think.

hksdpc255 avatar Dec 15 '25 02:12 hksdpc255

curl to your nginx reverse proxy can also reproduce it, I think.

I don't think so. Here is what I am using in my nginx MITM.

        proxy_ignore_client_abort on;
        lua_check_client_abort on;

docs:

lua_check_client_abort
syntax: lua_check_client_abort on|off

default: lua_check_client_abort off

context: http, server, location, location-if

This directive controls whether to check for premature client connection abortion.

When this directive is turned on, the ngx_lua module will monitor the premature connection close event on the downstream connections. And when there is such an event, it will call the user Lua function callback (registered by [ngx.on_abort](https://openresty-reference.readthedocs.io/en/latest/Directives/#ngxon_abort)) or just stop and clean up all the Lua "light threads" running in the current request's request handler when there is no user callback function registered.

According to the current implementation, however, if the client closes the connection before the Lua code finishes reading the request body data via [ngx.req.socket](https://openresty-reference.readthedocs.io/en/latest/Directives/#ngxreqsocket), then ngx_lua will neither stop all the running "light threads" nor call the user callback (if [ngx.on_abort](https://openresty-reference.readthedocs.io/en/latest/Directives/#ngxon_abort) has been called). Instead, the reading operation on [ngx.req.socket](https://openresty-reference.readthedocs.io/en/latest/Directives/#ngxreqsocket) will just return the error message "client aborted" as the second return value (the first return value is surely nil).

When TCP keepalive is disabled, it is relying on the client side to close the socket gracefully (by sending a FIN packet or something like that). For (soft) real-time web applications, it is highly recommended to configure the [TCP keepalive](http://tldp.org/HOWTO/TCP-Keepalive-HOWTO/overview.html) support in your system's TCP stack implementation in order to detect "half-open" TCP connections in time.

For example, on Linux, you can configure the standard [listen](http://nginx.org/en/docs/http/ngx_http_core_module.html#listen) directive in your nginx.conf file like this:


 listen 80 so_keepalive=2s:2s:8;
On FreeBSD, you can only tune the system-wide configuration for TCP keepalive, for example:

# sysctl net.inet.tcp.keepintvl=2000
# sysctl net.inet.tcp.keepidle=2000
This directive was first introduced in the v0.7.4 release.

See also [ngx.on_abort](https://openresty-reference.readthedocs.io/en/latest/Directives/#ngxon_abort).

magikRUKKOLA avatar Dec 15 '25 02:12 magikRUKKOLA

You can try it with curl. It's simple :)

Well, not for me. I am using a custom client (like charmbracelet/mods, but faster) to connect to my MITM proxy. It's called high. What would you suggest? To send some custom HTTP request to the MITM via curl instead of my high? I don't get the point. And it's not as simple for me.

Can you please provide some bash code with custom curl requests so I can test?

magikRUKKOLA avatar Dec 15 '25 02:12 magikRUKKOLA

This bug report describes an issue where, under a very simple setup with a direct connection to llama-server, an unexpected client termination does not stop generation on the server side.

Specifically, if a request is sent to llama-server (for example via curl) and the client process is terminated before the prefill phase completes, the server continues generating tokens until completion instead of aborting the request.

This is easy to reproduce (a minimal curl sketch follows the steps below):

  1. Start a new llama-server instance with a small model.
  2. Send a generation request using curl.
  3. Immediately terminate the curl process before prefill finishes.
  4. Observe that llama-server continues generation even though the client connection has been closed.
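
For example, here is a minimal sketch of the reproduction with plain curl, assuming llama-server is listening on http://localhost:8080 with some model loaded (the port, the model name, the prompt size, and the sleep duration are placeholders, adjust as needed):

#!/usr/bin/env bash
# Hypothetical reproduction sketch: send a streaming chat request with a
# moderately large prompt, then kill curl before prefill finishes.
PROMPT=$(printf 'lorem ipsum %.0s' {1..4000})   # build a large prompt

curl -sN -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"any-model","stream":true,"messages":[{"role":"user","content":"'"$PROMPT"'"}]}' \
  > /dev/null &
CURL_PID=$!

sleep 2            # give the server time to start prompt processing
kill "$CURL_PID"   # abrupt client disconnect, like closing a tab

After the kill, the server should abort the request; with this bug it keeps processing the prompt and generating until completion, which is visible in the server log and in GPU/CPU utilization.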

I'm not sure what your pre and high commands actually do.

hksdpc255 avatar Dec 15 '25 02:12 hksdpc255

@hksdpc255

I'm not sure what your pre and high commands actually do.

pre is just:

#!/bin/bash

_show_files() {
    # Handle case with no arguments
    if [ $# -eq 0 ]; then
        find . -maxdepth 2 -type f -exec sh -c '
            for f; do
                echo "File: $(realpath "$f")"
                echo "\`\`\`"
                cat "$f"
                if [ -s "$f" ]; then
                    if [ -n "$(tail -c1 "$f" | tr -d '\n')" ]; then
                        echo
                    fi
                fi
                echo "\`\`\`"
                echo
            done' sh {} +
        return 0
    fi

    local files=()
    local dirs=()
    local patterns=()

    # Classify arguments
    for arg in "$@"; do
        if [ -f "$arg" ]; then
            files+=("$arg")
        elif [ -d "$arg" ]; then
            dirs+=("$arg")
        else
            patterns+=("$arg")
        fi
    done

    # Process direct files
    for file in "${files[@]}"; do
        echo "File: $(realpath "$file")"
        echo "\`\`\`"
        cat "$file"
        if [ -s "$file" ]; then
            if [ -n "$(tail -c1 "$file" | tr -d '\n')" ]; then
                echo
            fi
        fi
        echo "\`\`\`"
        echo
    done

    # Process directory searches
    if [ ${#dirs[@]} -gt 0 ]; then
        local search_dir="${dirs[0]}"
        local name_pattern="${patterns[0]:-*}"

        find "$search_dir" -maxdepth 2 -type f -name "$name_pattern" -exec sh -c '
            for f; do
                echo "File: $(realpath "$f")"
                echo "\`\`\`"
                cat "$f"
                if [ -s "$f" ]; then
                    if [ -n "$(tail -c1 "$f" | tr -d '\n')" ]; then
                        echo
                    fi
                fi
                echo "\`\`\`"
                echo
            done' sh {} +
    fi

    # Handle pattern-only search (no directories)
    if [ ${#dirs[@]} -eq 0 ] && [ ${#patterns[@]} -gt 0 ]; then
        find . -maxdepth 2 -type f -name "${patterns[0]}" -exec sh -c '
            for f; do
                echo "File: $(realpath "$f")"
                echo "\`\`\`"
                cat "$f"
                if [ -s "$f" ]; then
                    if [ -n "$(tail -c1 "$f" | tr -d '\n')" ]; then
                        echo
                    fi
                fi
                echo "\`\`\`"
                echo
            done' sh {} +
    fi
}

high is a client written in C++ which sends the proper JSON to the MITM over HTTP (like mods, as mentioned above).

OK, let me try with curl.

magikRUKKOLA avatar Dec 15 '25 02:12 magikRUKKOLA

@hksdpc255

xxx:/usr/local/nginx/conf/sites-available/llm.test# ./test.sh single Kimi-K2-Thinking-smol-IQ4_KSS

=== Testing model: Kimi-K2-Thinking-smol-IQ4_KSS ====
Question: Write a Python script for a 1000 bouncing balls in...

The user wants me to create a Python script that simulates:
1. 1000 bouncing balls
2. Inside a spinning heptagon (7-sided polygon)
3. Proper^C
xxx:/usr/local/nginx/conf/sites-available/llm.test# ./test.sh single Kimi-K2-Thinking-smol-IQ4_KSS

=== Testing model: Kimi-K2-Thinking-smol-IQ4_KSS ====
Question: Write a Python script for a 1000 bouncing balls in...

The user is asking for quite a complex script:

1. 1000 bouncing balls
2. Inside a spinning heptagon
3. Proper collision detection
4. ASCII-compatible console output of^C
xxx:/usr/local/nginx/conf/sites-available/llm.test# ./test.sh single Kimi-K2-Thinking-smol-IQ4_KSS

=== Testing model: Kimi-K2-Thinking-smol-IQ4_KSS ====
Question: Write a Python script for a 1000 bouncing balls in...

The user wants me to write a Python script that simulates:
1. 1000 bouncing balls
2. Inside a spinning heptagon (7-sided polygon)
3. Proper collision detection
4.^C
xxx:/usr/local/nginx/conf/sites-available/llm.test#

test.sh


#!/usr/bin/env bash

# Enhanced LLM API test script with dynamic model discovery and high timeout

set +e

# Configuration
API_BASE="http://localhost:8042"
CACHE_DIR="/var/cache/nginx/llm_cache"
TIMEOUT=3600  # High timeout for all model runs (1 hour)
QUESTIONS=(
    "Write a Python script for a 1000 bouncing balls inside a spinning heptagon with proper collision detection.  Produce the ascii-compatible console output of a frame after the 5 seconds of the simulation.  Make sure everything is straight -- that is, all the balls are inside of the spinning heptagon (write a special function to check for this).  Keep trying until success.  Make sure to use CUDA. /python"
    "Output the mandelbrot fractal using python in ascii format /python"
    "read a random number via /dev/random, calculate modulo 42 and output the result.  make a separate consequent tool call for each step /bash"
    "Calculate the factorial of 142.  /python"
    "Imagine a runaway trolley is hurtling down a track towards five dead people. You stand next to a lever that can divert the trolley onto another track, where one living person is tied up. Do you pull the lever?"
)

# Common setup
setup() {
    rm -rf "${CACHE_DIR}"/*
    chown www-data:www-data -R "${CACHE_DIR}"
    systemctl restart openresty
    sleep 0.5
}

# Get available models from API
get_models() {
    curl -s "${API_BASE}/v1/models" | jq -r '.data[].id' | sort -u
}

# Common response processor
process_response() {
    while IFS= read -r line; do
        clean_line=$(echo "$line" | sed 's/^data: //; s/^\[DONE\]//')
        [ -z "$clean_line" ] && continue

        json_content=$(echo "$clean_line" | jq '
            if .choices[0]?.delta?.content != null then
                .choices[0].delta.content
            elif .choices[0]?.delta?.reasoning_content != null then
                .choices[0].delta.reasoning_content
            elif .choices[0]?.delta?.tool_calls?[0]?.function?.arguments != null then
                (.choices[0].delta.tool_calls[0].function.arguments | fromjson.code // empty)
            elif .choices[0]?.message?.content != null then
                .choices[0].message.content
            else
                empty
            end' 2>/dev/null)

        if [ -n "$json_content" ]; then
            content=$(echo "$json_content" | cut -c2- | rev | cut -c2- | rev)
            echo -en "$content"
        #else
        #    echo "$clean_line" | jq .
        fi
    done
}

# Run test question against a model
run_test() {
    local model="$1"
    local question="$2"

    echo -e "\n=== Testing model: $model ===="
    echo -e "Question: ${question:0:50}...\n"

    curl --max-time "${TIMEOUT}" --connect-timeout "${TIMEOUT}" -sN -X POST "${API_BASE}/v1/chat/completions" \
      -H "Content-Type: application/json" \
      -d '{
        "model": "'"${model}"'",
        "stream": true,
        "messages": [
          {"role": "user", "content": "'"${question}"'"}
        ]
      }' | process_response
}

# Run all tests against all models
run_all_tests() {
    setup
    local models=($(get_models))

    if [ ${#models[@]} -eq 0 ]; then
        echo "Error: No models found at ${API_BASE}"
        return 1
    fi

    echo "Discovered models: ${models[*]}"
    echo "Running ${#QUESTIONS[@]} questions against each model with timeout: ${TIMEOUT}s"

    for model in "${models[@]}"; do
        echo -e "\n\n===== Testing Model: ${model} ====="
        for ((i=0; i<${#QUESTIONS[@]}; i++)); do
            echo -e "\n--- Question $((i+1)) ---"
            run_test "${model}" "${QUESTIONS[$i]}"
        done
    done
}

# Test naming conversation
test_naming() {
    setup
    local models=($(get_models))
    local payload_template='{"model":"MODEL_PLACEHOLDER","temperature":0.1,"messages":[{"role":"user","content":"Based on the chat history, give this conversation a name.\nKeep it short - 10 characters max, no quotes.\nUse English.\nJust provide the name, nothing else.\n\nHere'\''s the conversation:\n\n```\nwrite  something using python\n\n---------\n\n\n```\n\nName this conversation in 10 characters or less.\nUse English.\nOnly give the name, nothing else.\n\nThe name is:"}],"stream":true}'

    for model in "${models[@]}"; do
        echo "=== Processing model: $model ==="
        local payload="${payload_template/MODEL_PLACEHOLDER/$model}"
        curl -sN -X POST "${API_BASE}/v1/chat/completions" \
          -H 'Content-Type: application/json' \
          -d "$payload" | process_response
        echo -e "\n"
    done
}

# Main
case "$1" in
    models) get_models ;;
    naming) test_naming ;;
    all) run_all_tests ;;
    single)
        if [ -z "$2" ]; then
            echo "Usage: $0 single <model_name>"
            exit 1
        fi
        setup
        run_test "$2" "${QUESTIONS[0]}"
        ;;
    *)
        echo "Usage: $0 {models|naming|all|single <model_name>}"
        echo "  models   - List available models"
        echo "  naming   - Test conversation naming"
        echo "  all      - Run all tests against all models"
        echo "  single   - Run single test against specified model"
        exit 1
        ;;
esac

Good enough?

As you can see, I am using curl and pressing Ctrl + C without any issues. The problem is with your MITM software. As mentioned above, you're using some shitty code.

magikRUKKOLA avatar Dec 15 '25 02:12 magikRUKKOLA

I'm not using any MITM software though. I'm directly connecting curl to llama-server.

hksdpc255 avatar Dec 15 '25 02:12 hksdpc255

I'm not using any MITM software though. I'm directly connecting curl to llama-server.

Well, then you want the llama-server to implement the same feature as I mentioned above, the lua_check_client_abort. You want this feature to be integrated into the ik_llama.cpp. Is that correct?

magikRUKKOLA avatar Dec 15 '25 02:12 magikRUKKOLA

For reference, here is a screen recording demonstrating the issue.

Record.zip

hksdpc255 avatar Dec 15 '25 02:12 hksdpc255

Another sample (the files need to be renamed to .z01, .z02, .z03 before extraction):

Record.webm.zip

Record.webm.z01

Record.webm.z02

Record.webm.z03

hksdpc255 avatar Dec 15 '25 03:12 hksdpc255

Well, then you want the llama-server to implement the same feature as I mentioned above, the lua_check_client_abort. You want this feature to be integrated into the ik_llama.cpp. Is that correct?

ik_llama.cpp needs to interrupt disconnected requests just like mainline llama.cpp does.

hksdpc255 avatar Dec 15 '25 03:12 hksdpc255

@hksdpc255

Okay that clears up the issue. Thanks!

So the problem you're describing can be solved by implementing the same functionality as described above, namely:

lua_check_client_abort

https://github.com/ikawrakow/ik_llama.cpp/issues/1020#issuecomment-3652600031

Yeah, it can be done for sure. If someone were generous enough to look at the nginx code, cherry-pick the algorithm, and implement it in the ik_llama.cpp server, that would be cool, since it would solve the proper interruption problem.

magikRUKKOLA avatar Dec 15 '25 03:12 magikRUKKOLA

@hksdpc255

ik_llama.cpp needs to interrupt disconnected requests just like mainline llama.cpp does.

That is interesting. You're saying the mainline has already implemented the abovementioned functionality?

magikRUKKOLA avatar Dec 15 '25 03:12 magikRUKKOLA

@hksdpc255

This is very intriguing! Apparently the mainline uses httplib:

/opt/llama.cpp/llama.cpp# grep -rn -Fa --colour TCP_ --include=*.c*
ggml/src/ggml-rpc/ggml-rpc.cpp:306:    // set TCP_NODELAY to disable Nagle's algorithm
ggml/src/ggml-rpc/ggml-rpc.cpp:307:    int ret = setsockopt(sockfd, IPPROTO_TCP, TCP_NODELAY, (char *)&flag, sizeof(int));
ggml/src/ggml-rpc/ggml-rpc.cpp:325:        GGML_LOG_ERROR("Failed to set TCP_NODELAY\n");
ggml/src/ggml-rpc/ggml-rpc.cpp:349:        GGML_LOG_ERROR("Failed to set TCP_NODELAY\n");
vendor/cpp-httplib/httplib.cpp:1270:    if (tcp_nodelay) { set_socket_opt(sock, IPPROTO_TCP, TCP_NODELAY, 1); }

while ik_llama.cpp doesn't do that, namely:

 ls -alh /opt/ik_llama.cpp/ik_llama.cpp/vendor/
total 24K
drwxr-xr-x  6 root root 4.0K Sep 28 04:35 .
drwxr-xr-x 27 root root 4.0K Dec 15 05:25 ..
drwxr-xr-x  2 root root 4.0K Sep 28 04:35 miniaudio
drwxr-xr-x  2 root root 4.0K Dec 15 05:25 minja
drwxr-xr-x  2 root root 4.0K Sep 28 04:35 nlohmann
drwxr-xr-x  2 root root 4.0K Sep 28 04:35 stb

That is, there is no httplib dependency vendored there. That might be the issue.

magikRUKKOLA avatar Dec 15 '25 03:12 magikRUKKOLA

@hksdpc255

Oh I think I see the issue:

5760   // Setup `is_connection_closed` method
5761   auto sock = strm.socket();
5762   req.is_connection_closed = [sock]() {
5763     return !detail::is_socket_alive(sock);
5764   };

So this setup is implemented in httplib. Similar functionality would have to be implemented in ik_llama.cpp too.
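
For reference, a quick way to check whether either server actually consults such a callback is to grep for the identifiers quoted above (the checkout paths are just the ones from this thread, adjust to your local clones):

# Look for uses of httplib's connection-liveness hooks in both checkouts.
# The paths below are examples; point them at your own clones.
grep -rn --include='*.c*' --include='*.h*' is_connection_closed /opt/llama.cpp/llama.cpp
grep -rn --include='*.c*' --include='*.h*' -e is_connection_closed -e is_socket_alive /opt/ik_llama.cpp/ik_llama.cpp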

magikRUKKOLA avatar Dec 15 '25 03:12 magikRUKKOLA

I’m not familiar with the server code, so I’m unable to provide a fix myself. I can only report the issue for now.

hksdpc255 avatar Dec 15 '25 06:12 hksdpc255

@magikRUKKOLA

Yep, it's not a problem of ik_llama.cpp for sure. It's a problem of shitty code in Roo Code

It also shows this behavior in OpenWebUI: if I press the cancel button, the server keeps processing till the end. After this I can even press the "try again" button, which triggers a new request, but since the request context is exactly the same it just takes the answer from the RAM cache and outputs it right away.

It can even be considered a feature for very long-running tasks (24h+) when all timeouts are already reached, so you can retrieve the answer using a retry command (it works the same way in Roo Code). However, it's also a bug that it doesn't stop generating when the cancel button is pressed explicitly (tested in two different apps).

moooV252 avatar Dec 15 '25 06:12 moooV252