Bug: llama-server won't stop generation when client disconnects during prompt processing
What happened?
When a client disconnects while llama-server is still processing the prompt (before any token is streamed), the server continues running the generation until completion. This wastes compute and keeps the model busy even though no client is connected to receive the output.
- Start llama-server with any model.
- Send a /v1/chat/completions request with a moderately large prompt.
- Disconnect the client immediately after the request is sent (e.g., terminate curl, close browser tab, cancel HTTP request in client).
- Observe that llama-server keeps generating tokens until completion even though no client is connected.
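For example, a minimal sketch of such a reproduction, assuming a llama-server listening on http://localhost:8080 (host, port, and prompt are illustrative; adjust to your setup):
# Send a streaming chat request with a moderately large prompt in the background.
PROMPT=$(printf 'Explain this paragraph in detail. %.0s' {1..2000})
curl -sN -X POST http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"stream": true, "messages": [{"role": "user", "content": "'"${PROMPT}"'"}]}' \
    > /dev/null &
CURL_PID=$!

# Disconnect before any token has been streamed back, i.e. during prompt processing.
sleep 1
kill "$CURL_PID"

# Now watch the llama-server log (or GPU utilization): the request is not aborted
# and the server keeps processing and generating until completion.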
Name and Version
At least as of commit 5f3485c2c251210e09a1216f2cb1c84e02594c62.
What operating system are you seeing the problem on?
Linux
Relevant log output
Would totally upvote this.
It also doesn't stop generating if a cancel request has been sent, either by pressing the stop button in the web UI or the cancel button in Roo Code.
If the context is big (~200k), it's much faster to just Ctrl-C the app and restart it than to wait for processing to finish, especially on huge models like Kimi K2 1T.
I quite often encounter this issue when using Roo Code with Kimi K2 0905. When I interrupt the task, often no cancel request gets sent; it is then faster to kill llama-server and reload the cache than to wait for it to finish if it has started generating a long output. It would be awesome if it could stop automatically when no client is connected.
Absolutely unable to reproduce.
Quote:
pre /opt/GGUF-Tool-Suite/GGUF-Tool-Suite/quant_assign.py | high -m Kimi-K2-Thinking-smol-IQ4_KSS -f "explain the code"
[INFO] Available models (2):
[INFO] >> Kimi-K2-Thinking-smol-IQ4_KSS <<
[INFO] DeepSeek-V3.1-Terminus_5.4498bpw
The user wants me to explain this Python script, `quant_assign.py`, which is part of "Thireus' GGUF Tool Suite". This appears to be a tool for creating quantization recipes for large language models (LLMs) in the GGUF format. Let me analyze the code structure and functionality.
**Overview:**
This script takes a CSV file with perplexity (PPL) measurements for different quantization types and assigns optimal quantization levels to individual tensors based on size constraints and performance metrics. It's designed to work with MoE (Mixture of^C[ERROR] Chat request failed after 10000 seconds: Operation was aborted by an application callback
[Conversation discarded] (**Ctrl + C pressed**)
The server does not keep generating afterwards. I am able to send the same request again with no issues.
Quote:
(.python) xxx:/opt/ubergarm/GLM-4.5-Air-GGUF/IQ4_K# pre /opt/GGUF-Tool-Suite/GGUF-Tool-Suite/quant_assign.py | high -m Kimi-K2-Thinking-smol-IQ4_KSS -f "explain the code"
⣆ Resolving...[INFO] Available models (2):
[INFO] >> Kimi-K2-Thinking-smol-IQ4_KSS <<
[INFO] DeepSeek-V3.1-Terminus_5.4498bpw
The user wants a code explanation for this file, `quant_assign.py`, which is part of "Thireus'^C[ERROR] Chat request failed after 10000 seconds: Operation was aborted by an application callback
(**Ctrl + C pressed**)
[Conversation discarded]
(.python) xxx:/opt/ubergarm/GLM-4.5-Air-GGUF/IQ4_K# pre /opt/GGUF-Tool-Suite/GGUF-Tool-Suite/quant_assign.py | high -m Kimi-K2-Thinking-smol-IQ4_KSS -f "explain the code"
[INFO] Available models (2):
[INFO] >> Kimi-K2-Thinking-smol-IQ4_KSS <<
[INFO] DeepSeek-V3.1-Terminus_5.4498bpw
The user wants me to explain this Python script.^C[ERROR] Chat request failed after 10000 seconds: Operation was aborted by an application callback
(**Ctrl + C pressed**)
[Conversation discarded]
(.python) xxx:/opt/ubergarm/GLM-4.5-Air-GGUF/IQ4_K# pre /opt/GGUF-Tool-Suite/GGUF-Tool-Suite/quant_assign.py | high -m Kimi-K2-Thinking-smol-IQ4_KSS -f "explain the code"
[INFO] Available models (2):
[INFO] >> Kimi-K2-Thinking-smol-IQ4_KSS <<
[INFO] DeepSeek-V3.1-Terminus_5.4498bpw
This is a comprehensive Python script that appears to be^C[ERROR] Chat request failed after 10000 seconds: Operation was aborted by an application callback
(**Ctrl + C pressed**)
[Conversation discarded]
(.python) xxx:/opt/ubergarm/GLM-4.5-Air-GGUF/IQ4_K# pre /opt/GGUF-Tool-Suite/GGUF-Tool-Suite/quant_assign.py | high -m Kimi-K2-Thinking-smol-IQ4_KSS -f "explain the code"
[INFO] Available models (2):
[INFO] >> Kimi-K2-Thinking-smol-IQ4_KSS <<
[INFO] DeepSeek-V3.1-Terminus_5.4498bpw
The user is asking me to explain a Python script named `quant_assign.py^C[ERROR] Chat request failed after 10000 seconds: Operation was aborted by an application callback
(**Ctrl + C pressed**)
[Conversation discarded]
(.python) xxx:/opt/ubergarm/GLM-4.5-Air-GGUF/IQ4_K#
I believe it's not a problem of ik_llama.cpp but a problem of the shitty MITM server that you're using.
@Lissanro
I quite often encounter this issue when using Roo Code with Kimi K2 0905.
Yep, it's not a problem of ik_llama.cpp for sure. It's a problem of shitty code in Roo Code. I believe you have to create an issue in the Roo Code project, not here.
@magikRUKKOLA This should easily be reproduced using curl
This should easily be reproduced using curl
Why would I use curl for that?! I am using nginx.
One has to interrupt the connection to the server properly.
You can try it with curl. It's simple :)
curl to your nginx reverse-proxy can also reproduce I think.
curl to your nginx reverse-proxy can also reproduce I think.
I don't think so. Here is what I am using in my nginx MITM:
proxy_ignore_client_abort on;
lua_check_client_abort on;
docs:
lua_check_client_abort
syntax: lua_check_client_abort on|off
default: lua_check_client_abort off
context: http, server, location, location-if
This directive controls whether to check for premature client connection abortion.
When this directive is turned on, the ngx_lua module will monitor the premature connection close event on the downstream connections. And when there is such an event, it will call the user Lua function callback (registered by [ngx.on_abort](https://openresty-reference.readthedocs.io/en/latest/Directives/#ngxon_abort)) or just stop and clean up all the Lua "light threads" running in the current request's request handler when there is no user callback function registered.
According to the current implementation, however, if the client closes the connection before the Lua code finishes reading the request body data via [ngx.req.socket](https://openresty-reference.readthedocs.io/en/latest/Directives/#ngxreqsocket), then ngx_lua will neither stop all the running "light threads" nor call the user callback (if [ngx.on_abort](https://openresty-reference.readthedocs.io/en/latest/Directives/#ngxon_abort) has been called). Instead, the reading operation on [ngx.req.socket](https://openresty-reference.readthedocs.io/en/latest/Directives/#ngxreqsocket) will just return the error message "client aborted" as the second return value (the first return value is surely nil).
When TCP keepalive is disabled, it is relying on the client side to close the socket gracefully (by sending a FIN packet or something like that). For (soft) real-time web applications, it is highly recommended to configure the [TCP keepalive](http://tldp.org/HOWTO/TCP-Keepalive-HOWTO/overview.html) support in your system's TCP stack implementation in order to detect "half-open" TCP connections in time.
For example, on Linux, you can configure the standard [listen](http://nginx.org/en/docs/http/ngx_http_core_module.html#listen) directive in your nginx.conf file like this:
listen 80 so_keepalive=2s:2s:8;
On FreeBSD, you can only tune the system-wide configuration for TCP keepalive, for example:
# sysctl net.inet.tcp.keepintvl=2000
# sysctl net.inet.tcp.keepidle=2000
This directive was first introduced in the v0.7.4 release.
See also [ngx.on_abort](https://openresty-reference.readthedocs.io/en/latest/Directives/#ngxon_abort).
You can try it with curl. It's simple :)
Well, not for me. I am using a custom client (like charmbracelet/mods but faster) to connect to my MITM proxy. It's called high. What would you suggest? To send some custom HTTP to the MITM via curl instead of my high? I don't get the point. And it's not as simple for me.
Can you please provide some bash code with custom curl requests so I can test?
This bug report describes an issue where, under a very simple setup with a direct connection to llama-server, an unexpected client termination does not stop generation on the server side.
Specifically, if a request is sent to llama-server (for example via curl) and the client process is terminated before the prefill phase completes, the server continues generating tokens until completion instead of aborting the request.
This is easy to reproduce:
- Start a new llama-server instance with a small model.
- Send a generation request using curl.
- Immediately terminate the curl process before prefill finishes.
- Observe that llama-server continues generation even though the client connection has been closed.
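A one-shot sketch of the same reproduction, assuming the server is reachable at http://localhost:8080 (URL and prompt are placeholders):
# Large prompt so that prefill is still running when the client goes away.
BIG_PROMPT=$(printf 'Summarize the following sentence once more. %.0s' {1..3000})

# timeout(1) kills curl after 2 seconds, i.e. before prefill finishes.
timeout 2 curl -sN -X POST http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"stream": true, "messages": [{"role": "user", "content": "'"${BIG_PROMPT}"'"}]}'

# The connection is closed, yet the server log shows the slot still processing
# the prompt and then generating the full response.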
I'm not sure what your pre and high commands actually do.
@hksdpc255
I'm not sure what your pre and high commands actually do.
pre is just:
#!/bin/bash

_show_files() {
    # Handle case with no arguments
    if [ $# -eq 0 ]; then
        find . -maxdepth 2 -type f -exec sh -c '
            for f; do
                echo "File: $(realpath "$f")"
                echo "\`\`\`"
                cat "$f"
                if [ -s "$f" ]; then
                    # \n is double-quoted so the surrounding single-quoted sh -c body is not broken
                    if [ -n "$(tail -c1 "$f" | tr -d "\n")" ]; then
                        echo
                    fi
                fi
                echo "\`\`\`"
                echo
            done' sh {} +
        return 0
    fi

    local files=()
    local dirs=()
    local patterns=()

    # Classify arguments
    for arg in "$@"; do
        if [ -f "$arg" ]; then
            files+=("$arg")
        elif [ -d "$arg" ]; then
            dirs+=("$arg")
        else
            patterns+=("$arg")
        fi
    done

    # Process direct files
    for file in "${files[@]}"; do
        echo "File: $(realpath "$file")"
        echo "\`\`\`"
        cat "$file"
        if [ -s "$file" ]; then
            if [ -n "$(tail -c1 "$file" | tr -d '\n')" ]; then
                echo
            fi
        fi
        echo "\`\`\`"
        echo
    done

    # Process directory searches
    if [ ${#dirs[@]} -gt 0 ]; then
        local search_dir="${dirs[0]}"
        local name_pattern="${patterns[0]:-*}"
        find "$search_dir" -maxdepth 2 -type f -name "$name_pattern" -exec sh -c '
            for f; do
                echo "File: $(realpath "$f")"
                echo "\`\`\`"
                cat "$f"
                if [ -s "$f" ]; then
                    if [ -n "$(tail -c1 "$f" | tr -d "\n")" ]; then
                        echo
                    fi
                fi
                echo "\`\`\`"
                echo
            done' sh {} +
    fi

    # Handle pattern-only search (no directories)
    if [ ${#dirs[@]} -eq 0 ] && [ ${#patterns[@]} -gt 0 ]; then
        find . -maxdepth 2 -type f -name "${patterns[0]}" -exec sh -c '
            for f; do
                echo "File: $(realpath "$f")"
                echo "\`\`\`"
                cat "$f"
                if [ -s "$f" ]; then
                    if [ -n "$(tail -c1 "$f" | tr -d "\n")" ]; then
                        echo
                    fi
                fi
                echo "\`\`\`"
                echo
            done' sh {} +
    fi
}
high is a client written in C++ which sends the proper JSON to the MITM HTTP proxy (similar to mods, as mentioned above).
OK, maybe let me try it with curl.
@hksdpc255
xxx:/usr/local/nginx/conf/sites-available/llm.test# ./test.sh single Kimi-K2-Thinking-smol-IQ4_KSS
=== Testing model: Kimi-K2-Thinking-smol-IQ4_KSS ====
Question: Write a Python script for a 1000 bouncing balls in...
The user wants me to create a Python script that simulates:
1. 1000 bouncing balls
2. Inside a spinning heptagon (7-sided polygon)
3. Proper^C
xxx:/usr/local/nginx/conf/sites-available/llm.test# ./test.sh single Kimi-K2-Thinking-smol-IQ4_KSS
=== Testing model: Kimi-K2-Thinking-smol-IQ4_KSS ====
Question: Write a Python script for a 1000 bouncing balls in...
The user is asking for quite a complex script:
1. 1000 bouncing balls
2. Inside a spinning heptagon
3. Proper collision detection
4. ASCII-compatible console output of^C
xxx:/usr/local/nginx/conf/sites-available/llm.test# ./test.sh single Kimi-K2-Thinking-smol-IQ4_KSS
=== Testing model: Kimi-K2-Thinking-smol-IQ4_KSS ====
Question: Write a Python script for a 1000 bouncing balls in...
The user wants me to write a Python script that simulates:
1. 1000 bouncing balls
2. Inside a spinning heptagon (7-sided polygon)
3. Proper collision detection
4.^C
xxx:/usr/local/nginx/conf/sites-available/llm.test#
test.sh:
#!/usr/bin/env bash
# Enhanced LLM API test script with dynamic model discovery and high timeout
set +e

# Configuration
API_BASE="http://localhost:8042"
CACHE_DIR="/var/cache/nginx/llm_cache"
TIMEOUT=3600 # High timeout for all model runs (1 hour)
QUESTIONS=(
    "Write a Python script for a 1000 bouncing balls inside a spinning heptagon with proper collision detection. Produce the ascii-compatible console output of a frame after the 5 seconds of the simulation. Make sure everything is straight -- that is, all the balls are inside of the spinning heptagon (write a special function to check for this). Keep trying until success. Make sure to use CUDA. /python"
    "Output the mandelbrot fractal using python in ascii format /python"
    "read a random number via /dev/random, calculate modulo 42 and output the result. make a separate consequent tool call for each step /bash"
    "Calculate the factorial of 142. /python"
    "Imagine a runaway trolley is hurtling down a track towards five dead people. You stand next to a lever that can divert the trolley onto another track, where one living person is tied up. Do you pull the lever?"
)

# Common setup
setup() {
    rm -rf "${CACHE_DIR}"/*
    chown www-data:www-data -R "${CACHE_DIR}"
    systemctl restart openresty
    sleep 0.5
}

# Get available models from API
get_models() {
    curl -s "${API_BASE}/v1/models" | jq -r '.data[].id' | sort -u
}

# Common response processor
process_response() {
    while IFS= read -r line; do
        clean_line=$(echo "$line" | sed 's/^data: //; s/^\[DONE\]//')
        [ -z "$clean_line" ] && continue
        json_content=$(echo "$clean_line" | jq '
            if .choices[0]?.delta?.content != null then
                .choices[0].delta.content
            elif .choices[0]?.delta?.reasoning_content != null then
                .choices[0].delta.reasoning_content
            elif .choices[0]?.delta?.tool_calls?[0]?.function?.arguments != null then
                (.choices[0].delta.tool_calls[0].function.arguments | fromjson.code // empty)
            elif .choices[0]?.message?.content != null then
                .choices[0].message.content
            else
                empty
            end' 2>/dev/null)
        if [ -n "$json_content" ]; then
            content=$(echo "$json_content" | cut -c2- | rev | cut -c2- | rev)
            echo -en "$content"
        #else
        #    echo "$clean_line" | jq .
        fi
    done
}

# Run test question against a model
run_test() {
    local model="$1"
    local question="$2"
    echo -e "\n=== Testing model: $model ===="
    echo -e "Question: ${question:0:50}...\n"
    curl --max-time "${TIMEOUT}" --connect-timeout "${TIMEOUT}" -sN -X POST "${API_BASE}/v1/chat/completions" \
        -H "Content-Type: application/json" \
        -d '{
              "model": "'"${model}"'",
              "stream": true,
              "messages": [
                  {"role": "user", "content": "'"${question}"'"}
              ]
            }' | process_response
}

# Run all tests against all models
run_all_tests() {
    setup
    local models=($(get_models))
    if [ ${#models[@]} -eq 0 ]; then
        echo "Error: No models found at ${API_BASE}"
        return 1
    fi
    echo "Discovered models: ${models[*]}"
    echo "Running ${#QUESTIONS[@]} questions against each model with timeout: ${TIMEOUT}s"
    for model in "${models[@]}"; do
        echo -e "\n\n===== Testing Model: ${model} ====="
        for ((i=0; i<${#QUESTIONS[@]}; i++)); do
            echo -e "\n--- Question $((i+1)) ---"
            run_test "${model}" "${QUESTIONS[$i]}"
        done
    done
}

# Test naming conversation
test_naming() {
    setup
    local models=($(get_models))
    local payload_template='{"model":"MODEL_PLACEHOLDER","temperature":0.1,"messages":[{"role":"user","content":"Based on the chat history, give this conversation a name.\nKeep it short - 10 characters max, no quotes.\nUse English.\nJust provide the name, nothing else.\n\nHere'\''s the conversation:\n\n```\nwrite something using python\n\n---------\n\n\n```\n\nName this conversation in 10 characters or less.\nUse English.\nOnly give the name, nothing else.\n\nThe name is:"}],"stream":true}'
    for model in "${models[@]}"; do
        echo "=== Processing model: $model ==="
        local payload="${payload_template/MODEL_PLACEHOLDER/$model}"
        curl -sN -X POST "${API_BASE}/v1/chat/completions" \
            -H 'Content-Type: application/json' \
            -d "$payload" | process_response
        echo -e "\n"
    done
}

# Main
case "$1" in
    models) get_models ;;
    naming) test_naming ;;
    all)    run_all_tests ;;
    single)
        if [ -z "$2" ]; then
            echo "Usage: $0 single <model_name>"
            exit 1
        fi
        setup
        run_test "$2" "${QUESTIONS[0]}"
        ;;
    *)
        echo "Usage: $0 {models|naming|all|single <model_name>}"
        echo "  models - List available models"
        echo "  naming - Test conversation naming"
        echo "  all    - Run all tests against all models"
        echo "  single - Run single test against specified model"
        exit 1
        ;;
esac
Good enough?
As you can see, I am using curl and pressing Ctrl + C without any issues. The problem is with your MITM software. As mentioned above, you're using some shitty code.
I'm not using any MITM software though. I'm directly connecting curl to llama-server.
I'm not using any MITM software though. I'm directly connecting curl to llama-server.
Well, then you want the llama-server to implement the same feature as I mentioned above, the lua_check_client_abort. You want this feature to be integrated into the ik_llama.cpp. Is that correct?
Another sample (the files need to be renamed to .z01, .z02, .z03 before extraction):
Well, then you want the llama-server to implement the same feature as I mentioned above, the lua_check_client_abort. You want this feature to be integrated into the ik_llama.cpp. Is that correct?
ik_llama.cpp needs to interrupt disconnected requests just like the mainline llama.cpp does.
@hksdpc255
Okay that clears up the issue. Thanks!
So the problem you're describing can be solved by implementing the same functionality as described above, namely:
lua_check_client_abort
https://github.com/ikawrakow/ik_llama.cpp/issues/1020#issuecomment-3652600031
Yeah, it can be done for sure. If someone would be generous enough to look up the code of nginx, to cherry-pick the algo and implement it in the server of ik_llama.cpp that would be cool since that would solve the proper interruption problem.
@hksdpc255
ik_llama.cpp needs to interrupt disconnected requests just like the mainline llama.cpp does.
That is interesting. So you're saying the mainline has already implemented the abovementioned functionality?
@hksdpc255
This is very intriguing! Apparently the mainline uses httplib:
/opt/llama.cpp/llama.cpp# grep -rn -Fa --colour TCP_ --include=*.c*
ggml/src/ggml-rpc/ggml-rpc.cpp:306: // set TCP_NODELAY to disable Nagle's algorithm
ggml/src/ggml-rpc/ggml-rpc.cpp:307: int ret = setsockopt(sockfd, IPPROTO_TCP, TCP_NODELAY, (char *)&flag, sizeof(int));
ggml/src/ggml-rpc/ggml-rpc.cpp:325: GGML_LOG_ERROR("Failed to set TCP_NODELAY\n");
ggml/src/ggml-rpc/ggml-rpc.cpp:349: GGML_LOG_ERROR("Failed to set TCP_NODELAY\n");
vendor/cpp-httplib/httplib.cpp:1270: if (tcp_nodelay) { set_socket_opt(sock, IPPROTO_TCP, TCP_NODELAY, 1); }
while ik_llama.cpp doesn't. Namely:
ls -alh /opt/ik_llama.cpp/ik_llama.cpp/vendor/
total 24K
drwxr-xr-x 6 root root 4.0K Sep 28 04:35 .
drwxr-xr-x 27 root root 4.0K Dec 15 05:25 ..
drwxr-xr-x 2 root root 4.0K Sep 28 04:35 miniaudio
drwxr-xr-x 2 root root 4.0K Dec 15 05:25 minja
drwxr-xr-x 2 root root 4.0K Sep 28 04:35 nlohmann
drwxr-xr-x 2 root root 4.0K Sep 28 04:35 stb
that is, there is no httplib dependency in the vendor directory. That might be the issue.
@hksdpc255
Oh I think I see the issue:
5760 // Setup `is_connection_closed` method
5761 auto sock = strm.socket();
5762 req.is_connection_closed = [sock]() {
5763 return !detail::is_socket_alive(sock);
5764 };
So this setup is implemented in httplib. Similar functionality has to be implemented in ik_llama.cpp too.
I’m not familiar with the server code, so I’m unable to provide a fix myself. I can only report the issue for now.
@magikRUKKOLA
Yep, it's not a problem of ik_llama.cpp for sure. It's a problem of shitty code in Roo Code
It also shows this behavior with OpenWebUI: if I press the cancel button, the server keeps processing until the end. After that I can even press the "try again" button, which triggers a new request; since the request context is exactly the same, it just takes the answer from the RAM cache and outputs it right away.
It can even be considered a feature for very long-running tasks (24h+) when all timeouts have already been reached, since you can still retrieve the answer with a retry (Roo Code works the same way). However, it's also a bug that it doesn't stop generating when the cancel button is pressed explicitly (tested in two different apps).