
Enabling prompt_cache makes LocalAI panic

mudler opened this issue 2 years ago • 5 comments

LocalAI version: v0.18.0

Environment, CPU architecture, OS, and Version:

Describe the bug

To Reproduce: enable prompt_cache_all and set prompt_cache_path in the configuration of a llama.cpp-compatible model
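For illustration, a minimal model YAML that should trigger the panic might look like the following. The model name and file are placeholders, and the exact field layout is a sketch of the usual LocalAI model definition; only the two prompt-cache options are the settings under discussion:

```yaml
# Hypothetical model definition - name and model file are placeholders.
name: my-llama-model
backend: llama-cpp
parameters:
  model: my-model.gguf

# Enabling these two options is what reproduces the crash:
prompt_cache_path: cache
prompt_cache_all: true
```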

Expected behavior

Logs N/A (will collect soon)

Additional context

mudler avatar Jun 09 '23 17:06 mudler

:warning::warning::warning::warning::warning:

Hi! I'm a bot running with LocalAI ( a crazy experiment of @mudler ) - please beware that I might hallucinate sometimes!

but.... I can also be funny or helpful :smile_cat: and I can generally provide good tips, or point to places in the documentation or the code to look at, based on what you wrote in the issue.

Don't engage in conversation with me; I don't support replying (yet)!

:warning::warning::warning::warning::warning:

Hello! I'm sorry to hear that you are experiencing a bug with LocalAI. Can you please provide more information on the issue you are facing? What is the exact error message or behavior that you are observing? This will help us better understand the problem and provide a solution.

Sources:

  • https://localai.io/basics/getting_started/index.html
  • https://localai.io/basics/news/index.html
  • https://github.com/go-skynet/LocalAI/tree/master/.github/ISSUE_TEMPLATE/bug_report.md

localai-bot avatar Jul 18 '23 21:07 localai-bot

I have the same issue. LocalAI v3.5.0 with gemma-3-4b-it-qat (on the llama-cpp backend) works fine, except when I update the model config with

prompt_cache_path: cache
prompt_cache_all: true

You can see that cpu-llama-cpp crashes on an assertion:

Sep 10 14:14:30 localai local-ai[151]: 2:14PM DBG GRPC(gemma-3-4b-it-qat-127.0.0.1:38149): stdout [DEBUG] Waiting for results...
Sep 10 14:14:30 localai local-ai[151]: 2:14PM DBG GRPC(gemma-3-4b-it-qat-127.0.0.1:38149): stderr /LocalAI/backend/cpp/llama-cpp-avx512-build/llama.cpp/tools/grpc-server/server.cpp:3401: pos_min == -1, but n_past > 0 - should not happen: https://github.com/ggml-org/llama.cpp/pull/13833#discussion_r2116181237
Sep 10 14:14:30 localai local-ai[151]: 2:14PM DBG GRPC(gemma-3-4b-it-qat-127.0.0.1:38149): stderr slot launch_slot_: id  1 | task 29 | processing task
Sep 10 14:14:30 localai local-ai[151]: 2:14PM DBG GRPC(gemma-3-4b-it-qat-127.0.0.1:38149): stderr slot update_slots: id  1 | task 29 | new prompt, n_ctx_slot = 4096, n_keep = 0, n_prompt_tokens = 303
Sep 10 14:14:30 localai local-ai[151]: 2:14PM DBG GRPC(gemma-3-4b-it-qat-127.0.0.1:38149): stderr slot update_slots: id  1 | task 29 | n_past = 303, cache_tokens.size() = 330, seq_id = 1, pos_min = -1
Sep 10 14:14:30 localai local-ai[151]: 2:14PM DBG GRPC(gemma-3-4b-it-qat-127.0.0.1:38149): stderr /localai/backends/cpu-llama-cpp/llama-cpp-avx512(+0x84739b)[0x7dc09c84739b]
Sep 10 14:14:30 localai local-ai[151]: 2:14PM DBG GRPC(gemma-3-4b-it-qat-127.0.0.1:38149): stderr /localai/backends/cpu-llama-cpp/llama-cpp-avx512(+0x84795f)[0x7dc09c84795f]
Sep 10 14:14:30 localai local-ai[151]: 2:14PM DBG GRPC(gemma-3-4b-it-qat-127.0.0.1:38149): stderr /localai/backends/cpu-llama-cpp/llama-cpp-avx512(+0x847b2e)[0x7dc09c847b2e]
Sep 10 14:14:30 localai local-ai[151]: 2:14PM DBG GRPC(gemma-3-4b-it-qat-127.0.0.1:38149): stderr /localai/backends/cpu-llama-cpp/llama-cpp-avx512(+0x188d25)[0x7dc09c188d25]
Sep 10 14:14:30 localai local-ai[151]: 2:14PM DBG GRPC(gemma-3-4b-it-qat-127.0.0.1:38149): stderr /localai/backends/cpu-llama-cpp/llama-cpp-avx512(+0x12f6cd)[0x7dc09c12f6cd]
Sep 10 14:14:30 localai local-ai[151]: 2:14PM DBG GRPC(gemma-3-4b-it-qat-127.0.0.1:38149): stderr /localai/backends/cpu-llama-cpp/llama-cpp-avx512(+0x1042a0)[0x7dc09c1042a0]
Sep 10 14:14:30 localai local-ai[151]: 2:14PM DBG GRPC(gemma-3-4b-it-qat-127.0.0.1:38149): stderr /localai/backends/cpu-llama-cpp/lib/libc.so.6(+0x29d90)[0x7dc09b829d90]
Sep 10 14:14:30 localai local-ai[151]: 2:14PM DBG GRPC(gemma-3-4b-it-qat-127.0.0.1:38149): stderr /localai/backends/cpu-llama-cpp/lib/libc.so.6(__libc_start_main+0x80)[0x7dc09b829e40]
Sep 10 14:14:30 localai local-ai[151]: 2:14PM DBG GRPC(gemma-3-4b-it-qat-127.0.0.1:38149): stderr /localai/backends/cpu-llama-cpp/llama-cpp-avx512(+0x1147e5)[0x7dc09c1147e5]
Sep 10 14:14:30 localai local-ai[151]: 2:14PM ERR Server error error="rpc error: code = Unavailable desc = error reading from server: EOF" ip=192.168.1.5 latency=855.504235ms method=POST status=500 url=/v1/chat/completions

imkira avatar Sep 10 '25 16:09 imkira

Is there any progress on this by any chance? Currently the lack of prompt caching prevents keeping a longer conversation history in real-time apps and raises latency in general, particularly with tool usage, since the whole system prompt, including all tool/function definitions, has to be reprocessed on every request.

mgoltzsche avatar Dec 02 '25 02:12 mgoltzsche

I found a corresponding issue within the llama.cpp project: https://github.com/ggml-org/llama.cpp/issues/17118
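I haven't verified this myself, but until the upstream fix lands, a workaround sketch would be to simply disable prompt caching in the affected model's YAML, which should avoid the crash path entirely (at the cost of the latency benefits discussed above):

```yaml
# Workaround sketch: turn prompt caching off until the llama.cpp fix lands.
prompt_cache_all: false
# prompt_cache_path: cache   # leave unset / commented out
```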

mgoltzsche avatar Dec 11 '25 20:12 mgoltzsche