llama.cpp
Misc. bug: OpenAI API v1/responses llama-server
Name and Version
.\llama-server --version ... version: 5902 (4a4f4269) built with clang version 19.1.5 for x86_64-pc-windows-msvc
Operating systems
Windows
Which llama.cpp modules do you know to be affected?
llama-server
Command line
.\llama-server -m Llama-3.2-3B-Instruct-Q6_K_L.gguf -ngl 33 --port 8081 --host 0.0.0.0
Problem description & steps to reproduce
When the OpenAI-compatible API is used and the client calls v1/responses, I get a 404. Possibly not yet supported? ref: https://platform.openai.com/docs/api-reference/responses
First Bad Commit
Not sure
Relevant log output
Client
`POST "http://192.168.x.x:8081/v1/responses": 404 Not Found {"code":404,"message":"File Not Found","type":"not_found_error"`
Server
main: server is listening on http://0.0.0.0:8081 - starting the main loop
srv update_slots: all slots are idle
srv log_server_r: request: POST /v1/responses 192.168.x.x 404
I don't believe that endpoint is available. If you go to server.cpp line 4834, you can see the registered endpoints. v1/responses is not one of them.
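For illustration, here is a minimal sketch (not the actual llama.cpp code) of how a route is registered with cpp-httplib, which is what server.cpp builds on; a /v1/responses handler would be wired up the same way, and the real work is the body of that handler:

```cpp
// Sketch only: registering OpenAI-compatible routes with cpp-httplib.
// The handlers here are placeholders; translating the Responses API
// request/response schema is the actual work.
#include "httplib.h"

int main() {
    httplib::Server svr;

    // Existing endpoints are registered in a similar fashion in tools/server/server.cpp.
    svr.Post("/v1/chat/completions", [](const httplib::Request & req, httplib::Response & res) {
        res.set_content("{}", "application/json");
    });

    // A hypothetical /v1/responses route would be added next to the others.
    svr.Post("/v1/responses", [](const httplib::Request & req, httplib::Response & res) {
        res.set_content("{}", "application/json");
    });

    svr.listen("0.0.0.0", 8081);
}
```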
Is there a plan to add this endpoint to improve OpenAI API compatibility?
There is no plan atm, but we can create one. I haven't yet looked at what changes /v1/responses introduces, but unless there are some major blockers, we should consider supporting it.
Since it is part of the API specification, tools will start to use it. In my case it was Fabric (https://github.com/danielmiessler/Fabric/pull/1559). I've opened a discussion there too: https://github.com/danielmiessler/Fabric/issues/1625.
Hi. I'm one of the maintainers of Fabric. For OpenAI-compatible vendors, I have a way to mark the ones that have the /responses API implemented.
I'll get clarity with @foxbg on his use case and provide a solution for users with similar issues.
@ggerganov
Here is the description of /v1/responses:
https://platform.openai.com/docs/api-reference/responses/create
+1 👍 to support the implementation of this endpoint.
My current use case is using Fabric as well, and although there is a workaround, I would love it if llama-server were more OpenAI-compatible.
The addition of OpenAI's new Responses API would be great.
+1 for this
This issue was closed because it has been inactive for 14 days since being marked as stale.
I'd also be interested in this endpoint. The main benefit for me is not needing to worry about the template format.
https://github.com/mostlygeek/llama-swap/issues/266 was also closed there because this issue here went stale.
OpenRouter already has beta support for the responses endpoint. Can we reopen this issue, @ggerganov?
This issue definitely has enough conversation to reopen. And remove the stale bot; the stale bot is always a mistake.
If somebody starts working on this, please post an update here.
Key differences: https://platform.openai.com/docs/guides/migrate-to-responses
Things that need to be done:
1. Require explicit store: false
The OpenAI Responses API defaults store to true. AFAIK llama.cpp does not handle state, so documentation and assertions are needed.
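A minimal sketch of such an assertion, assuming the request body has already been parsed with nlohmann::json; the function name and the error text are illustrative only:

```cpp
// Sketch: reject stateful requests, since llama-server keeps no response store.
// `body` is the parsed request JSON; the error wording is illustrative.
#include <nlohmann/json.hpp>
#include <stdexcept>

using json = nlohmann::json;

void validate_store_field(const json & body) {
    // The OpenAI Responses API defaults `store` to true; without server-side
    // state we can only honor store == false.
    const bool store = body.value("store", true);
    if (store) {
        throw std::runtime_error(
            "\"store\": true is not supported; pass \"store\": false explicitly");
    }
}
```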
2. Look for input instead of messages in the request body
https://platform.openai.com/docs/guides/migrate-to-responses#1-update-generation-endpoints
https://github.com/ggml-org/llama.cpp/blob/1f5accb8d0056e6099cd5b772b1cb787dd590a13/tools/server/utils.hpp#L584-L591
The instructions field requires manual concatenation.
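A rough sketch of the mapping for the simple cases (a string input, or an array of role/content items), assuming instructions is folded in as a system message; the helper name is hypothetical, and other item types and multimodal content are ignored:

```cpp
// Sketch: normalize a Responses API request into chat-completions-style
// "messages". Only the simple cases are handled here.
#include <nlohmann/json.hpp>

using json = nlohmann::json;

json responses_input_to_messages(const json & body) {
    json messages = json::array();

    // "instructions" has no counterpart in the messages array and has to be
    // folded in manually (treated here as a system message).
    if (body.contains("instructions") && body.at("instructions").is_string()) {
        messages.push_back({{"role", "system"}, {"content", body.at("instructions")}});
    }

    const json & input = body.at("input");
    if (input.is_string()) {
        // A bare string is shorthand for a single user message.
        messages.push_back({{"role", "user"}, {"content", input}});
    } else if (input.is_array()) {
        for (const auto & item : input) {
            if (item.contains("role") && item.contains("content")) {
                messages.push_back({{"role", item.at("role")}, {"content", item.at("content")}});
            }
        }
    }
    return messages;
}
```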
3. Change the structure of the non-streaming response
https://platform.openai.com/docs/api-reference/responses/create
https://github.com/ggml-org/llama.cpp/blob/1f5accb8d0056e6099cd5b772b1cb787dd590a13/tools/server/server.cpp#L924-L936
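A hedged sketch of a minimal non-streaming envelope, using only a subset of the fields from the API reference; the ids are placeholders, not a proposed scheme:

```cpp
// Sketch: a minimal Responses API envelope for a non-streaming completion.
// A real implementation would also fill in tool calls, reasoning, etc.
#include <nlohmann/json.hpp>
#include <string>

using json = nlohmann::json;

json make_response_envelope(const std::string & model, const std::string & text,
                            int prompt_tokens, int completion_tokens, long created_at) {
    return json{
        {"id",         "resp_placeholder"},
        {"object",     "response"},
        {"created_at", created_at},
        {"status",     "completed"},
        {"model",      model},
        {"output", json::array({
            json{
                {"id",     "msg_placeholder"},
                {"type",   "message"},
                {"status", "completed"},
                {"role",   "assistant"},
                {"content", json::array({
                    json{{"type", "output_text"}, {"annotations", json::array()}, {"text", text}},
                })},
            },
        })},
        {"usage", json{
            {"input_tokens",  prompt_tokens},
            {"output_tokens", completion_tokens},
            {"total_tokens",  prompt_tokens + completion_tokens},
        }},
    };
}
```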
4. Change the structure of the streaming response
https://platform.openai.com/docs/api-reference/responses-streaming
https://github.com/ggml-org/llama.cpp/blob/1f5accb8d0056e6099cd5b772b1cb787dd590a13/tools/server/server.cpp#L956-L972
(server_task_result_cmpl_partial not mentioned)
The Responses API sends an event field in its SSEs; see the example response below.
Example response
curl https://api.openai.com/v1/responses \
-H "Content-Type: application/json" \
-H "Authorization: Bearer sk-proj-" \
-d '{
"model": "gpt-4.1-mini",
"input": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Hello!"}
],
"stream": true
}'
event: response.created
data: {"type":"response.created","sequence_number":0,"response":{"id":"resp_0fda4fc62","object":"response","created_at":1762233198,"status":"in_progress","background":false,"error":null,"incomplete_details":null,"instructions":null,"max_output_tokens":null,"max_tool_calls":null,"model":"gpt-4.1-mini-2025-04-14","output":[],"parallel_tool_calls":true,"previous_response_id":null,"prompt_cache_key":null,"prompt_cache_retention":null,"reasoning":{"effort":null,"summary":null},"safety_identifier":null,"service_tier":"auto","store":true,"temperature":1.0,"text":{"format":{"type":"text"},"verbosity":"medium"},"tool_choice":"auto","tools":[],"top_logprobs":0,"top_p":1.0,"truncation":"disabled","usage":null,"user":null,"metadata":{}}}
event: response.in_progress
data: {"type":"response.in_progress","sequence_number":1,"response":{"id":"resp_0fda4fc62","object":"response","created_at":1762233198,"status":"in_progress","background":false,"error":null,"incomplete_details":null,"instructions":null,"max_output_tokens":null,"max_tool_calls":null,"model":"gpt-4.1-mini-2025-04-14","output":[],"parallel_tool_calls":true,"previous_response_id":null,"prompt_cache_key":null,"prompt_cache_retention":null,"reasoning":{"effort":null,"summary":null},"safety_identifier":null,"service_tier":"auto","store":true,"temperature":1.0,"text":{"format":{"type":"text"},"verbosity":"medium"},"tool_choice":"auto","tools":[],"top_logprobs":0,"top_p":1.0,"truncation":"disabled","usage":null,"user":null,"metadata":{}}}
event: response.output_item.added
data: {"type":"response.output_item.added","sequence_number":2,"output_index":0,"item":{"id":"msg_0fda4fc6245f45e","type":"message","status":"in_progress","content":[],"role":"assistant"}}
event: response.content_part.added
data: {"type":"response.content_part.added","sequence_number":3,"item_id":"msg_0fda4fc6245f45e","output_index":0,"content_index":0,"part":{"type":"output_text","annotations":[],"logprobs":[],"text":""}}
event: response.output_text.delta
data: {"type":"response.output_text.delta","sequence_number":4,"item_id":"msg_0fda4fc6245f45e","output_index":0,"content_index":0,"delta":"Hello","logprobs":[],"obfuscation":"LMJTZbI0Ak2"}
...
event: response.output_text.delta
data: {"type":"response.output_text.delta","sequence_number":12,"item_id":"msg_0fda4fc6245f45e","output_index":0,"content_index":0,"delta":"?","logprobs":[],"obfuscation":"lMwKYBlNAN06XCB"}
event: response.output_text.done
data: {"type":"response.output_text.done","sequence_number":13,"item_id":"msg_0fda4fc6245f45e","output_index":0,"content_index":0,"text":"Hello! How can I assist you today?","logprobs":[]}
event: response.content_part.done
data: {"type":"response.content_part.done","sequence_number":14,"item_id":"msg_0fda4fc6245f45e","output_index":0,"content_index":0,"part":{"type":"output_text","annotations":[],"logprobs":[],"text":"Hello! How can I assist you today?"}}
event: response.output_item.done
data: {"type":"response.output_item.done","sequence_number":15,"output_index":0,"item":{"id":"msg_0fda4fc6245f45e","type":"message","status":"completed","content":[{"type":"output_text","annotations":[],"logprobs":[],"text":"Hello! How can I assist you today?"}],"role":"assistant"}}
event: response.completed
data: {"type":"response.completed","sequence_number":16,"response":{"id":"resp_0fda4fc62","object":"response","created_at":1762233198,"status":"completed","background":false,"error":null,"incomplete_details":null,"instructions":null,"max_output_tokens":null,"max_tool_calls":null,"model":"gpt-4.1-mini-2025-04-14","output":[{"id":"msg_0fda4fc6245f45e","type":"message","status":"completed","content":[{"type":"output_text","annotations":[],"logprobs":[],"text":"Hello! How can I assist you today?"}],"role":"assistant"}],"parallel_tool_calls":true,"previous_response_id":null,"prompt_cache_key":null,"prompt_cache_retention":null,"reasoning":{"effort":null,"summary":null},"safety_identifier":null,"service_tier":"default","store":true,"temperature":1.0,"text":{"format":{"type":"text"},"verbosity":"medium"},"tool_choice":"auto","tools":[],"top_logprobs":0,"top_p":1.0,"truncation":"disabled","usage":{"input_tokens":19,"input_tokens_details":{"cached_tokens":0},"output_tokens":10,"output_tokens_details":{"reasoning_tokens":0},"total_tokens":29},"user":null,"metadata":{}}}
5. Change the way tools are parsed
https://platform.openai.com/docs/guides/migrate-to-responses#5-update-function-definitions
https://github.com/ggml-org/llama.cpp/blob/1f5accb8d0056e6099cd5b772b1cb787dd590a13/common/chat.cpp#L354-L377
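The core difference for function tools is that the Responses API flattens name/description/parameters to the top level, while the existing parser expects them nested under a function object. A hypothetical conversion sketch:

```cpp
// Sketch: re-nest Responses-style flattened tool definitions into the
// chat-completions shape that the existing parser expects. Helper name is
// hypothetical; non-function tool types are passed through untouched.
#include <nlohmann/json.hpp>

using json = nlohmann::json;

json responses_tools_to_chat_tools(const json & tools) {
    json out = json::array();
    for (const auto & tool : tools) {
        if (tool.value("type", "") == "function" && tool.contains("name")) {
            out.push_back(json{
                {"type", "function"},
                {"function", json{
                    {"name",        tool.at("name")},
                    {"description", tool.value("description", "")},
                    {"parameters",  tool.value("parameters", json::object())},
                }},
            });
        } else {
            out.push_back(tool);
        }
    }
    return out;
}
```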
Haven't checked multimodal stuff.
VS Code Insiders supports custom LLMs with the OpenAI connection type, and it uses the Responses API, so llama.cpp cannot be used with VS Code.
I just found out that OpenAI has a responses API implementation in the official gpt-oss repository. It could potentially be used as a reference for llama.cpp: https://github.com/openai/gpt-oss/tree/main/gpt_oss/responses_api
This implementation supports multiple inference backends, including the Ollama API.
So I went ahead and adapted the Ollama backend for llama-server, using the raw /completions endpoint: https://github.com/openai/gpt-oss/pull/225
Already tested with Codex and it seems to work fine.
To run it: python -m gpt_oss.responses_api.serve --checkpoint http://127.0.0.1:8080 --inference-backend llamacpp_server, where "checkpoint" is the llama-server URL.
I roughly implemented text completions.
https://github.com/openingnow/llama.cpp/commit/df53bfe2f173ae5c41ae0545c47ed93f75fc50c2
I need to decide whether the code for /v1/chat/completions and /v1/responses should be separated before proceeding further.
i.e., keeping to_json_oaicompat_response separate from to_json_oaicompat_chat.
Since the structures of the two APIs are slightly different, managing all the functions in one code path can be hard. A variable like is_response would be needed everywhere, even in common_chat_tools_parse_oaicompat in common/chat.cpp. However, it might not be that difficult, since llama.cpp doesn't support all the features that OpenAI's Responses API provides; it is hard for me to predict.
Maybe we can freeze and deprecate /v1/chat/completions and drop support in the future (say, 6 months from now).
Any long-term plans? @ggerganov