
Misc. bug: OpenAI API v1/responses llama-server

Open foxbg opened this issue 4 months ago • 17 comments

Name and Version

.\llama-server --version ... version: 5902 (4a4f4269) built with clang version 19.1.5 for x86_64-pc-windows-msvc

Operating systems

Windows

Which llama.cpp modules do you know to be affected?

llama-server

Command line

.\llama-server -m Llama-3.2-3B-Instruct-Q6_K_L.gguf -ngl 33 --port 8081 --host 0.0.0.0

Problem description & steps to reproduce

When the OpenAI-compatible API is used and the client calls v1/responses, I get a 404. Possibly not yet supported? ref: https://platform.openai.com/docs/api-reference/responses

First Bad Commit

Not sure

Relevant log output

Client
`POST "http://192.168.x.x:8081/v1/responses": 404 Not Found {"code":404,"message":"File Not Found","type":"not_found_error"`
Server

main: server is listening on http://0.0.0.0:8081 - starting the main loop
srv  update_slots: all slots are idle
srv  log_server_r: request: POST /v1/responses 192.168.x.x 404

foxbg avatar Jul 15 '25 21:07 foxbg

I don't believe that endpoint is available. If you go to server.cpp line 4834, you can see the registered endpoints. v1/responses is not one of them.
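
For reference, route registration with cpp-httplib (the HTTP library llama-server uses) looks roughly like the standalone sketch below; the handler body is just a placeholder, not a proposed implementation.

```cpp
#include "httplib.h"  // cpp-httplib, vendored by llama.cpp

int main() {
    httplib::Server svr;

    // A /v1/responses route would be registered next to the existing
    // /v1/chat/completions, /v1/completions, etc. handlers in server.cpp.
    svr.Post("/v1/responses", [](const httplib::Request & req, httplib::Response & res) {
        (void) req;  // the Responses API request body would be parsed here
        res.set_content(R"({"object":"response","status":"completed","output":[]})",
                        "application/json");
    });

    svr.listen("0.0.0.0", 8081);
    return 0;
}
```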

SomeOddCodeGuy avatar Jul 16 '25 03:07 SomeOddCodeGuy

Is there a plan to add this endpoint to improve OpenAI API compatibility?

foxbg avatar Jul 16 '25 18:07 foxbg

There is no plan atm, but we can create one. I haven't yet looked at what changes /v1/responses introduces, but unless there are major blockers we should consider supporting it.

ggerganov avatar Jul 17 '25 05:07 ggerganov

Since it is part of the API specification, tools will start to use it. In my case it was fabric (https://github.com/danielmiessler/Fabric/pull/1559). I've opened a discussion there too: https://github.com/danielmiessler/Fabric/issues/1625.

foxbg avatar Jul 17 '25 12:07 foxbg

Hi. I'm one of the maintainers of Fabric. For OpenAI-compatible vendors, I have a way to mark the ones that have the /responses API implemented.

I'll get clarity with @foxbg on his use case and provide a solution for users with similar issues.

ksylvan avatar Jul 17 '25 12:07 ksylvan

@ggerganov

Here is the description of /v1/responses: https://platform.openai.com/docs/api-reference/responses/create

morozover avatar Jul 28 '25 15:07 morozover

+1 👍 to support the implementation of this endpoint.

My current use case is fabric as well, and although there is a workaround, I would love it if llama-server were more OpenAI-compatible.

iansmathew avatar Aug 07 '25 12:08 iansmathew

The addition of OpenAI's new Responses API would be great.

+1 for this

dariomanda avatar Aug 09 '25 10:08 dariomanda

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions[bot] avatar Sep 24 '25 01:09 github-actions[bot]

I'd also be interested in this endpoint. The main benefit for me is not needing to worry about the template format.

https://github.com/mostlygeek/llama-swap/issues/266 was also closed there because this issue went stale.

Penagwin avatar Oct 08 '25 23:10 Penagwin

OpenRouter already has beta support for the responses endpoint. Can we reopen this issue, @ggerganov?

xcpky avatar Oct 23 '25 02:10 xcpky

This issue definitely has enough conversation to reopen. And remove stalebot; stalebot is always a mistake.

yggdrasil75 avatar Oct 30 '25 11:10 yggdrasil75

If somebody starts working on this, please post an update here.

ggerganov avatar Oct 30 '25 11:10 ggerganov

Key differences: https://platform.openai.com/docs/guides/migrate-to-responses

Things that need to be done:

1. Require explicit `store: false`. The OpenAI Responses API defaults `store` to `true`, but AFAIK llama.cpp does not keep any response state, so documentation and assertions are needed (see the request-mapping sketch after this list).

2. Look for `input` instead of `messages` in the request body https://platform.openai.com/docs/guides/migrate-to-responses#1-update-generation-endpoints https://github.com/ggml-org/llama.cpp/blob/1f5accb8d0056e6099cd5b772b1cb787dd590a13/tools/server/utils.hpp#L584-L591 The `instructions` field requires manual concatenation into the message list (see the request-mapping sketch after this list).

3. Change the structure of the non-streaming response https://platform.openai.com/docs/api-reference/responses/create https://github.com/ggml-org/llama.cpp/blob/1f5accb8d0056e6099cd5b772b1cb787dd590a13/tools/server/server.cpp#L924-L936

4. Change the structure of the streaming response https://platform.openai.com/docs/api-reference/responses-streaming https://github.com/ggml-org/llama.cpp/blob/1f5accb8d0056e6099cd5b772b1cb787dd590a13/tools/server/server.cpp#L956-L972 (server_task_result_cmpl_partial is not shown in that range)

The Responses API sends an `event` field with each SSE; see the example response below and the SSE framing sketch at the end of this comment.

Example response
curl https://api.openai.com/v1/responses \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-proj-" \
  -d '{
    "model": "gpt-4.1-mini",
    "input": [
          {"role": "system", "content": "You are a helpful assistant."},
          {"role": "user", "content": "Hello!"}
      ],
    "stream": true
  }'

event: response.created
data: {"type":"response.created","sequence_number":0,"response":{"id":"resp_0fda4fc62","object":"response","created_at":1762233198,"status":"in_progress","background":false,"error":null,"incomplete_details":null,"instructions":null,"max_output_tokens":null,"max_tool_calls":null,"model":"gpt-4.1-mini-2025-04-14","output":[],"parallel_tool_calls":true,"previous_response_id":null,"prompt_cache_key":null,"prompt_cache_retention":null,"reasoning":{"effort":null,"summary":null},"safety_identifier":null,"service_tier":"auto","store":true,"temperature":1.0,"text":{"format":{"type":"text"},"verbosity":"medium"},"tool_choice":"auto","tools":[],"top_logprobs":0,"top_p":1.0,"truncation":"disabled","usage":null,"user":null,"metadata":{}}}

event: response.in_progress
data: {"type":"response.in_progress","sequence_number":1,"response":{"id":"resp_0fda4fc62","object":"response","created_at":1762233198,"status":"in_progress","background":false,"error":null,"incomplete_details":null,"instructions":null,"max_output_tokens":null,"max_tool_calls":null,"model":"gpt-4.1-mini-2025-04-14","output":[],"parallel_tool_calls":true,"previous_response_id":null,"prompt_cache_key":null,"prompt_cache_retention":null,"reasoning":{"effort":null,"summary":null},"safety_identifier":null,"service_tier":"auto","store":true,"temperature":1.0,"text":{"format":{"type":"text"},"verbosity":"medium"},"tool_choice":"auto","tools":[],"top_logprobs":0,"top_p":1.0,"truncation":"disabled","usage":null,"user":null,"metadata":{}}}

event: response.output_item.added
data: {"type":"response.output_item.added","sequence_number":2,"output_index":0,"item":{"id":"msg_0fda4fc6245f45e","type":"message","status":"in_progress","content":[],"role":"assistant"}}

event: response.content_part.added
data: {"type":"response.content_part.added","sequence_number":3,"item_id":"msg_0fda4fc6245f45e","output_index":0,"content_index":0,"part":{"type":"output_text","annotations":[],"logprobs":[],"text":""}}

event: response.output_text.delta
data: {"type":"response.output_text.delta","sequence_number":4,"item_id":"msg_0fda4fc6245f45e","output_index":0,"content_index":0,"delta":"Hello","logprobs":[],"obfuscation":"LMJTZbI0Ak2"}

...

event: response.output_text.delta
data: {"type":"response.output_text.delta","sequence_number":12,"item_id":"msg_0fda4fc6245f45e","output_index":0,"content_index":0,"delta":"?","logprobs":[],"obfuscation":"lMwKYBlNAN06XCB"}

event: response.output_text.done
data: {"type":"response.output_text.done","sequence_number":13,"item_id":"msg_0fda4fc6245f45e","output_index":0,"content_index":0,"text":"Hello! How can I assist you today?","logprobs":[]}

event: response.content_part.done
data: {"type":"response.content_part.done","sequence_number":14,"item_id":"msg_0fda4fc6245f45e","output_index":0,"content_index":0,"part":{"type":"output_text","annotations":[],"logprobs":[],"text":"Hello! How can I assist you today?"}}

event: response.output_item.done
data: {"type":"response.output_item.done","sequence_number":15,"output_index":0,"item":{"id":"msg_0fda4fc6245f45e","type":"message","status":"completed","content":[{"type":"output_text","annotations":[],"logprobs":[],"text":"Hello! How can I assist you today?"}],"role":"assistant"}}

event: response.completed
data: {"type":"response.completed","sequence_number":16,"response":{"id":"resp_0fda4fc62","object":"response","created_at":1762233198,"status":"completed","background":false,"error":null,"incomplete_details":null,"instructions":null,"max_output_tokens":null,"max_tool_calls":null,"model":"gpt-4.1-mini-2025-04-14","output":[{"id":"msg_0fda4fc6245f45e","type":"message","status":"completed","content":[{"type":"output_text","annotations":[],"logprobs":[],"text":"Hello! How can I assist you today?"}],"role":"assistant"}],"parallel_tool_calls":true,"previous_response_id":null,"prompt_cache_key":null,"prompt_cache_retention":null,"reasoning":{"effort":null,"summary":null},"safety_identifier":null,"service_tier":"default","store":true,"temperature":1.0,"text":{"format":{"type":"text"},"verbosity":"medium"},"tool_choice":"auto","tools":[],"top_logprobs":0,"top_p":1.0,"truncation":"disabled","usage":{"input_tokens":19,"input_tokens_details":{"cached_tokens":0},"output_tokens":10,"output_tokens_details":{"reasoning_tokens":0},"total_tokens":29},"user":null,"metadata":{}}}

5. Change the way tools are parsed https://platform.openai.com/docs/guides/migrate-to-responses#5-update-function-definitions https://github.com/ggml-org/llama.cpp/blob/1f5accb8d0056e6099cd5b772b1cb787dd590a13/common/chat.cpp#L354-L377
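
To make items 1 and 2 concrete, here is a rough request-mapping sketch assuming nlohmann::json (which the server already uses); `responses_to_chat_body` is a hypothetical helper, not existing llama.cpp code.

```cpp
#include <nlohmann/json.hpp>
#include <stdexcept>

using json = nlohmann::json;

// Hypothetical helper: translate a /v1/responses request body into the
// /v1/chat/completions-style body the existing server code already understands.
static json responses_to_chat_body(const json & body) {
    // item 1: llama.cpp keeps no response state, so only store:false can be honored
    if (body.value("store", true)) {
        throw std::runtime_error("\"store\": false is required; responses are not persisted");
    }

    json messages = json::array();

    // "instructions" maps onto a leading system message
    if (body.contains("instructions") && !body.at("instructions").is_null()) {
        messages.push_back({{"role", "system"}, {"content", body.at("instructions")}});
    }

    // item 2: "input" is either a plain string or an array of role/content items;
    // structured content parts (e.g. input_text) are passed through unchanged here
    const json & input = body.at("input");
    if (input.is_string()) {
        messages.push_back({{"role", "user"}, {"content", input}});
    } else {
        for (const auto & item : input) {
            messages.push_back({{"role", item.at("role")}, {"content", item.at("content")}});
        }
    }

    json chat_body = body;
    chat_body.erase("input");
    chat_body.erase("instructions");
    chat_body.erase("store");
    chat_body["messages"] = messages;
    return chat_body;
}
```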

Haven't checked multimodal stuff.
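
To illustrate item 4, a sketch of the SSE framing for a single text delta, with field names taken from the capture above (the helper name is made up; the obfuscation padding field is omitted):

```cpp
#include <nlohmann/json.hpp>
#include <string>

using json = nlohmann::json;

// Hypothetical helper: wrap one partial-completion text delta in the
// "event: ...\ndata: ...\n\n" framing the Responses API uses for streaming.
static std::string format_responses_text_delta(const std::string & item_id,
                                               int sequence_number,
                                               const std::string & delta) {
    const json data = {
        {"type",            "response.output_text.delta"},
        {"sequence_number", sequence_number},
        {"item_id",         item_id},
        {"output_index",    0},
        {"content_index",   0},
        {"delta",           delta},
        {"logprobs",        json::array()},
    };
    return "event: response.output_text.delta\ndata: " + data.dump() + "\n\n";
}
```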

openingnow avatar Nov 04 '25 11:11 openingnow

VS Code Insiders supports custom LLMs with OpenAI connection types, and it uses the Responses API; therefore llama.cpp cannot be used with VS Code.

MuhammedKalkan avatar Nov 04 '25 19:11 MuhammedKalkan

I just found out that OpenAI has a responses API implementation in the official gpt-oss repository. It could potentially be used as a reference for llama.cpp: https://github.com/openai/gpt-oss/tree/main/gpt_oss/responses_api

This implementation supports multiple inference backends, including the Ollama API.

So I went ahead and adapted the ollama backend for llama-server using the raw /completions endpoint: https://github.com/openai/gpt-oss/pull/225

I've already tested it with Codex and it seems to work fine.

To run it: `python -m gpt_oss.responses_api.serve --checkpoint http://127.0.0.1:8080 --inference-backend llamacpp_server`, where "checkpoint" is the llama-server URL.

tarruda avatar Nov 06 '25 01:11 tarruda

I roughly implemented text completions.

https://github.com/openingnow/llama.cpp/commit/df53bfe2f173ae5c41ae0545c47ed93f75fc50c2

Before proceeding further, I need to decide whether the code for /v1/chat/completions and /v1/responses should be separated, i.e., keeping to_json_oaicompat_response separate from to_json_oaicompat_chat.

Since the structures of the two APIs are slightly different, managing all functions in one code path could be hard: a variable like is_response would be needed everywhere, even in common_chat_tools_parse_oaicompat in common/chat.cpp. However, it might not be that difficult, since llama.cpp doesn't support all the features that OpenAI's Responses API provides; it's hard for me to predict.
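
For what it's worth, a minimal sketch of the separated code paths, assuming nlohmann::json; the struct and to_json_oaicompat_response here are illustrative only, not the actual server_task_result types:

```cpp
#include <nlohmann/json.hpp>
#include <string>

using json = nlohmann::json;

// Illustrative only: one result type, two formatters, so neither schema needs
// an is_response flag threaded through shared code.
struct cmpl_result_sketch {
    std::string content;

    json to_json_oaicompat_chat() const {      // existing /v1/chat/completions shape
        json message = {{"role", "assistant"}, {"content", content}};
        json choice  = {{"index", 0}, {"message", message}, {"finish_reason", "stop"}};
        return {{"object", "chat.completion"}, {"choices", json::array({choice})}};
    }

    json to_json_oaicompat_response() const {  // /v1/responses shape
        json part = {{"type", "output_text"}, {"text", content}, {"annotations", json::array()}};
        json item = {{"type", "message"}, {"role", "assistant"}, {"status", "completed"},
                     {"content", json::array({part})}};
        return {{"object", "response"}, {"status", "completed"}, {"output", json::array({item})}};
    }
};
```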

Maybe we can freeze and deprecate /v1/chat/completions and drop support in the future (say, 6 months from now).

Any long-term plans? @ggerganov

openingnow avatar Nov 08 '25 15:11 openingnow