
feat: add reasoning and reasoning_content fields to OpenAI message types

Open · cdoern opened this pull request 1 month ago

What does this PR do?

Add support for reasoning fields in OpenAI-compatible chat completion messages to enable compatibility with vLLM reasoning parsers.

Changes:

  • Add reasoning_content and reasoning fields to OpenAIAssistantMessageParam
  • Add reasoning field to OpenAIChoiceDelta (reasoning_content already existed)

Both field names are supported for maximum compatibility:

  • reasoning_content: Used by vLLM ≤ v0.8.4
  • reasoning: New field name in vLLM ≥ v0.9.x (based on release notes)

vLLM documentation recommends migrating to the shorter reasoning field name, but maintains backward compatibility with reasoning_content.
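For illustration, a minimal sketch of the shape of these fields on the two models (the field names and types come from this PR; the surrounding class bodies are simplified stand-ins, not the actual llama-stack definitions):

from typing import Optional
from pydantic import BaseModel

class OpenAIAssistantMessageParam(BaseModel):
    # Simplified stand-in for the real llama-stack model.
    role: str = "assistant"
    content: Optional[str] = None
    # Reasoning emitted by vLLM <= v0.8.4 reasoning parsers.
    reasoning_content: Optional[str] = None
    # Newer field name used by vLLM >= v0.9.x.
    reasoning: Optional[str] = None

class OpenAIChoiceDelta(BaseModel):
    # Simplified stand-in for the streaming delta model.
    content: Optional[str] = None
    reasoning_content: Optional[str] = None  # already existed
    reasoning: Optional[str] = None          # added here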

These fields allow reasoning models to return their chain-of-thought process alongside the final answer, which is crucial for transparency and debugging with reasoning models.

References:

  • vLLM Reasoning Outputs: https://docs.vllm.ai/en/stable/features/reasoning_outputs/
  • vLLM Issue #12468: https://github.com/vllm-project/vllm/issues/12468

Test Plan

vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B \
    --reasoning-parser deepseek_r1
  
llama stack run starter

curl http://localhost:8321/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "vllm/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
      "messages": [
        {"role": "user", "content": "What is 25 * 4?"}
      ]
    }'

{"id":"chatcmpl-9df9d2a5f849bbe0","choices":[{"finish_reason":"stop","index":0,"logprobs":null,"message":{"content":"\n\nTo calculate \\(25 \\times 4\\), follow these easy steps:\n\n1. **Understand the Multiplication:**\n   \n   \\(25 \\times 4\\) means you are adding the number 25 four times.\n   \n   \\[\n   25 + 25 + 25 + 25 = 100\n   \\]\n\n2. **Break Down the Multiplication:**\n   \n   - Multiply 25 by 2:\n     \\[\n     25 \\times 2 = 50\n     \\]\n   - Then multiply the result by 2:\n     \\[\n     50 \\times 2 = 100\n     \\]\n\n3. **Final Answer:**\n   \n   \\[\n   \\boxed{100}\n   \\]","refusal":null,"role":"assistant","annotations":null,"audio":null,"function_call":null,"tool_calls":null,"reasoning":"To solve 25 multiplied by 4, I start by recognizing that 25 is a quarter of 100. Multiplying 25 by 4 is the same as finding a quarter of 100 multiplied by 4, which equals 100.\n\nNext, I can consider that 25 multiplied by 4 is also equal to 25 multiplied by 2, which is 50, and then multiplied by 2 again, resulting in 100.\n\nAlternatively, I can use the distributive property by breaking down 4 into 3 and 1, so 25 multiplied by 3 is 75, and 25 multiplied by 1 is 25. Adding these together gives 100.\n\nBoth methods lead to the same result, confirming that 25 multiplied by 4 equals 100.\n","reasoning_content":"To solve 25 multiplied by 4, I start by recognizing that 25 is a quarter of 100. Multiplying 25 by 4 is the same as finding a quarter of 100 multiplied by 4, which equals 100.\n\nNext, I can consider that 25 multiplied by 4 is also equal to 25 multiplied by 2, which is 50, and then multiplied by 2 again, resulting in 100.\n\nAlternatively, I can use the distributive property by breaking down 4 into 3 and 1, so 25 multiplied by 3 is 75, and 25 multiplied by 1 is 25. Adding these together gives 100.\n\nBoth methods lead to the same result, confirming that 25 multiplied by 4 equals 100.\n"},"stop_reason":null,"token_ids":null}],"created":1764187386,"model":"vllm/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B","object":"chat.completion","service_tier":null,"system_fingerprint":null,"usage":{"completion_tokens":356,"prompt_tokens":14,"total_tokens":370,"completion_tokens_details":null,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null,"metrics":[{"trace_id":"9ed1630440cb1e923916455d98663df3","span_id":"a27b4cb4208ed39f","timestamp":"2025-11-26T20:03:19.089063Z","attributes":{"model_id":"vllm/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B","provider_id":"vllm"},"type":"metric","metric":"prompt_tokens","value":14,"unit":"tokens"},{"trace_id":"9ed1630440cb1e923916455d98663df3","span_id":"a27b4cb4208ed39f","timestamp":"2025-11-26T20:03:19.089072Z","attributes":{"model_id":"vllm/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B","provider_id":"vllm"},"type":"metric","metric":"completion_tokens","value":356,"unit":"tokens"},{"trace_id":"9ed1630440cb1e923916455d98663df3","span_id":"a27b4cb4208ed39f","timestamp":"2025-11-26T20:03:19.089075Z","attributes":{"model_id":"vllm/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B","provider_id":"vllm"},"type":"metric","metric":"total_tokens","value":370,"unit":"tokens"}]}%

cdoern · Nov 26 '25 19:11

this one can wait until CI is back, want to make sure this doesn't break engines which don't support the field.

cdoern · Nov 26 '25 20:11

✱ Stainless preview builds

This PR will update the llama-stack-client SDKs with the following commit message.

feat: add reasoning and reasoning_content fields to OpenAI message types

Edit this comment to update it. It will appear in the SDK's changelogs.

⚠️ llama-stack-client-node studio · code · diff

There was a regression in your SDK. generate ⚠️ · build ✅ · lint ✅ · test ✅

npm install https://pkg.stainless.com/s/llama-stack-client-node/70bd51e76d5f76cbc45e6601ae9bca4c451e5ea0/dist.tar.gz
New diagnostics (5 warning)
⚠️ Python/DuplicateDeclaration: We generated two separated types under the same name: `InputOpenAIResponseMessageOutput`. If they are the referring to the same type, they should be extracted to the same ref and be declared as a model. Otherwise, they should be renamed with `x-stainless-naming`
⚠️ Python/DuplicateDeclaration: We generated two separated types under the same name: `InputListOpenAIResponseMessageUnionOpenAIResponseInputFunctionToolCallOutputOpenAIResponseMessageInput`. If they are the referring to the same type, they should be extracted to the same ref and be declared as a model. Otherwise, they should be renamed with `x-stainless-naming`
⚠️ Python/DuplicateDeclaration: We generated two separated types under the same name: `DataOpenAIResponseMessageOutput`. If they are the referring to the same type, they should be extracted to the same ref and be declared as a model. Otherwise, they should be renamed with `x-stainless-naming`
⚠️ Python/NameNotAllowed: Encountered response property `model_type` which may conflict with Pydantic properties.

Pydantic uses model_ as a protected namespace that shouldn't be used for attributes of our own API's models. Please rename it using the 'renameValue' transform.

⚠️ Python/NameNotAllowed: Encountered response property `model_type` which may conflict with Pydantic properties.

Pydantic uses model_ as a protected namespace that shouldn't be used for attributes of our own API's models. Please rename it using the 'renameValue' transform.
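For background on the model_type warnings above (this is general Pydantic behavior, not Stainless-specific advice): Pydantic v2 reserves the model_ prefix for its own methods, so a field with that prefix triggers a warning unless the namespace is relaxed or the attribute is renamed behind an alias. A minimal sketch:

from pydantic import BaseModel, ConfigDict, Field

class Conflicting(BaseModel):
    # Triggers Pydantic's protected-namespace warning for "model_".
    model_type: str

class Relaxed(BaseModel):
    # One option: opt out of the protected namespace for this model.
    model_config = ConfigDict(protected_namespaces=())
    model_type: str

class Renamed(BaseModel):
    # Another option: rename the attribute and keep the wire name via an alias.
    type_: str = Field(alias="model_type")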

⚠️ llama-stack-client-kotlin studio · code · diff

There was a regression in your SDK. generate ⚠️ · lint ❗ (prev: lint ✅) → test ❗

llama-stack-client-go studio · code · diff

Your SDK built successfully. generate ⚠️ · lint ❗ · test ❗

go get github.com/stainless-sdks/llama-stack-client-go@64d5dda5cfc86540ad8fdc7fcb937cbe9f93e9ce
llama-stack-client-python studio · code · diff

generate ⚠️ · build ⏳ · lint ⏳ · test ⏳

⏳ These are partial results; builds are still running.


This comment is auto-generated by GitHub Actions and is automatically kept up to date as you push.
Last updated: 2025-12-01 21:24:47 UTC

github-actions[bot] · Dec 01 '25 19:12

@mattf:

> vllm 0.9.0 was released in may 2025, let's encourage users to upgrade to it and add backward compat on request

this is fair, so does that mean I should only do reasoning here if we do this?

> we should encourage users who want separated reasoning details to use /v1/responses w/ reasoning={"summary": ...}

I think that is also fair, but I also think that not having parity with what vLLM (or any inference engine) supports for v1/chat/completions is a reason for people to not use llama stack, why would someone use LLS if they lose functionality as compared to using vLLM directly? same goes for the gemini thought_signature, etc. This to me seems like a gap we need to close for usability.

Perhaps there is a cleaner way to do this as I suggested above with some custom message classes per-provider? I looked into this because @bbrowning actually caught it when running some tests in vLLM from Llama stack. Ben do you have any opinions here on the importance of this?

cdoern · Dec 02 '25 15:12

> we should encourage users who want separated reasoning details to use /v1/responses w/ reasoning={"summary": ...}
>
> I think that is also fair, but I also think that not having parity with what vLLM (or any inference engine) supports for v1/chat/completions is a reason for people to not use llama stack, why would someone use LLS if they lose functionality as compared to using vLLM directly? same goes for the gemini thought_signature, etc. This to me seems like a gap we need to close for usability.

we allow users to pass in params that we simply forward to the backend inference provider. we could formally do the same for output. at least the openai-python sdk will let you do this.

$ nc -l 8000 <<EOF
HTTP/1.1 200 OK
Content-Type: application/json

{
  "id": "fake-123",
  "object": "chat.completion",
  "model": "ignored",
  "choices": [{"index": 0, "message": {"role": "assistant", "content": "Hello from the fake model!"}, "finish_reason": "stop"}],
  "extra_info": {"foo": "bar", "latency_ms": 12}
}
EOF
$ uv run --with openai python -c 'from openai import OpenAI; response = OpenAI(base_url="http://127.0.0.1:8000", api_key="dummy").chat.completions.create(model="ignored", messages=[{"role": "user", "content": "Hello?"}]); print(response.extra_info)'
{'foo': 'bar', 'latency_ms': 12}
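(A sketch of the request direction, for completeness: openai-python's extra_body merges arbitrary fields into the outgoing JSON, which could be forwarded untouched to the backend. The chat_template_kwargs field below is just an illustrative backend-specific knob, not something this PR defines.)

from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8000", api_key="dummy")
resp = client.chat.completions.create(
    model="ignored",
    messages=[{"role": "user", "content": "Hello?"}],
    # extra_body fields are merged into the request JSON as-is.
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)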

mattf · Dec 02 '25 15:12

> we should encourage users who want separated reasoning details to use /v1/responses w/ reasoning={"summary": ...}
>
> I think that is also fair, but I also think that not having parity with what vLLM (or any inference engine) supports for v1/chat/completions is a reason for people to not use llama stack, why would someone use LLS if they lose functionality as compared to using vLLM directly? same goes for the gemini thought_signature, etc. This to me seems like a gap we need to close for usability.
>
> Perhaps there is a cleaner way to do this as I suggested above with some custom message classes per-provider? I looked into this because @bbrowning actually caught it when running some tests in vLLM from Llama stack. Ben do you have any opinions here on the importance of this?

Here are two principles I believe are important for widespread Llama Stack adoption in the inference space:

  1. Running a benchmark, application, or agent directly against an inference provider compared to running that same benchmark/app/agent against Llama Stack proxying to that backend inference provider should not result in a loss of model accuracy for the use-case in question. To put it another way, there should not be an accuracy penalty to using Llama Stack compared to using the inference provider directly without Llama Stack.

  2. I also believe that, at least for OpenAI-compatible inference APIs (Completions, Chat Completions, Responses), clients should be able to talk directly to the backend inference provider or to Llama Stack proxying that inference provider without changes. I shouldn't have to change my query parameters or get different fields returned in the response, for example.

It's out of these principles that I support Llama Stack being able to accept any Chat Completions input that a backend inference provider can handle and return any response fields the backend inference provider can return. And, specifically in this case, that means getting reasoning content in responses and being able to send that reasoning content back in subsequent requests to Chat Completions, as is required to get maximal multi-turn accuracy with reasoning models.
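As a rough sketch of that multi-turn flow (not the exact API shape, and assuming the backend accepts the reasoning field on assistant turns), the assistant message is sent back with its reasoning so the next request keeps the chain of thought:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8321/v1", api_key="dummy")
model = "vllm/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"

history = [{"role": "user", "content": "What is 25 * 4?"}]
first = client.chat.completions.create(model=model, messages=history)
msg = first.choices[0].message

# Echo the assistant turn back, including its reasoning, before the next question.
history.append({
    "role": "assistant",
    "content": msg.content,
    "reasoning": getattr(msg, "reasoning", None),
})
history.append({"role": "user", "content": "Now divide that result by 5."})
second = client.chat.completions.create(model=model, messages=history)
print(second.choices[0].message.content)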

Asking users to instead rewrite their app/agent to use Responses violates principle 2 above, so I do not think that should be our answer.

bbrowning · Dec 02 '25 15:12

> we should encourage users who want separated reasoning details to use /v1/responses w/ reasoning={"summary": ...}
>
> I think that is also fair, but I also think that not having parity with what vLLM (or any inference engine) supports for v1/chat/completions is a reason for people to not use llama stack, why would someone use LLS if they lose functionality as compared to using vLLM directly? same goes for the gemini thought_signature, etc. This to me seems like a gap we need to close for usability. Perhaps there is a cleaner way to do this as I suggested above with some custom message classes per-provider? I looked into this because @bbrowning actually caught it when running some tests in vLLM from Llama stack. Ben do you have any opinions here on the importance of this?
>
> Here are two principles I believe are important for widespread Llama Stack adoption in the inference space:
>
> 1. Running a benchmark, application, or agent directly against an inference provider compared to running that same benchmark/app/agent against Llama Stack proxying to that backend inference provider should not result in a loss of model accuracy for the use-case in question. To put it another way, there should not be an accuracy penalty to using Llama Stack compared to using the inference provider directly without Llama Stack.
>
> 2. I also believe that, at least for OpenAI-compatible inference APIs (Completions, Chat Completions, Responses), clients should be able to talk directly to the backend inference provider or to Llama Stack proxying that inference provider without changes. I shouldn't have to change my query parameters or get different fields returned in the response, for example.
>
> It's out of these principles that I support Llama Stack being able to accept any Chat Completions input that a backend inference provider can handle and return any response fields the backend inference provider can return. And, specifically in this case, that means getting reasoning content in responses and being able to send that reasoning content back in subsequent requests to Chat Completions, as is required to get maximal multi-turn accuracy with reasoning models.
>
> Asking users to instead rewrite their app/agent to use Responses violates principle 2 above, so I do not think that should be our answer.

those are great principles. (2) especially makes sense for compliant APIs.

we need to be careful in applying (2). in this case, vllm has a proprietary extension to the api. if we codify a vllm-specific extension in our api, then a user using another backend w/ a different proprietary extension for a similar feature would have to re-write when switching to llama stack.

mattf · Dec 02 '25 15:12

> we need to be careful in applying (2). in this case, vllm has a proprietary extension to the api

While this is true, I don't think people would be broken by these optional extensions. For example, if ollama doesn't support reasoning, a user does not need to pass it in or receive the output, but can still use this OpenAI message type.

The more I talk about this, though, maybe the better solution here is to not have these at the top level of our inference API and instead do some sort of provider-specific inherited version of these types? I can think through what that'd look like if folks would prefer, @mattf @bbrowning?
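One hypothetical shape of that idea, purely for discussion (class names invented here, not proposed in this PR):

from typing import Optional
from pydantic import BaseModel

class OpenAIAssistantMessageParam(BaseModel):
    # Core, provider-neutral message type.
    role: str = "assistant"
    content: Optional[str] = None

class VLLMAssistantMessageParam(OpenAIAssistantMessageParam):
    # Provider-specific extension kept out of the core API surface.
    reasoning: Optional[str] = None
    reasoning_content: Optional[str] = None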

cdoern · Dec 02 '25 16:12

> those are great principles. (2) especially makes sense for compliant APIs.
>
> we need to be careful in applying (2). in this case, vllm has a proprietary extension to the api. if we codify a vllm-specific extension in our api, then a user using another backend w/ a different proprietary extension for a similar feature would have to re-write when switching to llama stack.

This is a good point, and I'd generally agree that being able to pass through additional parameters from clients to backend inference and from backend inference to clients does not mean that we have to expose those additional parameters as first-class parts of our own API surface. This would lead to us having to expose an API that is a superset of every possible backend provider's special parameters.

However, it does mean that our Pydantic validation has to be loose enough to accept things like arbitrary (or provider-specific, as Charlie suggests) extra fields on messages in Chat Completion requests and pass back arbitrary extra fields in Chat Completion responses.
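A minimal sketch of what "loose enough" could mean in Pydantic terms (illustrative only; the real llama-stack models may differ):

from typing import Optional
from pydantic import BaseModel, ConfigDict

class OpenAIAssistantMessageParam(BaseModel):
    # extra="allow" keeps unknown provider-specific fields (reasoning,
    # thought_signature, ...) instead of rejecting or silently dropping them.
    model_config = ConfigDict(extra="allow")
    role: str = "assistant"
    content: Optional[str] = None

msg = OpenAIAssistantMessageParam(content="hi", reasoning="chain of thought")
print(msg.model_extra)   # {'reasoning': 'chain of thought'}
print(msg.model_dump())  # extras survive serialization back out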

bbrowning · Dec 02 '25 18:12

> those are great principles. (2) especially makes sense for compliant APIs. we need to be careful in applying (2). in this case, vllm has a proprietary extension to the api. if we codify a vllm-specific extension in our api, then a user using another backend w/ a different proprietary extension for a similar feature would have to re-write when switching to llama stack.
>
> This is a good point, and I'd generally agree that being able to pass through additional parameters from clients to backend inference and from backend inference to clients does not mean that we have to expose those additional parameters as first-class parts of our own API surface. This would lead to us having to expose an API that is a superset of every possible backend provider's special parameters.
>
> However, it does mean that our Pydantic validation has to be loose enough to accept things like arbitrary (or provider-specific, as Charlie suggests) extra fields on messages in Chat Completion requests and pass back arbitrary extra fields in Chat Completion responses.

spot on. the unvalidated i/o path opens the user to more risk by helping tie apps to specific stack configurations.

i'm -0 on this. if someone is going to do it, please take the unvalidated path so we avoid codifying provider-specific implementation details in the public api.

hopefully the user is ok moving to /v1/responses.

mattf · Dec 03 '25 12:12

Summarizing this discussion so far:

  1. we should not update our public API to add reasoning details in chat completions
  2. while still maintaining (1), we should allow for the transport of provider-specific fields (vLLM and Gemini being specific examples) in both directions.

ashwinb · Dec 03 '25 18:12