[V1] Structured Outputs + Thinking compatibility
This PR brings thinking support to structured outputs in V1. Currently, if you want to use a thinking parser in conjunction with structured outputs, you have to use the V0 engine.
This is also compatible with speculative decoding.
This PR also refactors the tokenizer onto the structured_output_manager so that it can construct the reasoner.
I have also added tests to cover this case.
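For readers skimming the change, the core idea is that grammar enforcement is deferred until the reasoner signals that the thinking section has finished. The sketch below is illustrative only: class and method names such as ReasoningAwareGrammar are hypothetical and not the PR's actual code; it assumes a reasoner exposing an is_reasoning_end check.

# Hypothetical sketch only; names below are illustrative, not the PR's actual classes.
class ReasoningAwareGrammar:
    def __init__(self, grammar, reasoner):
        self.grammar = grammar      # e.g. an xgrammar-backed matcher
        self.reasoner = reasoner    # e.g. the deepseek_r1 reasoning parser
        self.reasoning_done = False

    def accept_tokens(self, request_id: str, token_ids: list[int]) -> bool:
        # While the model is still thinking, tokens pass through unconstrained.
        # (A real implementation would also constrain tokens that follow the
        # end-of-thinking marker within the same batch.)
        if not self.reasoning_done:
            self.reasoning_done = self.reasoner.is_reasoning_end(token_ids)
            return True
        # Once thinking ends, every token must satisfy the guided-decoding grammar.
        return self.grammar.accept_tokens(request_id, token_ids)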
Tested with the following:
# thinking + structured outputs
vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-32B --tensor-parallel-size 2 --max-model-len 32768 --guided-decoding-backend xgrammar --reasoning-parser deepseek_r1
# thinking + ngram + structured outputs
vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-32B --tensor-parallel-size 2 --max-model-len 32768 --guided-decoding-backend xgrammar --reasoning-parser deepseek_r1 --speculative-config '{"method": "ngram", "num_speculative_tokens": 5, "prompt_lookup_max": 5, "prompt_lookup_min": 1}'
Closes #14727
👋 Hi! Thank you for contributing to the vLLM project.
💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only fastcheck CI runs, covering a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.
Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.
🚀
cc @gaocegege might be interested
This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @aarnphm.
https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork
Hi @aarnphm, thanks for the PR and all the work on this!
I was testing the modifications using vllm serve and had a quick question.
I ran the server with:
vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-32B \
--enable-reasoning --reasoning-parser deepseek_r1 \
--tensor-parallel-size 4 \
--max-model-len 32768
And used the following client code:
from enum import Enum

from openai import OpenAI
from pydantic import BaseModel

# Assumes the server started above is reachable at the default local address.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

class CarType(str, Enum):
    sedan = "sedan"
    suv = "SUV"
    truck = "Truck"
    coupe = "Coupe"

class CarDescription(BaseModel):
    brand: str
    model: str
    car_type: CarType

json_schema = CarDescription.model_json_schema()

prompt = ("Generate a JSON with the brand, model, and car_type of "
          "the most iconic car from the 90's")

completion = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-32B",
    messages=[{
        "role": "user",
        "content": prompt,
    }],
    extra_body={"guided_json": json_schema},
)

print("reasoning_content:", completion.choices[0].message.reasoning_content)
print("content:", completion.choices[0].message.content)
On the first inference, everything works perfectly — I receive both the reasoning and the generated JSON output.
However, on subsequent inferences (even with different inputs), I notice that reasoning_content is an empty string and content is None.
Here’s what I observed:
- First inference:
  reasoning_content: Okay, so I need to generate a JSON that includes the brand, model, and car type of the most iconic car from the 90's. Hmm, where do I start? First, I should figure out which car is considered the most iconic from that decade. ... [omitting some text for brevity] ... Putting it all together, the JSON should include brand: "Nissan", model: "Skyline GT-R", and car_type: "Performance Coupe". That seems to cover the most iconic aspects of the car from the 90s.
  content: { "brand": "Nissan", "model": "Skyline GT-R", "car_type": "SUV" }
- Subsequent inferences (even with another input):
  reasoning_content: [empty string]
  content: None
Is there an additional step I might be missing to make multiple inferences work properly?
Or is this behavior expected for now while it’s still being developed?
Thanks again for your work — happy to help test or debug further if needed!
I was testing the modifications using vllm serve and had a quick question.
This should now be addressed. There was a bad fix somewhere earlier.
We need to use the tokenizer for the reasoning parser. But if you like, I can refactor out the tokenizer change first and then rebase this change on top of that PR.
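To illustrate why the tokenizer has to be available to the structured output manager, here is a rough, hypothetical sketch (not the PR's code): the reasoner needs the tokenizer to resolve the end-of-thinking token to an id before it can tell when reasoning has finished.

# Hypothetical sketch: the real reasoning parsers in vLLM differ, but the dependency
# is the same: the tokenizer is needed to look up the end-of-thinking token id.
class SimpleThinkEndReasoner:
    def __init__(self, tokenizer, end_token: str = "</think>"):
        self.end_token_id = tokenizer.convert_tokens_to_ids(end_token)

    def is_reasoning_end(self, token_ids: list[int]) -> bool:
        return self.end_token_id in token_ids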
Seems like the failure on the entrypoint tests is not related 😿