[V1] Structured Outputs + Thinking compatibility
This PR brings thinking support to structured outputs in V1. Currently, if you want to use a thinking parser in conjunction with structured outputs, you have to use the V0 engine.
This is also compatible with speculative decoding.
This PR also refactors the tokenizer onto the structured_output_manager so that it can construct the reasoner.
I have also added tests to cover this case.
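For readers skimming the change, the core idea is that grammar enforcement is deferred until the reasoner signals that the thinking section has finished. The sketch below is illustrative only: class and method names such as ReasoningAwareGrammar are hypothetical and not the PR's actual code; it assumes a reasoner exposing an is_reasoning_end check.

# Hypothetical sketch only; names below are illustrative, not the PR's actual classes.
class ReasoningAwareGrammar:
    def __init__(self, grammar, reasoner):
        self.grammar = grammar      # e.g. an xgrammar-backed matcher
        self.reasoner = reasoner    # e.g. the deepseek_r1 reasoning parser
        self.reasoning_done = False

    def accept_tokens(self, request_id: str, token_ids: list[int]) -> bool:
        # While the model is still thinking, tokens pass through unconstrained.
        # (A real implementation would also constrain tokens that follow the
        # end-of-thinking marker within the same batch.)
        if not self.reasoning_done:
            self.reasoning_done = self.reasoner.is_reasoning_end(token_ids)
            return True
        # Once thinking ends, every token must satisfy the guided-decoding grammar.
        return self.grammar.accept_tokens(request_id, token_ids)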
Tested with the following:
# thinking + structured outputs
vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-32B --tensor-parallel-size 2 --max-model-len 32768 --guided-decoding-backend xgrammar --reasoning-parser deepseek_r1
# thinking + ngram + structured outputs
vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-32B --tensor-parallel-size 2 --max-model-len 32768 --guided-decoding-backend xgrammar --reasoning-parser deepseek_r1 --speculative-config '{"method": "ngram", "num_speculative_tokens": 5, "prompt_lookup_max": 5, "prompt_lookup_min": 1}'
Closes #14727
👋 Hi! Thank you for contributing to the vLLM project.
💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only fastcheck CI runs, covering a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.
Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.
🚀
cc @gaocegege might be interested
This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @aarnphm.
https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork
Hi @aarnphm, thanks for the PR and all the work on this!
I was testing the modifications using vllm serve and had a quick question.
I ran the server with:
vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-32B \
--enable-reasoning --reasoning-parser deepseek_r1 \
--tensor-parallel-size 4 \
--max-model-len 32768
And used the following client code:
from enum import Enum

from openai import OpenAI
from pydantic import BaseModel

# Assumes the server started above is reachable at the default local address.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

class CarType(str, Enum):
    sedan = "sedan"
    suv = "SUV"
    truck = "Truck"
    coupe = "Coupe"

class CarDescription(BaseModel):
    brand: str
    model: str
    car_type: CarType

json_schema = CarDescription.model_json_schema()

prompt = ("Generate a JSON with the brand, model, and car_type of "
          "the most iconic car from the 90's")

completion = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-32B",
    messages=[{
        "role": "user",
        "content": prompt,
    }],
    extra_body={"guided_json": json_schema},
)

print("reasoning_content:", completion.choices[0].message.reasoning_content)
print("content:", completion.choices[0].message.content)
On the first inference, everything works perfectly — I receive both the reasoning and the generated JSON output.
However, on subsequent inferences (even with different inputs), I notice that reasoning_content is an empty string and content is None.
Here’s what I observed:
- First inference:
  reasoning_content: Okay, so I need to generate a JSON that includes the brand, model, and car type of the most iconic car from the 90's. Hmm, where do I start? First, I should figure out which car is considered the most iconic from that decade. ... [omitting some text for brevity] ... Putting it all together, the JSON should include brand: "Nissan", model: "Skyline GT-R", and car_type: "Performance Coupe". That seems to cover the most iconic aspects of the car from the 90s.
  content: { "brand": "Nissan", "model": "Skyline GT-R", "car_type": "SUV" }
- Subsequent inferences (even with another input):
  reasoning_content: [empty string]
  content: None
Is there an additional step I might be missing to make multiple inferences work properly?
Or is this behavior expected for now while it’s still being developed?
Thanks again for your work — happy to help test or debug further if needed!
I was testing the modifications using vllm serve and had a quick question.
This should now be addressed. There was a bad fix somewhere earlier.
We need to use the tokenizer for the reasoning parser. But if you like, I can refactor out the tokenizer change first and then rebase this change on top of that PR.
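To illustrate why the tokenizer has to be available to the structured output manager, here is a rough, hypothetical sketch (not the PR's code): the reasoner needs the tokenizer to resolve the end-of-thinking token to an id before it can tell when reasoning has finished.

# Hypothetical sketch: the real reasoning parsers in vLLM differ, but the dependency
# is the same: the tokenizer is needed to look up the end-of-thinking token id.
class SimpleThinkEndReasoner:
    def __init__(self, tokenizer, end_token: str = "</think>"):
        self.end_token_id = tokenizer.convert_tokens_to_ids(end_token)

    def is_reasoning_end(self, token_ids: list[int]) -> bool:
        return self.end_token_id in token_ids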
Seems like the failure on the entrypoint tests is not related 😿