Fail request if FSM fails to advance
Fix streaming requests hanging when structured output FSM fails to advance
Problem
When using structured outputs with the xgrammar backend, streaming requests would hang indefinitely if the FSM (finite state machine) failed to advance. This occurred when accept_tokens() in the xgrammar backend returned False: an error was logged, but the request was never terminated.
Diagnosis
The issue was in the scheduler's update_from_output() method. When processing new tokens for structured output requests, the code called accept_tokens() but ignored its return value:
request.structured_output_request.grammar.accept_tokens(req_id, new_token_ids)
When the xgrammar FSM encountered an invalid token sequence, it would:
- Log an error: "Failed to advance FSM for request %s for tokens %s. Please file an issue."
- Return False from accept_tokens()
- Leave the FSM in an invalid state
Since the scheduler didn't check the return value, it continued processing as if nothing was wrong, causing the streaming response to hang indefinitely without sending a completion signal.
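For context, the backend side behaves roughly like the sketch below (a paraphrase for illustration, not the actual backend_xgrammar.py source): accept_tokens() logs the error and returns False, leaving it entirely up to the caller to react.

import logging

logger = logging.getLogger("structured_output")

def accept_tokens(matcher, request_id, tokens) -> bool:
    # Walk each new token through the grammar matcher (a stand-in object here).
    for token in tokens:
        if not matcher.accept_token(token):  # FSM refuses to advance
            logger.error(
                "Failed to advance FSM for request %s for tokens %s. "
                "Please file an issue.", request_id, token)
            return False
    return True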
Solution
The fix checks the return value of accept_tokens() and properly terminates the request when it returns False:
if not request.structured_output_request.grammar.accept_tokens(req_id, new_token_ids):
    # Grammar FSM failed to advance - mark request as finished with error
    logger.error(
        "Structured output FSM failed to advance for request %s. "
        "Terminating request.", req_id)
    request.status = RequestStatus.FINISHED_ABORTED
    stopped = True
    self._free_request(request)
This ensures that:
- The request is marked as FINISHED_ABORTED
- Resources are properly freed
- The streaming response terminates with finish_reason: "abort"
- Clients receive a proper completion signal instead of hanging
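For reference, this is roughly what a streaming client should observe after the change (a sketch assuming a local vLLM OpenAI-compatible server; the model name and guided_json schema are placeholders):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

stream = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Return a JSON object."}],
    extra_body={"guided_json": {"type": "object"}},
    stream=True,
)

finish_reason = None
for chunk in stream:
    if chunk.choices and chunk.choices[0].finish_reason:
        finish_reason = chunk.choices[0].finish_reason

# A healthy request ends with "stop"; with this fix, a request whose FSM
# fails to advance ends with "abort" instead of never terminating.
print("finish_reason:", finish_reason)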
👋 Hi! Thank you for contributing to the vLLM project.
💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.
Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.
🚀
Could we add a test to this PR?
@cadedaniel for visibility https://vllm-dev.slack.com/archives/C07QQ8DAXMK/p1748388146836209
This happens in the case where "after a few hundred thousand requests are sent to the same instance".
For tests, I think we might be able to reproduce something by sending the same requests repeatedly? Might need some fine-tuning for this regression test.
I think one could mock the output of the model to be an invalid token wrt the grammar.
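Something along those lines might work. A rough sketch of such a unit test (handle_new_tokens below is a stand-in for the patched branch of Scheduler.update_from_output(), not the real vLLM code):

from enum import Enum, auto
from unittest.mock import MagicMock

class RequestStatus(Enum):
    RUNNING = auto()
    FINISHED_ABORTED = auto()

def handle_new_tokens(request, req_id, new_token_ids, free_request):
    # Mirror of the patched logic: abort the request if the FSM cannot advance.
    if not request.structured_output_request.grammar.accept_tokens(
            req_id, new_token_ids):
        request.status = RequestStatus.FINISHED_ABORTED
        free_request(request)
        return True  # stopped
    return False

def test_request_aborted_when_fsm_fails_to_advance():
    request = MagicMock()
    request.status = RequestStatus.RUNNING
    # Force the grammar FSM to report that it could not advance.
    request.structured_output_request.grammar.accept_tokens.return_value = False
    freed = []

    stopped = handle_new_tokens(request, "req-0", [123], freed.append)

    assert stopped
    assert request.status == RequestStatus.FINISHED_ABORTED
    assert freed == [request]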
I didn't understand why accept_tokens() would not accept a token allowed by the grammar, when the grammar itself decides the mask and makes sure the transition is valid. My understanding was that a valid token sampled with the bitmask is guaranteed to pass accept_tokens() without error.
PR related to the root cause of the problem that was occurring here: #19565
@russellb I think we should still have this orthogonal to #19565. If the FSM fails to advance, we should gracefully fail this request, wdyt?
I agree that this change is still an improvement
Friendly bump on this. I am also seeing the same log error, pointing to line 160.
What version? We fixed the underlying issue that led to this proposed change a while ago. If you see the FSM fail to advance, it indicates a problem we need to fix.
@russellb This is the error that I'm getting. This is definitely on v0.10.0; I need to check on the latest release. Inference is happening with the Instructor client and structured responses. The model is Llama 3.1 8B Instruct.
ERROR 08-16 21:59:58 [backend_xgrammar.py:160] Failed to advance FSM for request chatcmpl-42f20635-3867-4126-889f-f8075ce1155b for tokens 0. Please file an issue.
If you can come up with an easy way to reproduce it, please file an issue so we can look into it. Thanks!
This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @atbe.
https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork
Friendly reminder about this PR. I think it would be extremely helpful at least to fail gracefully ;)