Fail request if FSM fails to advance
Fix streaming requests hanging when structured output FSM fails to advance
Problem
When using structured outputs with the xgrammar backend, streaming requests would hang indefinitely if the FSM (finite state machine) failed to advance. This occurred when accept_tokens() in the xgrammar backend returned False: an error was logged, but the request was never terminated.
Diagnosis
The issue was in the scheduler's update_from_output() method. When processing new tokens for structured output requests, the code called accept_tokens() but ignored its return value:
request.structured_output_request.grammar.accept_tokens(req_id, new_token_ids)
When the xgrammar FSM encountered an invalid token sequence, it would:
- Log an error: "Failed to advance FSM for request %s for tokens %s. Please file an issue."
- Return False from accept_tokens()
- Leave the FSM in an invalid state
Since the scheduler didn't check the return value, it continued processing as if nothing was wrong, causing the streaming response to hang indefinitely without sending a completion signal.
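For context, the backend side behaves roughly like the sketch below (a paraphrase for illustration, not the actual backend_xgrammar.py source): accept_tokens() logs the error and returns False, leaving it entirely up to the caller to react.

import logging

logger = logging.getLogger("structured_output")

def accept_tokens(matcher, request_id, tokens) -> bool:
    # Walk each new token through the grammar matcher (a stand-in object here).
    for token in tokens:
        if not matcher.accept_token(token):  # FSM refuses to advance
            logger.error(
                "Failed to advance FSM for request %s for tokens %s. "
                "Please file an issue.", request_id, token)
            return False
    return True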
Solution
The fix checks the return value of accept_tokens() and properly terminates the request when it returns False:
if not request.structured_output_request.grammar.accept_tokens(req_id, new_token_ids):
    # Grammar FSM failed to advance - mark request as finished with error
    logger.error(
        "Structured output FSM failed to advance for request %s. "
        "Terminating request.", req_id)
    request.status = RequestStatus.FINISHED_ABORTED
    stopped = True
    self._free_request(request)
This ensures that:
- The request is marked as FINISHED_ABORTED
- Resources are properly freed
- The streaming response terminates with finish_reason: "abort"
- Clients receive a proper completion signal instead of hanging
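For reference, this is roughly what a streaming client should observe after the change (a sketch assuming a local vLLM OpenAI-compatible server; the model name and guided_json schema are placeholders):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

stream = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Return a JSON object."}],
    extra_body={"guided_json": {"type": "object"}},
    stream=True,
)

finish_reason = None
for chunk in stream:
    if chunk.choices and chunk.choices[0].finish_reason:
        finish_reason = chunk.choices[0].finish_reason

# A healthy request ends with "stop"; with this fix, a request whose FSM
# fails to advance ends with "abort" instead of never terminating.
print("finish_reason:", finish_reason)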
👋 Hi! Thank you for contributing to the vLLM project.
💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.
Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.
🚀
Could we add a test to this PR?
@cadedaniel for visibility https://vllm-dev.slack.com/archives/C07QQ8DAXMK/p1748388146836209
This happens in the case where "after a few hundred thousand requests are sent to the same instance".
For tests, I think we might be able to reproduce something by sending the same requests repeatedly? Might need some fine-tuning for this regression test.
I think one could mock the output of the model to be an invalid token wrt the grammar.
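Something along those lines might work. A rough sketch of such a unit test (handle_new_tokens below is a stand-in for the patched branch of Scheduler.update_from_output(), not the real vLLM code):

from enum import Enum, auto
from unittest.mock import MagicMock

class RequestStatus(Enum):
    RUNNING = auto()
    FINISHED_ABORTED = auto()

def handle_new_tokens(request, req_id, new_token_ids, free_request):
    # Mirror of the patched logic: abort the request if the FSM cannot advance.
    if not request.structured_output_request.grammar.accept_tokens(
            req_id, new_token_ids):
        request.status = RequestStatus.FINISHED_ABORTED
        free_request(request)
        return True  # stopped
    return False

def test_request_aborted_when_fsm_fails_to_advance():
    request = MagicMock()
    request.status = RequestStatus.RUNNING
    # Force the grammar FSM to report that it could not advance.
    request.structured_output_request.grammar.accept_tokens.return_value = False
    freed = []

    stopped = handle_new_tokens(request, "req-0", [123], freed.append)

    assert stopped
    assert request.status == RequestStatus.FINISHED_ABORTED
    assert freed == [request]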
I didn't understand why accept_tokens() would not accept a token allowed by the grammar, when the grammar itself decides the mask and makes sure the transition is valid. My understanding was that a valid token sampled with the bitmask is guaranteed to pass accept_tokens() without error.
PR related to the root cause of the problem that was occurring here: #19565
@russellb I think we should still have this orthogonal to #19565. If the FSM fails to advance, we should gracefully fail this request, wdyt?
I agree that this change is still an improvement
Friendly bump on this. I am also seeing the same log error, pointing to line 160.
What version? We fixed the underlying issue that led to this proposed change a while ago. If you see the FSM fail to advance, it indicates a problem we need to fix.
@russellb This is the error that I'm getting. This is definitely on v0.10.0; I need to check on the latest release. Inference is happening with the Instructor client and structured responses. The model is Llama 3.1 8B Instruct.
ERROR 08-16 21:59:58 [backend_xgrammar.py:160] Failed to advance FSM for request chatcmpl-42f20635-3867-4126-889f-f8075ce1155b for tokens 0. Please file an issue.
If you can come up with an easy way to reproduce it, please file an issue so we can look into it. Thanks!
This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @atbe.
https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork
Friendly reminder about this PR. I think it would be extremely helpful at least to fail gracefully ;)