[V1][Metrics] Handle preemptions
Part of #10582
Add a core engine PREEMPTED event.
Add the num_preemptions_total counter from v0.
Also, make preemptions reset the scheduled and first-token timestamps, resulting in:
<< queued timestamp >>
[ queue interval ]
|
| (possible preemptions)
| << scheduled timestamp >>
| << preempted timestamp >>
| << scheduled timestamp >>
| << new token timestamp (FIRST) >>
| << preempted timestamp >>
v
<< scheduled timestamp >>
[ prefill interval ]
<< new token timestamp (FIRST) >>
[ inter-token interval ]
<< new token timestamp >>
[ decode interval (relative to most recent first token time) ]
[ inference interval (relative to most recent scheduled time) ]
<< new token timestamp (FINISHED) >>
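For illustration, the counter piece could be wired up roughly as follows; the metric name mirrors the v0 counter, but the event enum and helper here are hypothetical names for this sketch, not the PR's actual code.

```python
# Rough sketch only: EngineCoreEventType values and record_preemption() are
# hypothetical names for illustration; the metric name mirrors the v0 counter.
import enum

from prometheus_client import Counter


class EngineCoreEventType(enum.IntEnum):
    QUEUED = 1
    SCHEDULED = 2
    NEW_TOKEN = 3
    PREEMPTED = 4  # new core engine event added by this PR


counter_num_preemptions = Counter(
    name="vllm:num_preemptions_total",
    documentation="Cumulative number of request preemptions by the scheduler.",
    labelnames=["model_name"])


def record_preemption(model_name: str) -> None:
    """Bump the counter whenever a PREEMPTED event is observed for a request."""
    counter_num_preemptions.labels(model_name=model_name).inc()
```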
👋 Hi! Thank you for contributing to the vLLM project.
💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.
Just a reminder: PRs will not trigger a full CI run by default. Instead, only fastcheck CI runs, covering a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.
Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.
🚀
The pre-commit failure is a yapf failure that doesn't happen for me locally:
File "/home/runner/.cache/pre-commit/repom9gt4aao/py_env-python3.12/lib/python3.12/site-packages/yapf_third_party/_ylib2to3/pygram.py", line 39, in <module>
pattern_grammar = driver.load_grammar(_PATTERN_GRAMMAR_FILE)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/runner/.cache/pre-commit/repom9gt4aao/py_env-python3.12/lib/python3.12/site-packages/yapf_third_party/_ylib2to3/pgen2/driver.py", line 248, in load_grammar
g.load(gp)
File "/home/runner/.cache/pre-commit/repom9gt4aao/py_env-python3.12/lib/python3.12/site-packages/yapf_third_party/_ylib2to3/pgen2/grammar.py", line 128, in load
d = pickle.load(f)
^^^^^^^^^^^^^^
EOFError: Ran out of input
So I think that it should look like this:
<< queued timestamp >> # unique per request and frozen after being set
[ queue interval ]
<< scheduled timestamp >> # unique per request and frozen after being set
[ prefill interval ]
<< new token timestamp (FIRST) >> # unique per request and frozen after being set
[ inter-token interval ]
<< new token timestamp >>
<< preempted timestamp >>
| request is in the preempted queue
| request is re-scheduled
| recompute up to the current token
[ inter-token interval ]
<< new token timestamp >>
[ inter-token interval ]
<< new token timestamp >>
<< preempted timestamp >>
| request is in the preempted queue
| request is re-scheduled
| recompute up to the current token
[ inter-token interval ]
<< new token timestamp >>
[ decode interval (relative to scheduled) ] # all the time spent in the preempted + recompute state is allocated here
[ inference interval (relative to scheduled) ]
<< new token timestamp (FINISHED) >>
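To make the bookkeeping above concrete, here is a minimal sketch of how the intervals could be derived once scheduled_ts and first_token_ts are frozen after being set; the RequestStats fields and helper names are assumptions for illustration, not the actual code in this PR.

```python
# Rough sketch: RequestStats and these helpers are hypothetical names used to
# illustrate the interval bookkeeping above, not the PR's actual code.
from dataclasses import dataclass
from typing import Optional


@dataclass
class RequestStats:
    queued_ts: float = 0.0
    scheduled_ts: Optional[float] = None    # frozen after the first SCHEDULED event
    first_token_ts: Optional[float] = None  # frozen after the first new token
    last_token_ts: float = 0.0


def on_new_token(stats: RequestStats, now: float) -> dict:
    intervals = {}
    if stats.first_token_ts is None:
        stats.first_token_ts = now
        intervals["queue"] = stats.scheduled_ts - stats.queued_ts
        intervals["prefill"] = now - stats.scheduled_ts
    else:
        # Time spent preempted + recomputing up to the current token lands here.
        intervals["inter_token"] = now - stats.last_token_ts
    stats.last_token_ts = now
    return intervals


def on_finished(stats: RequestStats, now: float) -> dict:
    return {
        # All preempted/recompute time ends up in decode and inference.
        "decode": now - stats.first_token_ts,
        "inference": now - stats.scheduled_ts,
    }
```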
One other case we should make sure we cover is --- what happens if the request is preempted during prefill?
E.g. if we have a prompt of length 100k and we get preempted halfway through, the preemption/recomputation time should be allocated to the prefill interval, and hence to TTFT.
I'm not sure whether this can actually happen, or whether there is some invariant that we always have enough KV cache for the prompt to be processed. Worth asking Woosuk or Cody.
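A quick worked example of that case, assuming the first SCHEDULED timestamp is the one that stays frozen (times are made up):

```python
# Hypothetical timeline in seconds; assumes a mid-prefill preemption can happen.
queued_ts = 0.0
scheduled_ts = 1.0       # first SCHEDULED event; frozen from here on
# preempted at t=5.0 roughly halfway through a 100k-token prompt,
# re-scheduled at t=8.0, prefill recomputed/resumed ...
first_token_ts = 15.0

queue_interval = scheduled_ts - queued_ts          # 1.0
prefill_interval = first_token_ts - scheduled_ts   # 14.0, includes preempted time
ttft = queue_interval + prefill_interval           # 15.0 == first_token_ts - queued_ts
```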
This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @markmc.
https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork
I drafted some thoughts, and found it super difficult to clarify this in words or ASCII art, so let's try this ... does this match your thinking?
In the "preempted prefill" case, I had imagined the queued interval to be up until the final SCHEDULED event ... nothing useful happened with the request, its waiting to be prioritized for resources? Not a big deal, I guess - most important that TTFT ~= queued_interval + prefill_interval
If we're aligned on the diagrams above, I think the code change is simply to not reset scheduled_ts or first_token_ts once they've been set?
I am in alignment with the charts above! Thanks for drawing it out!
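For reference, the "set once, never reset" change discussed above might look roughly like this; the class, attribute, and method names are assumptions, just to illustrate the guards:

```python
# Sketch only: class, attribute, and method names are assumptions, not the PR's code.
from typing import Optional


class RequestStatsUpdate:

    def __init__(self) -> None:
        self.scheduled_ts: Optional[float] = None
        self.first_token_ts: Optional[float] = None

    def record_scheduled(self, now: float) -> None:
        if self.scheduled_ts is None:
            self.scheduled_ts = now  # set on the first SCHEDULED event only

    def record_first_token(self, now: float) -> None:
        if self.first_token_ts is None:
            self.first_token_ts = now  # frozen once the first token arrives
```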