
[V1][Metrics] Handle preemptions


Part of #10582

Add a core engine PREEMPTED event.

Add the num_preemptions_total counter from v0.

Also, make preemptions reset the scheduled and first token timestamps resulting in:

  << queued timestamp >>
    [ queue interval ]
      |
      |	(possible preemptions)
      | << scheduled timestamp >>
      | << preempted timestamp >>
      | << scheduled timestamp >>
      | << new token timestamp (FIRST) >>
      | << preempted timestamp >>
      v
  << scheduled timestamp >>
    [ prefill interval ]
  << new token timestamp (FIRST) >>
    [ inter-token interval ]
  << new token timestamp >>
    [ decode interval (relative to most recent first token time) ]
    [ inference interval (relative to most recent scheduled time) ]
  << new token timestamp (FINISHED) >>
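
For illustration (not the actual vLLM V1 code), a rough sketch of how a PREEMPTED event, the num_preemptions_total counter, and the timestamp reset could fit together; the class, handler, and field names here are hypothetical:

    import time
    from dataclasses import dataclass, field

    # Hypothetical sketch only: a core engine PREEMPTED event bumps the
    # num_preemptions_total counter and clears the scheduled / first-token
    # timestamps so they are re-recorded the next time the request runs.

    @dataclass
    class RequestStats:
        queued_ts: float | None = None
        scheduled_ts: float | None = None
        first_token_ts: float | None = None

    @dataclass
    class MetricsState:
        num_preemptions_total: int = 0
        requests: dict[str, RequestStats] = field(default_factory=dict)

        def on_preempted(self, request_id: str) -> None:
            """Handle a PREEMPTED event for one request."""
            self.num_preemptions_total += 1
            stats = self.requests[request_id]
            stats.scheduled_ts = None      # reset; set again on re-schedule
            stats.first_token_ts = None    # reset; set again on next first token

        def on_scheduled(self, request_id: str) -> None:
            self.requests[request_id].scheduled_ts = time.monotonic()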

markmc avatar Feb 12 '25 17:02 markmc

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small but essential subset of CI tests to catch errors quickly. You can run additional CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

github-actions[bot] avatar Feb 12 '25 17:02 github-actions[bot]

The pre-commit failure is a yapf failure that doesn't happen for me locally:

  File "/home/runner/.cache/pre-commit/repom9gt4aao/py_env-python3.12/lib/python3.12/site-packages/yapf_third_party/_ylib2to3/pygram.py", line 39, in <module>
    pattern_grammar = driver.load_grammar(_PATTERN_GRAMMAR_FILE)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/runner/.cache/pre-commit/repom9gt4aao/py_env-python3.12/lib/python3.12/site-packages/yapf_third_party/_ylib2to3/pgen2/driver.py", line 248, in load_grammar
    g.load(gp)
  File "/home/runner/.cache/pre-commit/repom9gt4aao/py_env-python3.12/lib/python3.12/site-packages/yapf_third_party/_ylib2to3/pgen2/grammar.py", line 128, in load
    d = pickle.load(f)
        ^^^^^^^^^^^^^^
EOFError: Ran out of input

markmc avatar Feb 12 '25 17:02 markmc

So I think that it should look like this:

<< queued timestamp >>  # unique per request and frozen after being set
    [ queue interval ]
<< scheduled timestamp >> # unique per request and frozen after being set
    [ prefill interval ]
<< new token timestamp (FIRST) >> # unique per request and frozen after being set
    [ inter-token interval ]
<< new token timestamp >>
<< preempted timestamp >>
    | request is in the preempted queue
    | request is re-scheduled
    | recompute up to the current token
    [ inter-token interval ]
<< new token timestamp >>
    [ inter-token interval ]
<< new token timestamp >>
<< preempted timestamp >>
    | request is in the preempted queue
    | request is re-scheduled
    | recompute up to the current token
    [ inter-token interval ]
<< new token timestamp >>
    [ decode interval (relative to scheduled) ]  # all the time spent in the preempted + recompute state is allocated here
    [ inference interval (relative to scheduled) ]
<< new token timestamp (FINISHED) >>
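
For illustration, a minimal sketch of how the intervals in this timeline could be derived once scheduled_ts and first_token_ts are frozen after being set (the field names are assumptions, not the actual vLLM code):

    # Illustrative only: derive the intervals in the timeline above from
    # four per-request timestamps, with scheduled_ts and first_token_ts
    # frozen after they are first set.

    def compute_intervals(queued_ts: float,
                          scheduled_ts: float,
                          first_token_ts: float,
                          finished_ts: float) -> dict[str, float]:
        return {
            # Time spent waiting to be scheduled for the first time.
            "queue_interval": scheduled_ts - queued_ts,
            # First scheduling until the first generated token.
            "prefill_interval": first_token_ts - scheduled_ts,
            # Everything after the first token, including any time spent
            # preempted and recomputing, lands in the decode interval.
            "decode_interval": finished_ts - first_token_ts,
            # Total engine-side time for the request.
            "inference_interval": finished_ts - scheduled_ts,
        }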

robertgshaw2-redhat avatar Feb 20 '25 23:02 robertgshaw2-redhat

One other case we should make sure we cover: what happens if the request is preempted during prefill?

E.g. if we have a prompt of length 100k and we get preempted halfway through. In this case, the preemption/recomputation time should be allocated to the prefill phase and therefore to TTFT.

I'm not sure whether this can happen, or whether there is some invariant that we always have enough KV cache for the prompt to be processed. Worth asking Woosuk or Cody.

robertgshaw2-redhat avatar Feb 20 '25 23:02 robertgshaw2-redhat

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @markmc.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify[bot] avatar Feb 25 '25 08:02 mergify[bot]

I drafted some thoughts, and found it super difficult to clarify this in words or ASCII art so let's try this ... does this match your thinking?

[Diagrams: vLLM Interval Metrics - Frame 1, Frame 2, Frame 3]

markmc avatar Feb 25 '25 17:02 markmc

In the "preempted prefill" case, I had imagined the queued interval to be up until the final SCHEDULED event ... nothing useful happened with the request, its waiting to be prioritized for resources? Not a big deal, I guess - most important that TTFT ~= queued_interval + prefill_interval

If we're aligned on the diagrams above, I think the code change is simply to not reset scheduled_ts or first_token_ts once they've been set?
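
Roughly something like this, as a sketch (hypothetical field names, just to show the "frozen after being set" behaviour):

    from dataclasses import dataclass

    # Sketch only: record scheduled_ts and first_token_ts at most once,
    # so a preemption followed by re-scheduling cannot overwrite them.

    @dataclass
    class RequestStats:
        queued_ts: float
        scheduled_ts: float | None = None
        first_token_ts: float | None = None

    def on_scheduled(stats: RequestStats, now: float) -> None:
        if stats.scheduled_ts is None:     # frozen after being set
            stats.scheduled_ts = now

    def on_new_token(stats: RequestStats, now: float) -> None:
        if stats.first_token_ts is None:   # frozen after being set
            stats.first_token_ts = now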

markmc avatar Feb 25 '25 17:02 markmc

I am in alignment with the charts above! Thanks for drawing it out!

robertgshaw2-redhat avatar Feb 26 '25 13:02 robertgshaw2-redhat