vllm icon indicating copy to clipboard operation
vllm copied to clipboard

[v1][Metrics] Add design doc

Open markmc opened this issue 10 months ago • 8 comments

Related to #10582. Some notes I had taken on v0 metrics implementation, along with v1 design details.

markmc avatar Feb 04 '25 18:02 markmc

👋 Hi! Thank you for contributing to the vLLM project. Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI which starts running only a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on Buildkite UI (linked in the PR checks section) and unblock them. If you do not have permission to unblock, ping simon-mo or khluu to add you in our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add ready label to the PR
  • Enable auto-merge.

🚀

github-actions[bot] avatar Feb 04 '25 18:02 github-actions[bot]

FYI, this use of colons in direct metric names is contrary to https://prometheus.io/docs/concepts/data_model/#metric-names-and-labels

I see that most follow the desideratum of ending with the units, but time_in_queue_requests does not.

While backward compatibility is good, maybe there should also be a plan to shift to more compliant metric names?

MikeSpreitzer avatar Feb 10 '25 15:02 MikeSpreitzer

When the metrics were originally added the contributor wasn't aware of the naming convention. When this was noticed it was decided that we would leave them as is so that we wouldn't break any dashboards that people had set up.

I agree that this could be a good opportunity to switch to compliant names. I think using the correct convention would inspire confidence in users who are already familliar with tools like Prometheus.

hmellor avatar Feb 10 '25 16:02 hmellor

When the metrics were originally added the contributor wasn't aware of the naming convention. When this was noticed it was decided that we would leave them as is so that we wouldn't break any dashboards that people had set up.

I agree that this could be a good opportunity to switch to compliant names. I think using the correct convention would inspire confidence in users who are already familliar with tools like Prometheus.

I added a section on this, thanks :+1:

markmc avatar Feb 13 '25 15:02 markmc

This is a really valuable document, thank you for putting the time and effort into creating it!

Thank you!

A couple of whitespace nits.

My comments about cross-referencing still stand. I'd do it myself, but I don't have GitHub permission to.

Really appreciate the help, thanks. I'd happily pull in commits from a branch in your repo.

markmc avatar Feb 19 '25 17:02 markmc

You can cherrypick this commit https://github.com/hmellor/vllm/commit/1876f9cc5123f74239779d38005dc80f0c7552c3

hmellor avatar Feb 19 '25 19:02 hmellor

You can cherrypick this commit hmellor@1876f9c

Thanks @hmellor appreciate the help!

TIL MyST!

markmc avatar Feb 24 '25 11:02 markmc

It seems like there might be an actual issue with the docs build

robertgshaw2-redhat avatar Mar 03 '25 17:03 robertgshaw2-redhat