[v1][Metrics] Add design doc
Related to #10582. Some notes I had taken on the v0 metrics implementation, along with v1 design details.
👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.
Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
To run CI, PR reviewers can do one of these:
- Add `ready` label to the PR
- Enable auto-merge.
🚀
FYI, this use of colons in direct metric names is contrary to https://prometheus.io/docs/concepts/data_model/#metric-names-and-labels
I see that most follow the desideratum of ending with the units, but time_in_queue_requests does not.
While backward compatibility is good, maybe there should also be a plan to shift to more compliant metric names?
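To make the two points above concrete, here is a minimal sketch of a name check. It is not part of vLLM or of any Prometheus client library: `naming_issues`, `suggest_name`, and the suffix list are hypothetical helpers, and the `vllm:` prefix on the example name is an assumption about how the metric is exposed. It relies only on the Prometheus guidance that colons are reserved for recording rules and that names should end with their unit.

```python
# Hypothetical helper for illustration only; not a vLLM or Prometheus API.

# Assumption: a small, non-exhaustive set of conventional suffixes.
CONVENTIONAL_SUFFIXES = ("_seconds", "_bytes", "_total")


def naming_issues(name: str) -> list[str]:
    """Return the naming-convention problems found in a metric name."""
    issues = []
    if ":" in name:
        # Colons are reserved for user-defined recording rules.
        issues.append("contains a colon")
    if not name.endswith(CONVENTIONAL_SUFFIXES):
        issues.append("does not end with a conventional suffix")
    return issues


def suggest_name(name: str) -> str:
    """Replace colons with underscores; the unit suffix still needs a human decision."""
    return name.replace(":", "_")


# Example name assumed from the discussion above.
print(naming_issues("vllm:time_in_queue_requests"))
print(suggest_name("vllm:time_in_queue_requests"))
```

A check like this could run over the registered metric names in a test, so that any renamed metrics are compared against the convention before dashboards start depending on them.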
When the metrics were originally added, the contributor wasn't aware of the naming convention. When this was noticed, it was decided that we would leave them as is so that we wouldn't break any dashboards that people had set up.
I agree that this could be a good opportunity to switch to compliant names. I think using the correct convention would inspire confidence in users who are already familiar with tools like Prometheus.
I added a section on this, thanks :+1:
This is a really valuable document, thank you for putting the time and effort into creating it!
Thank you!
A couple of whitespace nits.
My comments about cross-referencing still stand. I'd do it myself, but I don't have GitHub permission to.
Really appreciate the help, thanks. I'd happily pull in commits from a branch in your repo.
You can cherry-pick this commit: https://github.com/hmellor/vllm/commit/1876f9cc5123f74239779d38005dc80f0c7552c3
It seems like there might be an actual issue with the docs build