[Feature][v1]: Add metrics support
🚀 The feature, motivation and pitch
We should also reach feature parity on metrics, covering most of the available stats if possible. At a high level:
- [P0] Support system and request stats logging
- [P0] Support metric export to Prometheus
- [P1] Support or deprecate all metrics from V0
- [P1] Allow users to define their own Prometheus client and other arbitrary loggers (see the sketch after this list)
- [P2] Make it work with tracing too (there are some request-level stats that tracing needs, like queue time and TTFT). It should be possible to surface these request-level metrics in v1 too.
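For the [P1] pluggable-logger item, here is a rough sketch of the kind of thing a user-defined logger could do, written directly against `prometheus_client` rather than any actual vLLM interface; the hook name and arguments (`log_iteration`, `num_running`, `ttft`, the side port) are made up for illustration only.

```python
# Illustrative only: a stand-alone stats logger using prometheus_client,
# not vLLM's actual logger interface. Metric names mirror the ones discussed
# in this issue; the log_iteration() hook and its arguments are hypothetical.
import time
from typing import Optional

from prometheus_client import Counter, Gauge, Histogram, start_http_server

num_requests_running = Gauge(
    "vllm:num_requests_running", "Number of requests currently running")
prompt_tokens_total = Counter(
    "vllm:prompt_tokens_total", "Total number of prompt tokens processed")
time_to_first_token = Histogram(
    "vllm:time_to_first_token_seconds", "Time to first token in seconds")


def log_iteration(num_running: int, new_prompt_tokens: int,
                  ttft: Optional[float]) -> None:
    """Hypothetical per-iteration hook a custom logger might implement."""
    num_requests_running.set(num_running)
    prompt_tokens_total.inc(new_prompt_tokens)
    if ttft is not None:
        time_to_first_token.observe(ttft)


if __name__ == "__main__":
    start_http_server(9090)  # expose /metrics on a side port for scraping
    while True:
        log_iteration(num_running=1, new_prompt_tokens=16, ttft=0.025)
        time.sleep(1)
```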
Alternatives
No response
Additional context
No response
Before submitting a new issue...
- [X] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
Opening the issue to track and collaborate - in case someone else is already looking into this.
Prototype in https://github.com/vllm-project/vllm/pull/10651
I thought it was about time to update on the latest status of this and note some TODOs.
Status
The v1 engine frontend API server now has a Prometheus-compatible `/metrics` endpoint.
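For a quick sanity check, the endpoint can be scraped directly; the snippet below assumes the server is reachable at the default `http://localhost:8000` (adjust the URL for your deployment).

```python
# Quick check of the Prometheus endpoint exposed by the v1 API server.
# Assumes the server is reachable at http://localhost:8000 (adjust as needed).
import requests

resp = requests.get("http://localhost:8000/metrics", timeout=5)
resp.raise_for_status()

# Print just the vllm:* samples to keep the output readable.
for line in resp.text.splitlines():
    if line.startswith("vllm:"):
        print(line)
```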
The following PRs should merge soon:
- #12579
- #12592
- #12644
Which will mean we support the following metrics:
- `vllm:num_requests_running` (Gauge)
- `vllm:num_requests_waiting` (Gauge)
- `vllm:gpu_cache_usage_perc` (Gauge)
- `vllm:prompt_tokens_total` (Counter)
- `vllm:generation_tokens_total` (Counter)
- `vllm:request_success_total` (Counter)
- `vllm:request_prompt_tokens` (Histogram)
- `vllm:request_generation_tokens` (Histogram)
- `vllm:time_to_first_token_seconds` (Histogram)
- `vllm:time_per_output_token_seconds` (Histogram)
- `vllm:e2e_request_latency_seconds` (Histogram)
- `vllm:request_queue_time_seconds` (Histogram)
- `vllm:request_inference_time_seconds` (Histogram)
- `vllm:request_prefill_time_seconds` (Histogram)
- `vllm:request_decode_time_seconds` (Histogram)
Also, note that `vllm:gpu_prefix_cache_queries` and `vllm:gpu_prefix_cache_hits` (Counters) replace `vllm:gpu_prefix_cache_hit_rate` (Gauge) - see the example below.
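With the two counters, the hit rate can be derived at query time over any window (in Prometheus/Grafana this would typically be a ratio of `rate()` expressions). A rough sketch of deriving a lifetime rate from a scraped `/metrics` payload; the endpoint URL and the exact exposed sample names (with or without a `_total` suffix) are assumptions:

```python
# Rough sketch: compute a lifetime prefix-cache hit rate from the two counters
# that replace the old hit-rate gauge. The URL and exact exposition names
# (with or without a "_total" suffix) are assumptions.
import requests
from prometheus_client.parser import text_string_to_metric_families

text = requests.get("http://localhost:8000/metrics", timeout=5).text

queries = hits = 0.0
for family in text_string_to_metric_families(text):
    for sample in family.samples:
        if sample.name in ("vllm:gpu_prefix_cache_queries",
                           "vllm:gpu_prefix_cache_queries_total"):
            queries += sample.value  # sum across label sets (e.g. model_name)
        elif sample.name in ("vllm:gpu_prefix_cache_hits",
                             "vllm:gpu_prefix_cache_hits_total"):
            hits += sample.value

if queries:
    print(f"prefix cache hit rate so far: {hits / queries:.2%}")
else:
    print("no prefix cache queries recorded yet")
```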
These are most of the metrics used by the example Grafana dashboard, with the exception of:
- `vllm:num_requests_swapped` (Gauge)
- `vllm:cpu_cache_usage_perc` (Gauge)
- `vllm:request_max_num_generation_tokens` (Histogram)
Additionally, these are other metrics supported by v0, but not yet by v1:
- `vllm:num_preemptions_total` (Counter)
- `vllm:cache_config_info` (Gauge)
- `vllm:lora_requests_info` (Gauge)
- `vllm:cpu_prefix_cache_hit_rate` (Gauge)
- `vllm:tokens_total` (Counter)
- `vllm:iteration_tokens_total` (Histogram)
- `vllm:time_in_queue_requests` (Histogram)
- `vllm:model_forward_time_milliseconds` (Histogram)
- `vllm:model_execute_time_milliseconds` (Histogram)
- `vllm:request_params_n` (Histogram)
- `vllm:request_params_max_tokens` (Histogram)
- `vllm:spec_decode_draft_acceptance_rate` (Gauge)
- `vllm:spec_decode_efficiency` (Gauge)
- `vllm:spec_decode_num_accepted_tokens_total` (Counter)
- `vllm:spec_decode_num_draft_tokens_total` (Counter)
- `vllm:spec_decode_num_emitted_tokens_total` (Counter)
Next Steps
- [x] Merge #12579
- [x] Merge #12592
- [x] Merge #12644
- [x] Make sure `--disable-log-stats` is disabling everything it can
- [x] Go through the remaining v0 metrics (see design doc)
- [x] Merge #13288
- [x] Merge #13295
- [x] Merge #13299
- [x] Merge #13504
- [x] Update design doc with preemption diagrams from #13169
- [x] Merge #13169
- [x] Update Grafana dashboard to work with v1
As a bit of a status update, here's how the example Grafana dashboard currently looks with a serving benchmark run like this:
$ python3 ./benchmarks/benchmark_serving.py --model meta-llama/Llama-3.1-8B-Instruct --tokenizer meta-llama/Llama-3.1-8B-Instruct --endpoint /v1/completions --dataset-name sharegpt --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --request-rate 3.0 --num-prompts 200
...
============ Serving Benchmark Result ============
Successful requests: 200
Benchmark duration (s): 71.53
Total input tokens: 42659
Total generated tokens: 43516
Request throughput (req/s): 2.80
Output token throughput (tok/s): 608.37
Total Token throughput (tok/s): 1204.76
---------------Time to First Token----------------
Mean TTFT (ms): 24.47
Median TTFT (ms): 24.67
P99 TTFT (ms): 31.27
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 13.16
Median TPOT (ms): 13.20
P99 TPOT (ms): 13.64
---------------Inter-token Latency----------------
Mean ITL (ms): 13.14
Median ITL (ms): 13.16
P99 ITL (ms): 14.78
==================================================
What's nice about the above is that even though V1 does not have vllm:num_requests_swapped and vllm:cpu_cache_usage_perc (because V1 doesn't have swap-to-CPU preemption mode), it doesn't impact the user experience of the dashboard - i.e. they just don't show up in the Scheduler State and Cache Utilization panels 👍
Here's the latest on what's in V0 versus V1:
| In Both | In V0 Only | In V1 Only |
|---|---|---|
| vllm:cache_config_info | vllm:cpu_cache_usage_perc #14136 | vllm:gpu_prefix_cache_hits #12592 |
| vllm:e2e_request_latency_seconds | vllm:cpu_prefix_cache_hit_rate #14136 | vllm:gpu_prefix_cache_queries #12592 |
| vllm:generation_tokens_total | vllm:gpu_prefix_cache_hit_rate #14136 | |
| vllm:gpu_cache_usage_perc | vllm:model_execute_time_milliseconds #14135 | |
| vllm:iteration_tokens_total | vllm:model_forward_time_milliseconds #14135 | |
| vllm:lora_requests_info | vllm:num_requests_swapped #14136 | |
| vllm:num_preemptions_total | vllm:request_max_num_generation_tokens #14055 | |
| vllm:num_requests_running | vllm:request_params_max_tokens #14055 | |
| vllm:num_requests_waiting | vllm:request_params_n #14055 | |
| vllm:prompt_tokens_total | vllm:spec_decode_draft_acceptance_rate | |
| vllm:request_decode_time_seconds | vllm:spec_decode_efficiency | |
| vllm:request_generation_tokens | vllm:spec_decode_num_accepted_tokens_total | |
| vllm:request_inference_time_seconds | vllm:spec_decode_num_draft_tokens_total | |
| vllm:request_prefill_time_seconds | vllm:spec_decode_num_emitted_tokens_total | |
| vllm:request_prompt_tokens | vllm:time_in_queue_requests #14135 | |
| vllm:request_queue_time_seconds | ~~vllm:tokens_total #14134~~ | |
| vllm:request_success_total | | |
| vllm:time_per_output_token_seconds | | |
| vllm:time_to_first_token_seconds | | |
Next Steps
- [x] Merge #12745
- [x] Merge #14055
- [x] Merge #14134
- [x] Merge #14135
- [x] Merge #14136
- [x] Merge #14220
- [x] Add spec decoding metrics (where relevant) to #12193
- [ ] Make `arrival_time` (and all intervals calculated relative to it) use monotonic time to avoid being affected by system clock changes - a potential issue is that `arrival_time` is part of the public library API (see the sketch after this list)
- [ ] Review v1 for unused code - e.g. RequestStats in the engine core, arrival_time in EngineCoreRequest, and vllm.v1.stats.common
- [ ] Add Grafana dashboard notes to design doc (e.g. screenshots, V1 compatibility)
- [ ] Update design doc with discussion of `RequestMetrics` API
- [ ] Document parallel sampling implications on other metrics
- [ ] Wrap up LoRA discussion in #13303 and wrt #6275
- [ ] Document the DP design decision to add the `engine` label in https://github.com/vllm-project/vllm/pull/13923#issuecomment-2714438609
- [ ] Review the metrics proposed in #12726 for v1 inclusion
- [ ] Benchmark metrics collection overhead, and look for improvements
- [ ] Review metrics naming - e.g. use of colon, including units in name, use of '_total'
- [ ] Look into improving Histogram buckets - e.g. in some cases we probably don't have low enough valued buckets to cover the common range
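On the monotonic-time item above, the concern is roughly this: intervals computed from wall-clock timestamps can be skewed if the system clock steps (e.g. NTP adjustment). A minimal illustration, not vLLM code - the `RequestTimer` class and its fields are made up:

```python
# Minimal illustration of why interval metrics should use a monotonic clock:
# time.time() can jump backwards/forwards on clock adjustments, while
# time.monotonic() is guaranteed to be non-decreasing. Not vLLM code.
import time


class RequestTimer:
    """Hypothetical timer tracking intervals for one request."""

    def __init__(self) -> None:
        self.arrival = time.monotonic()   # safe basis for intervals
        self.arrival_wall = time.time()   # wall clock kept only for display/API

    def queue_time(self, scheduled_at: float) -> float:
        # scheduled_at must also come from time.monotonic()
        return scheduled_at - self.arrival


timer = RequestTimer()
time.sleep(0.01)
print(f"queue time: {timer.queue_time(time.monotonic()) * 1000:.1f} ms")
print(f"arrival (wall clock, for the public API): {timer.arrival_wall:.3f}")
```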
Hi, just wanted to check in to see if there is a plan to support per-request-level stats logging? For example: `{"request_1": {"ttft": 10, "e2e_latency": 200}}`.
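For reference, something close to this is already possible offline by reading the per-request metrics attached to each output. A rough sketch, assuming the v0-style `RequestOutput.metrics` fields (`arrival_time`, `first_token_time`, `finished_time`); these may not all be populated in v1 yet:

```python
# Rough sketch of building a per-request stats dict from RequestOutput.metrics.
# Field availability (arrival_time, first_token_time, finished_time) is an
# assumption based on the v0 RequestMetrics object and may differ in v1.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))

per_request_stats = {}
for out in outputs:
    m = out.metrics
    if m is None:
        continue  # e.g. metrics not populated by this engine version
    per_request_stats[out.request_id] = {
        "ttft": (m.first_token_time - m.arrival_time)
        if m.first_token_time else None,
        "e2e_latency": (m.finished_time - m.arrival_time)
        if m.finished_time else None,
    }
print(per_request_stats)
```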
This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!
This issue has been automatically closed due to inactivity. Please feel free to reopen if you feel it is still relevant. Thank you!