
[Feature][v1]: Add metrics support

rickyyx opened this issue 1 year ago · 7 comments

🚀 The feature, motivation and pitch

We should also reach feature parity on metrics, covering most of the available stats where possible. At a high level:

  1. [P0] Support system and request stats logging
  2. [P0] Support metric export to Prometheus.
  3. [P1] Support or deprecate all metrics from V0
  4. [P1] Allow users to define their own Prometheus client and other arbitrary loggers (see the sketch after this list).
  5. [P2] Make it work with tracing too (there are some request-level stats that tracing needs, such as queue time and TTFT). It should be possible to surface these request-level metrics in v1 as well.
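
For item 4, here is a minimal sketch of what a pluggable stats logger could look like; the StatLoggerBase, SchedulerStats, and record() names are illustrative assumptions for this issue, not the actual vLLM v1 API:

```python
# Hypothetical sketch of a pluggable stats logger; the names below are
# assumptions for illustration, not the actual vLLM v1 API.
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class SchedulerStats:
    """Illustrative snapshot of system-level stats."""
    num_running_reqs: int = 0
    num_waiting_reqs: int = 0
    gpu_cache_usage: float = 0.0


class StatLoggerBase(ABC):
    """Interface a user-supplied logger would implement."""

    @abstractmethod
    def record(self, stats: SchedulerStats) -> None:
        ...


class PrintingStatLogger(StatLoggerBase):
    """Example of an 'arbitrary logger': just prints each stats snapshot."""

    def record(self, stats: SchedulerStats) -> None:
        print(f"running={stats.num_running_reqs} "
              f"waiting={stats.num_waiting_reqs} "
              f"gpu_cache={stats.gpu_cache_usage:.1%}")


PrintingStatLogger().record(SchedulerStats(2, 5, 0.42))
```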

Alternatives

No response

Additional context

No response

Before submitting a new issue...

  • [X] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

rickyyx · Nov 22 '24

Opening the issue to track and collaborate, in case someone else is already looking into this.

rickyyx · Nov 22 '24

Prototype in https://github.com/vllm-project/vllm/pull/10651

rickyyx · Nov 26 '24

I thought it was about time to give an update on the latest status of this and note some TODOs.

Status

The v1 engine frontend API server now has a Prometheus-compatible `/metrics` endpoint.
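
As a quick sanity check of that endpoint, it can be scraped and parsed with the standard prometheus_client parser; this sketch assumes a server listening on localhost:8000:

```python
# Scrape the /metrics endpoint and list the vllm metric families.
# Assumes a vLLM server is listening on localhost:8000.
import urllib.request

from prometheus_client.parser import text_string_to_metric_families

text = urllib.request.urlopen("http://localhost:8000/metrics").read().decode()

for family in text_string_to_metric_families(text):
    if family.name.startswith("vllm"):
        print(family.type, family.name, f"({len(family.samples)} samples)")
```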

The following PRs should merge soon:

  • #12579
  • #12592
  • #12644

Once they merge, we will support the following metrics:

  • vllm:num_requests_running (Gauge)
  • vllm:num_requests_waiting (Gauge)
  • vllm:gpu_cache_usage_perc (Gauge)
  • vllm:prompt_tokens_total (Counter)
  • vllm:generation_tokens_total (Counter)
  • vllm:request_success_total (Counter)
  • vllm:request_prompt_tokens (Histogram)
  • vllm:request_generation_tokens (Histogram)
  • vllm:time_to_first_token_seconds (Histogram)
  • vllm:time_per_output_token_seconds (Histogram)
  • vllm:e2e_request_latency_seconds (Histogram)
  • vllm:request_queue_time_seconds (Histogram)
  • vllm:request_inference_time_seconds (Histogram)
  • vllm:request_prefill_time_seconds (Histogram)
  • vllm:request_decode_time_seconds (Histogram)

Also note that vllm:gpu_prefix_cache_queries and vllm:gpu_prefix_cache_hits (Counters) replace vllm:gpu_prefix_cache_hit_rate (Gauge).
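
The hit rate is still easy to derive at query time (for example, in Grafana, as a ratio of rate() expressions over a window). A rough Python sketch of the same ratio, assuming the hits and queries values were read from the scraped counter samples (e.g. via the snippet above):

```python
# Sketch: derive a prefix-cache hit rate from the two counters. Assumes
# `hits` and `queries` were read from the scraped vllm:gpu_prefix_cache_*
# counter samples (e.g. via the parsing snippet above).
def prefix_cache_hit_rate(hits: float, queries: float) -> float:
    """Fraction of prefix-cache queries that hit, guarding against zero queries."""
    return hits / queries if queries > 0 else 0.0


print(f"{prefix_cache_hit_rate(hits=4200.0, queries=6000.0):.1%}")  # -> 70.0%
```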

These are most of the metrics used by the example Grafana dashboard, with the exception of:

  • vllm:num_requests_swapped (Gauge)
  • vllm:cpu_cache_usage_perc (Gauge)
  • vllm:request_max_num_generation_tokens (Histogram)

Additionally, these are other metrics supported by v0, but not yet by v1:

  • vllm:num_preemptions_total (Counter)
  • vllm:cache_config_info (Gauge)
  • vllm:lora_requests_info (Gauge)
  • vllm:cpu_prefix_cache_hit_rate (Gauge)
  • vllm:tokens_total (Counter)
  • vllm:iteration_tokens_total (Histogram)
  • vllm:time_in_queue_requests (Histogram)
  • vllm:model_forward_time_milliseconds (Histogram)
  • vllm:model_execute_time_milliseconds (Histogram)
  • vllm:request_params_n (Histogram)
  • vllm:request_params_max_tokens (Histogram)
  • vllm:spec_decode_draft_acceptance_rate (Gauge)
  • vllm:spec_decode_efficiency (Gauge)
  • vllm:spec_decode_num_accepted_tokens_total (Counter)
  • vllm:spec_decode_num_draft_tokens_total (Counter)
  • vllm:spec_decode_num_emitted_tokens_total (Counter)

Next Steps

  • [x] Merge #12579
  • [x] Merge #12592
  • [x] Merge #12644
  • [x] Make sure --disable-log-stats is disabling everything it can
  • [x] Go through the remaining v0 metrics (see design doc)
  • [x] Merge #13288
  • [x] Merge #13295
  • [x] Merge #13299
  • [x] Merge #13504
  • [x] Update design doc with preemption diagrams from #13169
  • [x] Merge #13169
  • [x] Update Grafana dashboard to work with v1

markmc · Feb 04 '25

As a bit of a status update, here's how the example Grafana dashboard currently looks with a serving benchmark run like this:

$ python3 ./benchmarks/benchmark_serving.py --model meta-llama/Llama-3.1-8B-Instruct --tokenizer meta-llama/Llama-3.1-8B-Instruct --endpoint /v1/completions --dataset-name sharegpt --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --request-rate 3.0 --num-prompts 200
...
============ Serving Benchmark Result ============
Successful requests:                     200       
Benchmark duration (s):                  71.53     
Total input tokens:                      42659     
Total generated tokens:                  43516     
Request throughput (req/s):              2.80      
Output token throughput (tok/s):         608.37    
Total Token throughput (tok/s):          1204.76   
---------------Time to First Token----------------
Mean TTFT (ms):                          24.47     
Median TTFT (ms):                        24.67     
P99 TTFT (ms):                           31.27     
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          13.16     
Median TPOT (ms):                        13.20     
P99 TPOT (ms):                           13.64     
---------------Inter-token Latency----------------
Mean ITL (ms):                           13.14     
Median ITL (ms):                         13.16     
P99 ITL (ms):                            14.78     
==================================================

[Grafana dashboard screenshots]

markmc · Feb 27 '25

What's nice about the above is that even though V1 does not have vllm:num_requests_swapped and vllm:cpu_cache_usage_perc (because V1 doesn't have swap-to-CPU preemption mode), it doesn't impact the user experience of the dashboard - i.e. they just don't show up in the Scheduler State and Cache Utilization panels 👍

markmc · Feb 27 '25

Here's the latest on what's in V0 versus V1:

| In Both | In V0 Only | In V1 Only |
|---------|------------|------------|
| vllm:cache_config_info | vllm:cpu_cache_usage_perc #14136 | vllm:gpu_prefix_cache_hits #12592 |
| vllm:e2e_request_latency_seconds | vllm:cpu_prefix_cache_hit_rate #14136 | vllm:gpu_prefix_cache_queries #12592 |
| vllm:generation_tokens_total | vllm:gpu_prefix_cache_hit_rate #14136 | |
| vllm:gpu_cache_usage_perc | vllm:model_execute_time_milliseconds #14135 | |
| vllm:iteration_tokens_total | vllm:model_forward_time_milliseconds #14135 | |
| vllm:lora_requests_info | vllm:num_requests_swapped #14136 | |
| vllm:num_preemptions_total | vllm:request_max_num_generation_tokens #14055 | |
| vllm:num_requests_running | vllm:request_params_max_tokens #14055 | |
| vllm:num_requests_waiting | vllm:request_params_n #14055 | |
| vllm:prompt_tokens_total | vllm:spec_decode_draft_acceptance_rate | |
| vllm:request_decode_time_seconds | vllm:spec_decode_efficiency | |
| vllm:request_generation_tokens | vllm:spec_decode_num_accepted_tokens_total | |
| vllm:request_inference_time_seconds | vllm:spec_decode_num_draft_tokens_total | |
| vllm:request_prefill_time_seconds | vllm:spec_decode_num_emitted_tokens_total | |
| vllm:request_prompt_tokens | vllm:time_in_queue_requests #14135 | |
| vllm:request_queue_time_seconds | ~~vllm:tokens_total #14134~~ | |
| vllm:request_success_total | | |
| vllm:time_per_output_token_seconds | | |
| vllm:time_to_first_token_seconds | | |

Next Steps

  • [x] Merge #12745
  • [x] Merge #14055
  • [x] Merge #14134
  • [x] Merge #14135
  • [x] Merge #14136
  • [x] Merge #14220
  • [x] Add spec decoding metrics (where relevant) to #12193
  • [ ] Make arrival_time (and all intervals calculated relative to it) use monotonic time to avoid being affected by system clock changes - a potential issue is that arrival_time is part of the public library API (see the sketch after this list)
  • [ ] Review v1 for unused code - e.g. RequestStats in the engine core, arrival_time in EngineCoreRequest, and vllm.v1.stats.common
  • [ ] Add Grafana dashboard notes to design doc (e.g. screenshots, V1 compatibility)
  • [ ] Update design doc with discussion of RequestMetrics API
  • [ ] Document parallel sampling implications on other metrics
  • [ ] Wrap up LoRA discussion in #13303 and wrt #6275
  • [ ] Document the DP design decision to add the engine label in https://github.com/vllm-project/vllm/pull/13923#issuecomment-2714438609
  • [ ] Review the metrics proposed in #12726 for v1 inclusion
  • [ ] Benchmark metrics collection overhead, and look for improvements
  • [ ] Review metrics naming - e.g. use of colon, including units in name, use of '_total'
  • [ ] Look into improving Histogram buckets - e.g. in some cases we probably don't have low enough valued buckets to cover the common range
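
A minimal illustration of the monotonic-time point in the arrival_time item above, using only standard-library calls (the variable names stand in for the actual arrival_time handling and are not vLLM code):

```python
# Why intervals should use a monotonic clock: time.time() follows the wall
# clock, which NTP or an operator can move backwards, while time.monotonic()
# only moves forward, so computed durations stay non-negative.
import time

arrival_wall = time.time()       # wall-clock timestamp: fine for display/reporting
arrival_mono = time.monotonic()  # monotonic timestamp: use for intervals

time.sleep(0.01)                 # stand-in for queueing/prefill/decode work

queue_time = time.monotonic() - arrival_mono  # immune to clock adjustments
print(f"arrived at {arrival_wall:.0f} (unix time), queue_time={queue_time * 1000:.1f} ms")
```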

markmc · Mar 03 '25

Hi, just wanted to check in to see if there is a plan to support per-request-level stats logging? For example: {request_1: {ttit: 10, e2e_latency: 200}}.

liuzijing2014 · Mar 06 '25

This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!

github-actions[bot] · Jun 10 '25

This issue has been automatically closed due to inactivity. Please feel free to reopen if you feel it is still relevant. Thank you!

github-actions[bot] · Jul 11 '25