Rs/feature/metrics
Summary
This PR does three things:
A) Addresses open feature request (#1870) by refactoring and extending the initial metrics implementation (#1890) to:
- Handle more complex metrics such as request level latencies (as requested)
- Build interfaces to make it easier to add new metrics in the future
B) Creates an end-to-end example of how to monitor vLLM with Prometheus and Grafana
C) Updates the existing metric implementations to follow Prometheus best practices, namely:
- Metric Naming (Prom Docs) -> `vllm:num_requests_running` should be `vllm_num_requests_running_total`
- Base Units (Prom Docs) -> times should be in s rather than ms
- Calculating averages as Gauges rather than Counters + PromQL `rate` (Prom Docs) -> `vllm:avg_generation_throughput_toks_per_sec` should be a Counter called `vllm_generation_tokens_total`, with PromQL `rate(vllm_generation_tokens_total[5s])` used to calculate tokens/second during dashboarding (a brief sketch of this pattern follows the list)
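As a concrete illustration of the Counter + `rate()` pattern described above, here is a minimal sketch; the metric name comes from this list, while the `aioprometheus` usage and values are illustrative rather than the exact PR code:

```python
# Minimal sketch of the Counter + PromQL rate() pattern described above.
# The metric name comes from this PR; the rest is illustrative.
from aioprometheus import Counter

counter_generation_tokens = Counter(
    "vllm_generation_tokens_total",
    "Total number of generation tokens processed.",
)

# On each engine iteration, add the number of tokens generated in that step.
counter_generation_tokens.add({}, 42)

# Average throughput is then derived at query time on the dashboard, e.g.:
#   rate(vllm_generation_tokens_total[5s])
```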
A) Implementation
Created / updated the following classes:
- `SequenceGroup`: added a `last_token_time` variable and `get_last_latency` / `get_e2e_latency` methods, which enable us to capture the request-level latencies if logging is enabled.
- `LLMEngine`: added a `PrometheusLogger` and logic to create `Stats`, making a cleaner interface between the `LLMEngine` and logging-related functionality.
  - At the last step of `_process_model_outputs`, we call `LLMEngine._get_stats` to generate `Stats` that are passed to `PrometheusLogger.log`.
- `PrometheusLogger`: holds a list of `PrometheusMetric`s and passes the `Stats` generated by the `LLMEngine` to each.
- `PrometheusMetric`: holds a metric (an `aioprometheus` collector: `Counter`, `Gauge`, or `Histogram`) and a function to extract the appropriate data from `Stats` (a rough sketch of how these classes fit together follows this list).
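A rough sketch of how these pieces might fit together; class shapes, attribute names, and the example metric are assumptions inferred from the description above, not the exact PR code:

```python
# Rough sketch only: names and shapes are inferred from the description
# above, not copied from the PR.
from dataclasses import dataclass
from typing import Callable, List

from aioprometheus import Gauge


@dataclass
class Stats:
    """Snapshot produced by the engine at the end of a step."""
    num_running: int = 0
    num_waiting: int = 0
    gpu_cache_usage: float = 0.0


class PrometheusMetric:
    """Pairs a collector with a function that extracts its value from Stats."""

    def __init__(self, collector: Gauge, fn: Callable[[Stats], float]) -> None:
        self.collector = collector
        self.fn = fn

    def log(self, stats: Stats) -> None:
        # Push the extracted value into the Prometheus collector.
        self.collector.set({}, self.fn(stats))


class PrometheusLogger:
    """Fans each Stats snapshot out to every registered metric."""

    def __init__(self, metrics: List[PrometheusMetric]) -> None:
        self.metrics = metrics

    def log(self, stats: Stats) -> None:
        for metric in self.metrics:
            metric.log(stats)


# Example registry entry and usage:
gauge_scheduler_running = Gauge(
    "vllm_num_requests_running", "Number of requests currently running."
)
logger = PrometheusLogger(
    [PrometheusMetric(gauge_scheduler_running, lambda s: s.num_running)]
)
logger.log(Stats(num_running=3))
```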
Within this framework, created a registry of PrometheusMetrics:
- Re-implemented the existing metrics (modulo updating to conform to Prometheus best practices)
- Implemented new request-level latency timing metrics (TTFT, Inter-Token, and E2E Latency)
Currently supported metrics include:
- `counter_prompt_tokens` --> used with `rate()` to calculate prompt token throughput
- `counter_generation_tokens` --> used with `rate()` to calculate generation token throughput
- `gauge_scheduler_running`
- `gauge_scheduler_swapped`
- `gauge_scheduler_waiting`
- `gauge_gpu_cache_usage`
- `gauge_cpu_cache_usage`
- `histogram_time_to_first_token` --> exposes the counters needed to calculate avg TTFT, P50, P90, P95, P99
- `histogram_inter_token_latency` --> exposes the counters needed to calculate avg inter-token latency, P50, P90, P95, P99
- `histogram_e2e_request_latency` --> exposes the counters needed to calculate avg e2e request latency, P50, P90, P95, P99
See the Example section for a dashboard that shows how these exposed metrics should be monitored.
B) Example
See examples/production_monitoring for an end-to-end example. I included a Grafana dashboard configuration which shows how these metrics should be monitored.
C) Best Practices
I recognize these changes have breaking impacts on the metrics exposed to users.
Key changes include:
- Updated the names of EACH metric (e.g. `vllm:num_requests_swapped` --> `vllm_requests_stopped_total`)
- Updated the average token throughput gauges (`vllm:avg_prompt_throughput_toks_per_s` / `vllm:avg_generation_throughput_toks_per_s`) to be total-tokens-processed counters (`vllm_prompt_tokens_total` / `vllm_generation_tokens_total`)
  - On the dashboard, we can calculate average tokens/sec with `rate(vllm_prompt_tokens_total[30s])`
My sense is that this is a very new feature, so I'm not sure how much user impact there is. However, I think the changes I am suggesting are justified. I am happy to revert these if requested.
Overhead
I used the benchmarking scripts to test performance with and without the logger on an L4 GPU. There is very minor latency overhead.
- `benchmark_serving.py` client: python3 benchmark_serving.py --backend vllm --tokenizer mistralai/Mistral-7B-v0.1 --dataset /home/robertgshaw/vllm/benchmarks/ShareGPT_V3_unfiltered_cleaned_split.json --request-rate 1.0 --num-prompts 200
- Launch with System Logging On:
python3 -m vllm.entrypoints.api_server --model mistralai/Mistral-7B-v0.1 --max-model-len 4096 --swap-space 16 --disable-log-requests
Total time: 265.46 s
Throughput: 0.75 requests/s
Average latency: 18.12 s
Average latency per token: 0.04 s
Average latency per output token: 0.08 s
- Launch with System Logging Off:
python3 -m vllm.entrypoints.api_server --model mistralai/Mistral-7B-v0.1 --max-model-len 4096 --swap-space 16 --disable-log-stats --disable-log-requests
Total time: 265.26 s
Throughput: 0.75 requests/s
Average latency: 18.06 s
Average latency per token: 0.04 s
Average latency per output token: 0.08 s
Next Steps
Next steps to finalize the PR are:
- Running benchmarks on a more powerful system than an L4
Questions
Are there any other things I need to do?
@simon-mo this addresses #1870. Any feedback would be appreciated.
Additionally, this is my first time contributing to vLLM. I tried to get up to speed on the whole repo/submission process/QA process, but let me know if I am missing anything. Thanks.
Hi @rib-2, thank you for your contribution. This PR is definitely in the right direction. A few things to start:
- We are recommending the OpenAI-compatible server as the only server; the internal api_server is only to be used for demo and benchmark purposes. Can you limit the change to the OpenAI-compatible server?
- For the change in abstracting metrics capturing, can you make sure to keep the existing logging message intact? It is heavily used as one of the main pieces of diagnostic information.
- Regarding the metrics definition, I find the `fn` in `Counter`, `Gauge`, and `Histogram` hard to interpret from a code readability and maintainability point of view. Is there a simpler way to achieve something like this?
@simon-mo Sounds good. Thanks for the feedback.
- Will remove from `api_server`. I just added that in so I could test the overhead with the existing benchmarking scripts.
- I will try to refactor the `fn` piece. I originally had each metric as a new class that inherits from `CounterMetric` (etc.) and it got a bit verbose, but I will think through it again.
For the existing logging message --> are you referring to the logger to the command line?
> For the existing logging message --> are you referring to the logger to the command line?
Yes!
Update to address @simon-mo's feedback:
- Removed `/metrics` from `api_server.py`.
- Added back in local logging. This required a change to `PrometheusMetric` to hold `TrackedStats`, which stores data over the `local_logging_interval=5s` to compute metrics like average throughput.
- Refactored `PrometheusMetric` to avoid needing to pass a hard-to-maintain / confusing `fn` to convert a `Stats` to the logging format. Instead, the `PrometheusMetric` now logs an attribute of `Stats` directly. So we can create a `PrometheusMetric` by passing the `aioprometheus` collector, the `attr` of the `Stats` class to use, and a template for converting the metric to a string for local logging, for example (see the sketch after this example):
```python
CounterMetric(
    counter=counter_prompt_tokens,
    attr="num_prompt_tokens",
    template="Prompt throughput: {:0.2f} tok/sec",
)
```
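Internally, such a wrapper might look roughly like the following sketch, assuming the attr/template design described above; the class shape, method names, and interval handling are illustrative and not the exact PR code:

```python
# Hypothetical sketch of the attr/template design described above;
# assumes the same aioprometheus Counter as before. Not the exact PR code.
from aioprometheus import Counter


class CounterMetric:
    """Exports a Stats attribute to a Prometheus Counter and formats it
    for the human-readable local log line."""

    def __init__(self, counter: Counter, attr: str, template: str) -> None:
        self.counter = counter
        self.attr = attr
        self.template = template
        self.tracked_total = 0.0  # accumulated over the local logging interval

    def log(self, stats) -> None:
        value = getattr(stats, self.attr)
        self.counter.add({}, value)   # always exported to Prometheus
        self.tracked_total += value   # kept for local logging

    def log_local(self, interval_s: float) -> str:
        # e.g. "Prompt throughput: 123.45 tok/sec" over the last interval
        message = self.template.format(self.tracked_total / interval_s)
        self.tracked_total = 0.0
        return message
```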
Additionally, it was getting a bit confusing to have the class definitions in the same file as the metric definitions in metrics.py, so I split it into metrics.py, which implements the classes, and metrics_registry.py, which is where the actual metrics are defined.
Benchmarked on an A100 with:
python3 benchmark_serving.py --request-rate 5.0 --dataset ShareGPT_V3_unfiltered_cleaned_split.json --tokenizer mistralai/Mistral-7B-v0.1
Performance from benchmark_serving with logging enabled:
Total time: 215.30 s
Throughput: 4.64 requests/s
Average latency: 7.73 s
Average latency per token: 0.02 s
Average latency per output token: 0.04 s
Performance from benchmark_serving with logging disabled:
Total time: 214.20 s
Throughput: 4.67 requests/s
Average latency: 7.27 s
Average latency per token: 0.02 s
Average latency per output token: 0.04 s
@NikolaBorisov thanks for the feedback. Addressed a lot of your concerns in threads.
Additionally addressing your comments:
> - I don't think the readme, the docker-compose file, the prometheus.yaml file, or the simulate_clients files are appropriate. Most people that will monitor vLLM with Prometheus already have Prometheus and Grafana set up. I just don't think they are needed.
I agree setting up Grafana and Prometheus is probably not useful here. However, I think the grafana.json dashboard is really the point (to show an example of how someone should use the metrics being logged to Prometheus to compute the percentiles for latency). That said, it's not really possible to have a self-contained example of using the Grafana dashboard without setting up Prometheus and Grafana itself. As a result, I included it as an example of how someone might interpret the metrics being logged, but obviously this can be removed if others do not think it is valuable.
Also, I put it in the examples directory for this purpose.
> - Logging and the metrics should not be coupled. Only a handful of metrics should be logged.
I am confused what you mean by this. Could you clarify further?
@simon-mo requested in #1870 that more metrics be added to the server. vLLM currently supports logging these metrics locally to stdout and to Prometheus. I thought it made sense to have a single class, MetricsLogger, separated from the LLMEngine, that handles this.
> I agree setting up Grafana and Prometheus is probably not useful here. However, I think the `grafana.json` dashboard is really the point (to show an example of how someone should use the metrics being logged to Prometheus to compute the percentiles for latency). That said, it's not really possible to have a self-contained example of using the Grafana dashboard without setting up Prometheus and Grafana itself. As a result, I included it as an example of how someone might interpret the metrics being logged, but obviously this can be removed if others do not think it is valuable.
>
> Also, I put it in the `examples` directory for this purpose.
The grafana.json file is good. I think it should stay.
> - Logging and the metrics should not be coupled. Only a handful of metrics should be logged.
>
> I am confused what you mean by this. Could you clarify further?
The log line printed to the standard output is meant to be human readable. It should only log a few important metrics. The Prometheus metrics will be a lot more numerous. Most Prometheus metrics we add should not be printed to the logger.
> I agree setting up Grafana and Prometheus is probably not useful here. However, I think the `grafana.json` dashboard is really the point (to show an example of how someone should use the metrics being logged to Prometheus to compute the percentiles for latency). That said, it's not really possible to have a self-contained example of using the Grafana dashboard without setting up Prometheus and Grafana itself. As a result, I included it as an example of how someone might interpret the metrics being logged, but obviously this can be removed if others do not think it is valuable. Also, I put it in the `examples` directory for this purpose.
>
> The grafana.json file is good. I think it should stay.
Okay, great. I can restructure the example to just have the grafana.json if the rest is not valuable.
> - Logging and the metrics should not be coupled. Only a handful of metrics should be logged.
>
> I am confused what you mean by this. Could you clarify further?
>
> The log line printed to the standard output is meant to be human readable. It should only log a few important metrics. The Prometheus metrics will be a lot more numerous. Most Prometheus metrics we add should not be printed to the logger.
Agreed. You can see in the definition of `PrometheusMetric` that there is a `log_local` argument which can be passed. If set to False, that metric will not be logged locally. For instance, the new metrics I added are not logged locally.
I very much recognize that this PR adds more abstraction / code in contrast to the current "direct" implementation of metric logging where everything is handled in the LLMEngine. I did this to try to remove complexity from the LLMEngine (particularly around maintaining the last 5s worth of statistics in attributes) to make it easier to add new metrics (and even new logging backends - such as supporting AWS CloudWatch for SageMaker deployments) in the future.
Please let me know if this is not something that is useful
Just my 2 cents. All of the outside (surface-visible) changes are good; however, the stats classes add a lot of noise for not much in return. The idea to use a Stats class that gets computed in the engine and is then passed to the logging/metrics code is great. However, all the rest of the classes add very little. Also, it's not really easy to modify for a particular deployment (i.e. the code needs to be patched).
If some of the metrics are more time-consuming (i.e. make latency worse), it makes sense to put them behind a flag.
Also, you could drop all the helper classes and replace them with a command-line-configurable log line, so people can choose what to see and what not (with a reasonable default, of course), including how often to log (in a separate flag). For running casually on a dev machine, once every 10s might be fine, but for production 1s might make a lot more sense.
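To make that suggestion concrete, here is a purely hypothetical sketch of such flags; these argument names are illustrative only and are not existing vLLM flags:

```python
# Purely illustrative sketch of the flag-based approach suggested above;
# these argument names are hypothetical and are not existing vLLM flags.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--log-stats",
    default="throughput,cache_usage",
    help="Comma-separated list of stats to include in the local log line.",
)
parser.add_argument(
    "--log-stats-interval",
    type=float,
    default=10.0,
    help="Seconds between local stat log lines (e.g. 10 for dev, 1 for prod).",
)

args = parser.parse_args([])  # parse defaults here; use real argv in practice
enabled_stats = set(args.log_stats.split(","))
print(enabled_stats, args.log_stats_interval)
```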
Thanks @ichernev - I will modify the code as you have described
If we want to add more loggers later, having the abstractions might make sense.
> Will remove from `api_server`. I just added that in so I could test the overhead with the existing benchmarking scripts.
In case anyone is still interested in this, I made #2172 a few weeks ago which allows you to use the existing benchmark script with the OpenAI compatible server.
@simon-mo ready for your review
On A100-80GB mistral-7b with metrics:
python3 ../../benchmarks/benchmark_serving.py --model mistralai/Mistral-7B-v0.1 --tokenizer mistralai/Mistral-7B-v0.1 --endpoint /v1/completions --dataset ShareGPT_V3_unfiltered_cleaned_split.json --request-rate 5.
Average latency: 4.63 s
Average latency per token: 0.01 s
Average latency per output token: 0.02 s
On A100-80GB mistral-7b without metrics:
python3 ../../benchmarks/benchmark_serving.py --model mistralai/Mistral-7B-v0.1 --tokenizer mistralai/Mistral-7B-v0.1 --endpoint /v1/completions --dataset ShareGPT_V3_unfiltered_cleaned_split.json --request-rate 5.
Average latency: 4.53 s
Average latency per token: 0.01 s
Average latency per output token: 0.02 s
I'm not a reviewer so feel free to ignore me, but this PR might be a nice opportunity to switch to using the official Prometheus Python Client because it is more actively maintained and has many more stars (3.7k vs aioprometheus's 158)?
@HMellor do you happen to know if that client supports asyncio? I can do a deeper dive if needed.
I'm not sure why the decision was made for aioprometheus, but I thought it might be asyncio-related based on the commentary in the library.
@rib-2 @HMellor I asked about the decision of using aioprometheus over the official package in the original PR and here is a link to the response I received https://github.com/vllm-project/vllm/pull/1890#issuecomment-1881491178
In the Prometheus docs it appears that they do support FastAPI in a very similar way to aioprometheus, so it looks like it should just work.
@ronensc I had seen this and was going to reference it but you beat me to it! I interpret the answer as "it appears to work well enough and we don't have strong feelings about switching" so it was left as is.
Personally I'd be in favour of switching to the first-party library if the change is as simple as:
```diff
- from aioprometheus import MetricsMiddleware
- from aioprometheus.asgi.starlette import metrics
+ from prometheus_client import make_asgi_app
...
- app.add_middleware(MetricsMiddleware)
- app.add_route("/metrics", metrics)
+ metrics_app = make_asgi_app()
+ app.mount("/metrics", metrics_app)
```
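For reference, a self-contained toy FastAPI app using the `prometheus_client` mounting pattern from the diff above; the app and metric names here are illustrative, not vLLM's:

```python
# Toy app demonstrating the prometheus_client mounting pattern shown above.
from fastapi import FastAPI
from prometheus_client import Counter, make_asgi_app

app = FastAPI()

# Serve the default registry at /metrics via the official client's ASGI app.
app.mount("/metrics", make_asgi_app())

requests_total = Counter("demo_requests_total", "Total demo requests served.")


@app.get("/ping")
async def ping():
    requests_total.inc()
    return {"status": "ok"}
```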
> In the Prometheus docs it appears that they do support FastAPI in a very similar way to `aioprometheus`, so it looks like it should just work.
>
> @ronensc I had seen this and was going to reference it but you beat me to it! I interpret the answer as "it appears to work well enough and we don't have strong feelings about switching" so it was left as is.
>
> Personally I'd be in favour of switching to the first-party library if the change is as simple as:
>
> ```diff
> - from aioprometheus import MetricsMiddleware
> - from aioprometheus.asgi.starlette import metrics
> + from prometheus_client import make_asgi_app
> ...
> - app.add_middleware(MetricsMiddleware)
> - app.add_route("/metrics", metrics)
> + metrics_app = make_asgi_app()
> + app.mount("/metrics", metrics_app)
> ```
I will try this and do some performance testing to confirm it works okay.
> In the Prometheus docs it appears that they do support FastAPI in a very similar way to `aioprometheus`, so it looks like it should just work.
I read the code and the Prometheus Python Client should also work. We added the metrics to vllm originally, and we just used aioprometheus because we were using it in another async code-base. It should be OK to switch, but I would do it in another PR.
> In the Prometheus docs it appears that they do support FastAPI in a very similar way to `aioprometheus`, so it looks like it should just work.
>
> I read the code and the Prometheus Python Client should also work. We added the metrics to vllm originally, and we just used aioprometheus because we were using it in another async code-base. It should be OK to switch, but I would do it in another PR.
I like the idea of doing it in another PR
The only outstanding item, I think, is the time_per_output_token calculation. I left some thoughts, @simon-mo; I believe the way we are doing it now works well and calculates properly. LMK if my comments make sense.
- made aesthetic fixes
- fixed the metric names so that they don't break existing usage (@NikolaBorisov's request)
@simon-mo requesting re-review