Rs/feature/metrics
Summary
This PR does three things:
A) Addresses open feature request (#1870) by refactoring and extending the initial metrics implementation (#1890) to:
- Handle more complex metrics such as request level latencies (as requested)
- Build interfaces to make it easier to add new metrics in the future
B) Creates an end-to-end example of how to monitor vLLM with Prometheus and Grafana
C) Updates the existing metric implementations to follow Prometheus best practices, namely:
- Metric Naming (Prom Docs) -> `vllm:num_requests_running` should be `vllm_num_requests_running_total`
- Base Units (Prom Docs) -> times should be in s rather than ms
- Calculating averages as Gauges rather than Counters + PromQL `rate` (Prom Docs) -> `vllm:avg_generation_throughput_toks_per_sec` should be a Counter called `vllm_generation_tokens_total`, with PromQL `rate(vllm_generation_tokens_total[5s])` used to calculate tokens/second during dashboarding (a brief sketch of this pattern follows the list)
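As a concrete illustration of the Counter + `rate()` pattern described above, here is a minimal sketch; the metric name comes from this list, while the `aioprometheus` usage and values are illustrative rather than the exact PR code:

```python
# Minimal sketch of the Counter + PromQL rate() pattern described above.
# The metric name comes from this PR; the rest is illustrative.
from aioprometheus import Counter

counter_generation_tokens = Counter(
    "vllm_generation_tokens_total",
    "Total number of generation tokens processed.",
)

# On each engine iteration, add the number of tokens generated in that step.
counter_generation_tokens.add({}, 42)

# Average throughput is then derived at query time on the dashboard, e.g.:
#   rate(vllm_generation_tokens_total[5s])
```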
A) Implementation
Created / updated the following classes:
- `SequenceGroup`: added a `last_token_time` variable and `get_last_latency` / `get_e2e_latency` methods, which enable us to capture the request-level latencies if logging is enabled.
- `LLMEngine`: added a `PrometheusLogger` and logic to create `Stats`, making a cleaner interface between the `LLMEngine` and logging-related functionality.
  - At the last step of `_process_model_outputs`, we call `LLMEngine._get_stats` to generate `Stats` that are passed to `PrometheusLogger.log`.
- `PrometheusLogger`: holds a list of `PrometheusMetric`s and passes the `Stats` generated by the `LLMEngine` to each.
- `PrometheusMetric`: holds a metric (an `aioprometheus` collector: `Counter`, `Gauge`, or `Histogram`) and a function to extract the appropriate data from `Stats` (a rough sketch of how these classes fit together follows this list).
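A rough sketch of how these pieces might fit together; class shapes, attribute names, and the example metric are assumptions inferred from the description above, not the exact PR code:

```python
# Rough sketch only: names and shapes are inferred from the description
# above, not copied from the PR.
from dataclasses import dataclass
from typing import Callable, List

from aioprometheus import Gauge


@dataclass
class Stats:
    """Snapshot produced by the engine at the end of a step."""
    num_running: int = 0
    num_waiting: int = 0
    gpu_cache_usage: float = 0.0


class PrometheusMetric:
    """Pairs a collector with a function that extracts its value from Stats."""

    def __init__(self, collector: Gauge, fn: Callable[[Stats], float]) -> None:
        self.collector = collector
        self.fn = fn

    def log(self, stats: Stats) -> None:
        # Push the extracted value into the Prometheus collector.
        self.collector.set({}, self.fn(stats))


class PrometheusLogger:
    """Fans each Stats snapshot out to every registered metric."""

    def __init__(self, metrics: List[PrometheusMetric]) -> None:
        self.metrics = metrics

    def log(self, stats: Stats) -> None:
        for metric in self.metrics:
            metric.log(stats)


# Example registry entry and usage:
gauge_scheduler_running = Gauge(
    "vllm_num_requests_running", "Number of requests currently running."
)
logger = PrometheusLogger(
    [PrometheusMetric(gauge_scheduler_running, lambda s: s.num_running)]
)
logger.log(Stats(num_running=3))
```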
Within this framework, created a registry of PrometheusMetrics:
- Re-implemented the existing metrics (modulo updating to conform to Prometheus best practices)
- Implemented new request-level latency timing metrics (TTFT, Inter-Token, and E2E Latency)
Currently supported metrics include:
- `counter_prompt_tokens` --> used with `rate()` to calculate prompt token throughput
- `counter_generation_tokens` --> used with `rate()` to calculate generation token throughput
- `gauge_scheduler_running`
- `gauge_scheduler_swapped`
- `gauge_scheduler_waiting`
- `gauge_gpu_cache_usage`
- `gauge_cpu_cache_usage`
- `histogram_time_to_first_token` --> exposes the counters needed to calculate avg TTFT, P50, P90, P95, P99
- `histogram_inter_token_latency` --> exposes the counters needed to calculate avg inter-token latency, P50, P90, P95, P99
- `histogram_e2e_request_latency` --> exposes the counters needed to calculate avg e2e request latency, P50, P90, P95, P99
See the Example section for a dashboard that shows how these exposed metrics should be monitored.
B) Example
See examples/production_monitoring for an end-to-end example. I included a Grafana dashboard configuration which shows how these metrics should be monitored.
C) Best Practices
I recognize these changes have breaking impacts on the metrics exposed to users.
Key changes include:
- Updated the names of EACH metric (e.g. `vllm:num_requests_swapped` --> `vllm_requests_stopped_total`)
- Updated the average token throughput gauges (`vllm:avg_prompt_throughput_toks_per_s` / `vllm:avg_generation_throughput_toks_per_s`) to be total-tokens-processed counters (`vllm_prompt_tokens_total` / `vllm_generation_tokens_total`)
  - On the dashboard, we can calculate average tokens/sec with `rate(vllm_prompt_tokens_total[30s])`
My sense is that this is a very new feature, so I'm not sure how much user impact there is. However, I think the changes I am suggesting are justified. I am happy to revert these if requested.
Overhead
I used the benchmarking scripts to test performance with and without the logger on an L4 GPU. There is very minor latency overhead.
- `benchmark_serving.py` client: python3 benchmark_serving.py --backend vllm --tokenizer mistralai/Mistral-7B-v0.1 --dataset /home/robertgshaw/vllm/benchmarks/ShareGPT_V3_unfiltered_cleaned_split.json --request-rate 1.0 --num-prompts 200
- Launch with System Logging On:
python3 -m vllm.entrypoints.api_server --model mistralai/Mistral-7B-v0.1 --max-model-len 4096 --swap-space 16 --disable-log-requests
Total time: 265.46 s
Throughput: 0.75 requests/s
Average latency: 18.12 s
Average latency per token: 0.04 s
Average latency per output token: 0.08 s
- Launch with System Logging Off:
python3 -m vllm.entrypoints.api_server --model mistralai/Mistral-7B-v0.1 --max-model-len 4096 --swap-space 16 --disable-log-stats --disable-log-requests
Total time: 265.26 s
Throughput: 0.75 requests/s
Average latency: 18.06 s
Average latency per token: 0.04 s
Average latency per output token: 0.08 s
Next Steps
Next steps to finalize the PR are:
- Running benchmarks on a more powerful system than an L4
Questions
Are there any other things I need to do?
@simon-mo this addresses #1870. Any feedback would be appreciated.
Additionally, this is my first time contributing to vLLM. I tried to get up to speed on the whole repo/submission process/QA process, but let me know if I am missing anything. Thanks.
Hi @rib-2, thank you for your contribution. This PR is definitely in the right direction. A few things to start:
- We are recommending the OpenAI-compatible server as the only server; the internal api_server is only to be used for demo and benchmark purposes. Can you limit the change to the OpenAI-compatible server?
- For the change in abstracting metrics capturing, can you make sure to keep the existing logging message intact? It is heavily used as one of the main pieces of diagnostic information.
- Regarding the metrics definition, I find the `fn` in `Counter`, `Gauge`, and `Histogram` hard to interpret from a code readability and maintainability point of view. Is there a simpler way to achieve something like this?
@simon-mo Sounds good. Thanks for the feedback.
- Will remove from `api_server`. I just added that in so I could test the overhead with the existing benchmarking scripts.
- I will try to refactor the `fn` piece. I originally had each metric as a new class that inherits from `CounterMetric` (etc.) and it got a bit verbose, but I will think through it again.
For the existing logging message --> are you referring to the logger to the command line?
> For the existing logging message --> are you referring to the logger to the command line?
Yes!
Update to address @simon-mo's feedback:
- Removed `/metrics` from `api_server.py`.
- Added back in local logging. This required a change to `PrometheusMetric` to hold `TrackedStats`, which stores data over the `local_logging_interval=5s` to compute metrics like average throughput.
- Refactored `PrometheusMetric` to avoid needing to pass a hard-to-maintain / confusing `fn` to convert a `Stats` to the logging format. Instead, the `PrometheusMetric` now logs an attribute of `Stats` directly. So we can create a `PrometheusMetric` by passing the `aioprometheus` collector, the `attr` of the `Stats` class to use, and a template for converting the metric to a string for local logging, for example (see the sketch after this example):
```python
CounterMetric(
    counter=counter_prompt_tokens,
    attr="num_prompt_tokens",
    template="Prompt throughput: {:0.2f} tok/sec",
)
```
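Internally, such a wrapper might look roughly like the following sketch, assuming the attr/template design described above; the class shape, method names, and interval handling are illustrative and not the exact PR code:

```python
# Hypothetical sketch of the attr/template design described above;
# assumes the same aioprometheus Counter as before. Not the exact PR code.
from aioprometheus import Counter


class CounterMetric:
    """Exports a Stats attribute to a Prometheus Counter and formats it
    for the human-readable local log line."""

    def __init__(self, counter: Counter, attr: str, template: str) -> None:
        self.counter = counter
        self.attr = attr
        self.template = template
        self.tracked_total = 0.0  # accumulated over the local logging interval

    def log(self, stats) -> None:
        value = getattr(stats, self.attr)
        self.counter.add({}, value)   # always exported to Prometheus
        self.tracked_total += value   # kept for local logging

    def log_local(self, interval_s: float) -> str:
        # e.g. "Prompt throughput: 123.45 tok/sec" over the last interval
        message = self.template.format(self.tracked_total / interval_s)
        self.tracked_total = 0.0
        return message
```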
Additionally, it was getting a bit confusing to have the class definitions in the same file as the metric definitions in metrics.py, so I split it into metrics.py, which implements the classes, and metrics_registry.py, which is where the actual metrics are defined.
Benchmarked on an A100 with:
python3 benchmark_serving.py --request-rate 5.0 --dataset ShareGPT_V3_unfiltered_cleaned_split.json --tokenizer mistralai/Mistral-7B-v0.1
Performance from benchmark_serving with logging enabled:
Total time: 215.30 s
Throughput: 4.64 requests/s
Average latency: 7.73 s
Average latency per token: 0.02 s
Average latency per output token: 0.04 s
Performance from benchmark_serving with logging disabled:
Total time: 214.20 s
Throughput: 4.67 requests/s
Average latency: 7.27 s
Average latency per token: 0.02 s
Average latency per output token: 0.04 s
@NikolaBorisov thanks for the feedback. Addressed a lot of your concerns in threads.
Additionally addressing your comments:
> - I don't think the readme, the docker-compose file, the prometheus.yaml file, or the simulate_clients files are appropriate. Most people that will monitor vLLM with Prometheus already have Prometheus and Grafana set up. I just don't think they are needed.
I agree setting up Grafana and Prometheus is probably not useful here. However, I think the grafana.json dashboard is really the point (to show an example of how someone should use the metrics being logged to Prometheus to compute the percentiles for latency). That said, it's not really possible to have a self-contained example of using the Grafana dashboard without setting up Prometheus and Grafana itself. As a result, I included it as an example of how someone might interpret the metrics being logged, but obviously this can be removed if others do not think it is valuable.
Also, I put it in the examples directory for this purpose.
> - Logging and the metrics should not be coupled. Only a handful of metrics should be logged.
I am confused what you mean by this. Could you clarify further?
@simon-mo requested in #1870 that more metrics be added to the server. vLLM currently supports logging these metrics locally to stdout and to Prometheus. I thought it made sense to have a single class, MetricsLogger, separated from the LLMEngine, that handles this.
> I agree setting up Grafana and Prometheus is probably not useful here. However, I think the `grafana.json` dashboard is really the point (to show an example of how someone should use the metrics being logged to Prometheus to compute the percentiles for latency). That said, it's not really possible to have a self-contained example of using the Grafana dashboard without setting up Prometheus and Grafana itself. As a result, I included it as an example of how someone might interpret the metrics being logged, but obviously this can be removed if others do not think it is valuable.
>
> Also, I put it in the `examples` directory for this purpose.
The grafana.json file is good. I think it should stay.
> - Logging and the metrics should not be coupled. Only a handful of metrics should be logged.
>
> I am confused what you mean by this. Could you clarify further?
The log line printed to the standard output is meant to be human readable. It should only log a few important metrics. The Prometheus metrics will be a lot more numerous. Most Prometheus metrics we add should not be printed to the logger.
> I agree setting up Grafana and Prometheus is probably not useful here. However, I think the `grafana.json` dashboard is really the point (to show an example of how someone should use the metrics being logged to Prometheus to compute the percentiles for latency). That said, it's not really possible to have a self-contained example of using the Grafana dashboard without setting up Prometheus and Grafana itself. As a result, I included it as an example of how someone might interpret the metrics being logged, but obviously this can be removed if others do not think it is valuable. Also, I put it in the `examples` directory for this purpose.
>
> The grafana.json file is good. I think it should stay.
Okay, great. I can restructure the example to just have the grafana.json if the rest is not valuable.
> - Logging and the metrics should not be coupled. Only a handful of metrics should be logged.
>
> I am confused what you mean by this. Could you clarify further?
>
> The log line printed to the standard output is meant to be human readable. It should only log a few important metrics. The Prometheus metrics will be a lot more numerous. Most Prometheus metrics we add should not be printed to the logger.
Agreed. You can see in the definition of `PrometheusMetric` that there is a `log_local` argument which can be passed. If set to False, that metric will not be logged locally. For instance, the new metrics I added are not logged locally.
I very much recognize that this PR adds more abstraction / code in contrast to the current "direct" implementation of metric logging where everything is handled in the LLMEngine. I did this to try to remove complexity from the LLMEngine (particularly around maintaining the last 5s worth of statistics in attributes) to make it easier to add new metrics (and even new logging backends - such as supporting AWS CloudWatch for SageMaker deployments) in the future.
Please let me know if this is not something that is useful
Just my 2 cents. All of the outside (surface-visible) changes are good; however, the stats classes add a lot of noise for not much in return. The idea to use a Stats class that gets computed in the engine and is then passed to the logging/metrics code is great. However, all the rest of the classes add very little. Also, it's not really easy to modify for a particular deployment (i.e. the code needs to be patched).
If some of the metrics are more time-consuming (i.e. make latency worse), it makes sense to put them behind a flag.
Also, you could drop all the helper classes and replace them with a command-line-configurable log line, so people can choose what to see and what not (with a reasonable default, of course), including how often to log (in a separate flag). For running casually on a dev machine, once every 10s might be fine, but for production 1s might make a lot more sense.
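To make that suggestion concrete, here is a purely hypothetical sketch of such flags; these argument names are illustrative only and are not existing vLLM flags:

```python
# Purely illustrative sketch of the flag-based approach suggested above;
# these argument names are hypothetical and are not existing vLLM flags.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--log-stats",
    default="throughput,cache_usage",
    help="Comma-separated list of stats to include in the local log line.",
)
parser.add_argument(
    "--log-stats-interval",
    type=float,
    default=10.0,
    help="Seconds between local stat log lines (e.g. 10 for dev, 1 for prod).",
)

args = parser.parse_args([])  # parse defaults here; use real argv in practice
enabled_stats = set(args.log_stats.split(","))
print(enabled_stats, args.log_stats_interval)
```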
Thanks @ichernev - I will modify the code as you have described
If we want to add more loggers later, having the abstractions might make sense.
> Will remove from `api_server`. I just added that in so I could test the overhead with the existing benchmarking scripts.
In case anyone is still interested in this, I made #2172 a few weeks ago which allows you to use the existing benchmark script with the OpenAI compatible server.
@simon-mo ready for your review
On A100-80GB mistral-7b with metrics:
python3 ../../benchmarks/benchmark_serving.py --model mistralai/Mistral-7B-v0.1 --tokenizer mistralai/Mistral-7B-v0.1 --endpoint /v1/completions --dataset ShareGPT_V3_unfiltered_cleaned_split.json --request-rate 5.
Average latency: 4.63 s
Average latency per token: 0.01 s
Average latency per output token: 0.02 s
On A100-80GB mistral-7b without metrics:
python3 ../../benchmarks/benchmark_serving.py --model mistralai/Mistral-7B-v0.1 --tokenizer mistralai/Mistral-7B-v0.1 --endpoint /v1/completions --dataset ShareGPT_V3_unfiltered_cleaned_split.json --request-rate 5.
Average latency: 4.53 s
Average latency per token: 0.01 s
Average latency per output token: 0.02 s
I'm not a reviewer so feel free to ignore me, but this PR might be a nice opportunity to switch to using the official Prometheus Python Client because it is more actively maintained and has many more stars (3.7k vs aioprometheus's 158)?
@HMellor do you happen to know if that client supports asyncio? I can do a deeper dive if needed.
I'm not sure why the decision was made for aioprometheus, but I thought it might be asyncio-related based on the commentary in the library.
@rib-2 @HMellor I asked about the decision of using aioprometheus over the official package in the original PR and here is a link to the response I received https://github.com/vllm-project/vllm/pull/1890#issuecomment-1881491178
In the Prometheus docs it appears that they do support FastAPI in a very similar way to aioprometheus, so it looks like it should just work.
@ronensc I had seen this and was going to reference it but you beat me to it! I interpret the answer as "it appears to work well enough and we don't have strong feelings about switching" so it was left as is.
Personally I'd be in favour of switching to the first-party library if the change is as simple as:
```diff
- from aioprometheus import MetricsMiddleware
- from aioprometheus.asgi.starlette import metrics
+ from prometheus_client import make_asgi_app
...
- app.add_middleware(MetricsMiddleware)
- app.add_route("/metrics", metrics)
+ metrics_app = make_asgi_app()
+ app.mount("/metrics", metrics_app)
```
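For reference, a self-contained toy FastAPI app using the `prometheus_client` mounting pattern from the diff above; the app and metric names here are illustrative, not vLLM's:

```python
# Toy app demonstrating the prometheus_client mounting pattern shown above.
from fastapi import FastAPI
from prometheus_client import Counter, make_asgi_app

app = FastAPI()

# Serve the default registry at /metrics via the official client's ASGI app.
app.mount("/metrics", make_asgi_app())

requests_total = Counter("demo_requests_total", "Total demo requests served.")


@app.get("/ping")
async def ping():
    requests_total.inc()
    return {"status": "ok"}
```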
> In the Prometheus docs it appears that they do support FastAPI in a very similar way to `aioprometheus`, so it looks like it should just work.
>
> @ronensc I had seen this and was going to reference it but you beat me to it! I interpret the answer as "it appears to work well enough and we don't have strong feelings about switching" so it was left as is.
>
> Personally I'd be in favour of switching to the first-party library if the change is as simple as:
>
> ```diff
> - from aioprometheus import MetricsMiddleware
> - from aioprometheus.asgi.starlette import metrics
> + from prometheus_client import make_asgi_app
> ...
> - app.add_middleware(MetricsMiddleware)
> - app.add_route("/metrics", metrics)
> + metrics_app = make_asgi_app()
> + app.mount("/metrics", metrics_app)
> ```
I will try this and do some performance testing to confirm it works okay.
> In the Prometheus docs it appears that they do support FastAPI in a very similar way to `aioprometheus`, so it looks like it should just work.
I read the code and the Prometheus Python Client should also work. We added the metrics to vllm originally, and we just used aioprometheus because we were using it in another async code-base. It should be OK to switch, but I would do it in another PR.
> In the Prometheus docs it appears that they do support FastAPI in a very similar way to `aioprometheus`, so it looks like it should just work.
>
> I read the code and the Prometheus Python Client should also work. We added the metrics to vllm originally, and we just used aioprometheus because we were using it in another async code-base. It should be OK to switch, but I would do it in another PR.
I like the idea of doing it in another PR
The only outstanding item, I think, is the time_per_output_token calculation. I left some thoughts, @simon-mo; I believe the way we are doing it now works well and calculates properly. LMK if my comments make sense.
- made aesthetic fixes
- fixed the metric names so that they don't break existing usage (@NikolaBorisov's request)
@simon-mo requesting re-review