Add more Prometheus metrics
This PR adds the Prometheus metrics defined in #2650
@rib-2, I highly value your opinion. Would you please review my pull request?
@robertgshaw2-neuralmagic
@simon-mo Could you please review this PR?
I've added 3 new panels at the bottom of the Grafana dashboard showcasing the metrics `vllm:request_prompt_tokens_bucket`, `vllm:request_generation_tokens_bucket` and `vllm:request_success_total`.
I haven't included panels for `n` and `best_of` since the current benchmarking script sets a constant value for `best_of` and doesn't set `n` at all.
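For reference, histogram metrics like `vllm:request_prompt_tokens` are what produce the `_bucket` series plotted in those panels, and the counter produces the `_total` series. Below is a minimal sketch of metrics of this shape using `prometheus_client`; the bucket boundaries, label names, and port are illustrative placeholders, not the exact definitions from this PR:

```python
# Sketch of histogram/counter metrics similar in spirit to the ones added in
# this PR. Metric names mirror the PR's metrics; bucket boundaries, labels,
# and the port are placeholders, not the actual vLLM definitions.
from prometheus_client import Counter, Histogram, start_http_server

request_prompt_tokens = Histogram(
    "vllm:request_prompt_tokens",
    "Number of prompt tokens per request.",
    labelnames=["model_name"],
    buckets=[1, 25, 50, 100, 250, 500, 1000, 2500, 5000],  # placeholder buckets
)

request_generation_tokens = Histogram(
    "vllm:request_generation_tokens",
    "Number of generation tokens per request.",
    labelnames=["model_name"],
    buckets=[1, 25, 50, 100, 250, 500, 1000, 2500, 5000],  # placeholder buckets
)

request_success = Counter(
    "vllm:request_success",
    "Count of successfully finished requests.",
    labelnames=["model_name", "finished_reason"],
)

if __name__ == "__main__":
    start_http_server(8001)  # expose /metrics for Prometheus to scrape
    # Record one example request so the series show up immediately.
    request_prompt_tokens.labels(model_name="gpt2").observe(128)
    request_generation_tokens.labels(model_name="gpt2").observe(32)
    request_success.labels(model_name="gpt2", finished_reason="stop").inc()
```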
I ran the benchmark on the `gpt2` model as I couldn't fit `mistralai/Mistral-7B-v0.1` onto my GPU:
# Start vLLM
python3 -m vllm.entrypoints.openai.api_server --model gpt2 --disable-log-requests
# Run benchmarking script
python3 ../../benchmarks/benchmark_serving.py --model gpt2 --tokenizer gpt2 --endpoint /v1/completions --dataset ShareGPT_V3_unfiltered_cleaned_split.json --request-rate 10.0
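As a quick sanity check after the benchmark run, something like the following (not part of the PR) can dump just the new series from the server's `/metrics` endpoint; it assumes the API server started above is listening on the default port 8000:

```python
# Scrape the running vLLM server's /metrics endpoint and print only the
# per-request series discussed in this PR. The URL/port is an assumption
# based on the default api_server settings.
import urllib.request

METRICS_URL = "http://localhost:8000/metrics"
WANTED = (
    "vllm:request_prompt_tokens",
    "vllm:request_generation_tokens",
    "vllm:request_success",
)

with urllib.request.urlopen(METRICS_URL) as resp:
    for line in resp.read().decode().splitlines():
        # Sample lines start with the metric name; HELP/TYPE lines start with '#'.
        if line.startswith(WANTED):
            print(line)
```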
wow cool histogram visuals!
@ronensc is this ready for review?
@ronensc Left some comments.
I've also been working on a branch for expanding metrics. I'm going to make a PR to your branch if that's okay.
https://github.com/ronensc/vllm/pull/1
^ draft PR to fix the issues I identified above, plus a couple more metrics.
@ronensc if you can finish this off it would be great. Otherwise, I will finish it off over the weekend.
Thanks @ronensc
This is a bug on main right now, so it will get merged ASAP once it's ready to go.
I see. I'll do my best to deliver ASAP
- I've incorporated most of the changes from https://github.com/ronensc/vllm/pull/1 into this PR.
- I'm not sure whether the assumption in `maybe_get_last_latency()` and `maybe_set_first_token_time()` is correct. Both methods are called after `self.output_processor.process_outputs()` (and specifically after `seq_group.update_num_computed_tokens()`). By that point, when the first generation token is ready, it has already been added to the sequence, so the state of `seq_group` has changed from `PREFILL` to `DECODE` and `get_num_uncomputed_tokens() == 1`. For `maybe_set_first_token_time()`, I suggest using the condition `self.get_seqs()[0].get_output_len() == 1` to determine when the first token is generated. As for `maybe_get_last_latency()`, I suggest using the condition `self.is_prefill()` to check whether chunked prefill is still ongoing (see the sketch below).
- I modified `num_generation_tokens_iter += 1` to `num_generation_tokens_iter += seq_group.num_unfinished_seqs()` to accommodate requests with more than one sequence (like beam search and parallel sampling).
- I'll postpone adding the additional metrics until we get the current set of metrics right.

Note: I applied the changes from https://github.com/vllm-project/vllm/pull/4150 locally to aid in debugging.
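To make the suggested checks concrete, here is a rough, self-contained sketch. Only `get_seqs()`, `get_output_len()`, `is_prefill()`, `maybe_set_first_token_time()`, and `maybe_get_last_latency()` come from the discussion above; the stub classes, attribute names, and method bodies are simplified assumptions for illustration, not vLLM's actual implementation:

```python
from typing import List, Optional


class Sequence:
    """Minimal stand-in for vLLM's Sequence: just enough state for the sketch."""

    def __init__(self) -> None:
        self.output_token_ids: List[int] = []
        self.prefill_done = False  # True once the whole prompt has been computed

    def get_output_len(self) -> int:
        return len(self.output_token_ids)

    def is_prefill(self) -> bool:
        return not self.prefill_done


class SequenceGroup:
    """Trimmed-down stand-in for vLLM's SequenceGroup showing the two
    condition checks suggested above. Everything outside the quoted method
    names is an illustrative assumption."""

    def __init__(self, arrival_time: float, seqs: List[Sequence]) -> None:
        self.seqs = seqs
        self.first_token_time: Optional[float] = None
        self.last_token_time = arrival_time

    def get_seqs(self) -> List[Sequence]:
        return self.seqs

    def is_prefill(self) -> bool:
        # Still prefilling until the whole prompt has been computed,
        # possibly over several chunked-prefill steps.
        return self.get_seqs()[0].is_prefill()

    def maybe_set_first_token_time(self, time_: float) -> None:
        # Record time-to-first-token exactly once: the first generation token
        # has just been appended, so the first sequence has one output token.
        if (self.first_token_time is None
                and self.get_seqs()[0].get_output_len() == 1):
            self.first_token_time = time_

    def maybe_get_last_latency(self, now: float) -> Optional[float]:
        # Skip inter-token latency while chunked prefill is still ongoing;
        # only report it for decode iterations.
        if self.is_prefill():
            return None
        latency = now - self.last_token_time
        self.last_token_time = now
        return latency
```

The key point is that the TTFT check keys off the output length of the first sequence rather than the PREFILL/DECODE stage, which has already flipped by the time these hooks run.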
@ronensc is this ready for review?
In the current state of the PR, some of the metrics are still inaccurate with chunked_prefill. Could we merge this PR up to commit https://github.com/vllm-project/vllm/pull/2764/commits/5ded719d9cfd09b300562ec1d4df21fb0a4e79a3 (i.e., before attempting to solve the chunked_prefill issue) and tackle the chunked_prefill problem in a follow-up PR? What do you think? Also, just a heads-up: I'll be less available in the coming days.
Huge thanks for all the work on this and reviews @ronensc @robertgshaw2-neuralmagic @hmellor
I'm just thinking, though:
"I'm not sure the whether assumption in maybe_get_last_latency() and maybe_set_first_token_time() is correct. Both methods are called after self.output_processor.process_outputs() (and specifically after seq_group.update_num_computed_tokens()). By this point, when the first generation token is ready, it is already added to the sequence, so the state of seq_group is changed from PREFILL to DECODE and get_num_uncomputed_tokens() == 1. For maybe_set_first_token_time(), I suggest using the condition self.get_seqs()[0].get_output_len() == 1 to determine when the first token is generated. As for maybe_get_last_latency(), I suggest using the condition self.is_prefill() to check when chunked_prefill is ongoing."
Will merge this weekend
@simon-mo @njhill I had to make a couple of changes for correctness due to some subtlety with chunked_prefill. Mind giving a brief stamp?