Add more Prometheus metrics
This PR adds the Prometheus metrics defined in #2650
@rib-2, I highly value your opinion. Would you please review my pull request?
@robertgshaw2-neuralmagic
@simon-mo Could you please review this PR?
I've added 3 new panels at the bottom of the Grafana dashboard showcasing the metrics `vllm:request_prompt_tokens_bucket`, `vllm:request_generation_tokens_bucket` and `vllm:request_success_total`.
I haven't included panels for `n` and `best_of` since the current benchmarking script sets a constant value for `best_of` and doesn't set `n` at all.
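For reference, histogram metrics like `vllm:request_prompt_tokens` are what produce the `_bucket` series plotted in those panels, and the counter produces the `_total` series. Below is a minimal sketch of metrics of this shape using `prometheus_client`; the bucket boundaries, label names, and port are illustrative placeholders, not the exact definitions from this PR:

```python
# Sketch of histogram/counter metrics similar in spirit to the ones added in
# this PR. Metric names mirror the PR's metrics; bucket boundaries, labels,
# and the port are placeholders, not the actual vLLM definitions.
from prometheus_client import Counter, Histogram, start_http_server

request_prompt_tokens = Histogram(
    "vllm:request_prompt_tokens",
    "Number of prompt tokens per request.",
    labelnames=["model_name"],
    buckets=[1, 25, 50, 100, 250, 500, 1000, 2500, 5000],  # placeholder buckets
)

request_generation_tokens = Histogram(
    "vllm:request_generation_tokens",
    "Number of generation tokens per request.",
    labelnames=["model_name"],
    buckets=[1, 25, 50, 100, 250, 500, 1000, 2500, 5000],  # placeholder buckets
)

request_success = Counter(
    "vllm:request_success",
    "Count of successfully finished requests.",
    labelnames=["model_name", "finished_reason"],
)

if __name__ == "__main__":
    start_http_server(8001)  # expose /metrics for Prometheus to scrape
    # Record one example request so the series show up immediately.
    request_prompt_tokens.labels(model_name="gpt2").observe(128)
    request_generation_tokens.labels(model_name="gpt2").observe(32)
    request_success.labels(model_name="gpt2", finished_reason="stop").inc()
```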
I ran the benchmark on the `gpt2` model as I couldn't fit `mistralai/Mistral-7B-v0.1` onto my GPU:
# Start vLLM
python3 -m vllm.entrypoints.openai.api_server --model gpt2 --disable-log-requests
# Run benchmarking script
python3 ../../benchmarks/benchmark_serving.py --model gpt2 --tokenizer gpt2 --endpoint /v1/completions --dataset ShareGPT_V3_unfiltered_cleaned_split.json --request-rate 10.0
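As a quick sanity check after the benchmark run, something like the following (not part of the PR) can dump just the new series from the server's `/metrics` endpoint; it assumes the API server started above is listening on the default port 8000:

```python
# Scrape the running vLLM server's /metrics endpoint and print only the
# per-request series discussed in this PR. The URL/port is an assumption
# based on the default api_server settings.
import urllib.request

METRICS_URL = "http://localhost:8000/metrics"
WANTED = (
    "vllm:request_prompt_tokens",
    "vllm:request_generation_tokens",
    "vllm:request_success",
)

with urllib.request.urlopen(METRICS_URL) as resp:
    for line in resp.read().decode().splitlines():
        # Sample lines start with the metric name; HELP/TYPE lines start with '#'.
        if line.startswith(WANTED):
            print(line)
```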
wow cool histogram visuals!
@ronensc is this ready for review?
@ronensc Left some comments.
I've also been working on a branch for expanding metrics. I'm going to make a PR to your branch if that's okay.
https://github.com/ronensc/vllm/pull/1
^ draft PR to fix the issues I identified above, plus a couple more metrics.
@ronensc if you can finish this off it would be great. Otherwise, I will finish it off over the weekend.
Thanks @ronensc
This is a bug on main right now, so it will get merged ASAP once it's ready to go.
I see. I'll do my best to deliver ASAP
- I've incorporated most of the changes from https://github.com/ronensc/vllm/pull/1 into this PR.
- I'm not sure whether the assumption in `maybe_get_last_latency()` and `maybe_set_first_token_time()` is correct. Both methods are called after `self.output_processor.process_outputs()` (and specifically after `seq_group.update_num_computed_tokens()`). By that point, when the first generation token is ready, it has already been added to the sequence, so the state of `seq_group` has changed from `PREFILL` to `DECODE` and `get_num_uncomputed_tokens() == 1`. For `maybe_set_first_token_time()`, I suggest using the condition `self.get_seqs()[0].get_output_len() == 1` to determine when the first token is generated. As for `maybe_get_last_latency()`, I suggest using the condition `self.is_prefill()` to check whether chunked prefill is still ongoing (see the sketch below).
- I modified `num_generation_tokens_iter += 1` to `num_generation_tokens_iter += seq_group.num_unfinished_seqs()` to accommodate requests with more than one sequence (like beam search and parallel sampling).
- I'll postpone adding the additional metrics until we get the current set of metrics right.

Note: I applied the changes from https://github.com/vllm-project/vllm/pull/4150 locally to aid in debugging.
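To make the suggested checks concrete, here is a rough, self-contained sketch. Only `get_seqs()`, `get_output_len()`, `is_prefill()`, `maybe_set_first_token_time()`, and `maybe_get_last_latency()` come from the discussion above; the stub classes, attribute names, and method bodies are simplified assumptions for illustration, not vLLM's actual implementation:

```python
from typing import List, Optional


class Sequence:
    """Minimal stand-in for vLLM's Sequence: just enough state for the sketch."""

    def __init__(self) -> None:
        self.output_token_ids: List[int] = []
        self.prefill_done = False  # True once the whole prompt has been computed

    def get_output_len(self) -> int:
        return len(self.output_token_ids)

    def is_prefill(self) -> bool:
        return not self.prefill_done


class SequenceGroup:
    """Trimmed-down stand-in for vLLM's SequenceGroup showing the two
    condition checks suggested above. Everything outside the quoted method
    names is an illustrative assumption."""

    def __init__(self, arrival_time: float, seqs: List[Sequence]) -> None:
        self.seqs = seqs
        self.first_token_time: Optional[float] = None
        self.last_token_time = arrival_time

    def get_seqs(self) -> List[Sequence]:
        return self.seqs

    def is_prefill(self) -> bool:
        # Still prefilling until the whole prompt has been computed,
        # possibly over several chunked-prefill steps.
        return self.get_seqs()[0].is_prefill()

    def maybe_set_first_token_time(self, time_: float) -> None:
        # Record time-to-first-token exactly once: the first generation token
        # has just been appended, so the first sequence has one output token.
        if (self.first_token_time is None
                and self.get_seqs()[0].get_output_len() == 1):
            self.first_token_time = time_

    def maybe_get_last_latency(self, now: float) -> Optional[float]:
        # Skip inter-token latency while chunked prefill is still ongoing;
        # only report it for decode iterations.
        if self.is_prefill():
            return None
        latency = now - self.last_token_time
        self.last_token_time = now
        return latency
```

The key point is that the TTFT check keys off the output length of the first sequence rather than the PREFILL/DECODE stage, which has already flipped by the time these hooks run.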
@ronensc is this ready for review?
In the current state of the PR, some of the metrics are still inaccurate with chunked_prefill. Could we merge this PR up to commit https://github.com/vllm-project/vllm/pull/2764/commits/5ded719d9cfd09b300562ec1d4df21fb0a4e79a3 (i.e., before attempting to solve the chunked_prefill issue) and tackle the chunked_prefill problem in a follow-up PR? What do you think? Also, just a heads-up: I'll be less available in the coming days.
Huge thanks for all the work on this and reviews @ronensc @robertgshaw2-neuralmagic @hmellor
I'm just thinking, though:
"I'm not sure the whether assumption in maybe_get_last_latency() and maybe_set_first_token_time() is correct. Both methods are called after self.output_processor.process_outputs() (and specifically after seq_group.update_num_computed_tokens()). By this point, when the first generation token is ready, it is already added to the sequence, so the state of seq_group is changed from PREFILL to DECODE and get_num_uncomputed_tokens() == 1. For maybe_set_first_token_time(), I suggest using the condition self.get_seqs()[0].get_output_len() == 1 to determine when the first token is generated. As for maybe_get_last_latency(), I suggest using the condition self.is_prefill() to check when chunked_prefill is ongoing."
Will merge this weekend
@simon-mo @njhill I had to make a couple of changes for correctness due to some subtlety with chunked_prefill. Mind giving a brief stamp?