Add speculative-inference-related metrics to Prometheus.
Also add a new boost_ratio metric that directly shows how much speculative inference helps in saving decoding steps.
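A minimal sketch of what exposing such a metric through `prometheus_client` could look like (the metric name and wiring here are illustrative assumptions, not the PR's actual code):

```python
# Hypothetical sketch: exposing a spec-decode boost ratio as a
# Prometheus gauge. The metric name is illustrative only.
from prometheus_client import Gauge, generate_latest

spec_decode_boost_ratio = Gauge(
    "vllm_spec_decode_boost_ratio",
    "Average tokens emitted per decoding step with speculative decoding.",
)

# Updated periodically from the engine's stats-logging loop.
spec_decode_boost_ratio.set(2.5)

# The gauge then appears in the /metrics scrape payload.
payload = generate_latest().decode()
print("vllm_spec_decode_boost_ratio 2.5" in payload)
```

A gauge fits here because the ratio is a point-in-time value recomputed each logging interval rather than a monotonically increasing count.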
cc @cadedaniel
will take a look Monday. btw, how is this different from the system efficiency metric? (boost ratio == (num_spec_tokens+1) * system efficiency?)
+1
Thanks for the contribution! It would be great to have these metrics flowing through prometheus!
> will take a look Monday. btw, how is this different from the system efficiency metric? (boost ratio == (num_spec_tokens+1) * system efficiency?)
The new boost_ratio gives a more accurate picture of how much the system benefits from speculative inference, since there are cases where the speculator makes no proposal, e.g. no n-gram match, or seqlen + spec exceeding the model's max length.
Furthermore, with the new dynamic spec coming in https://github.com/vllm-project/vllm/issues/4565, k will no longer be constant, so we may need to accumulate the actual tokens emitted and compare against the number of steps.
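The accumulation described above could be sketched as follows (class and counter names are hypothetical, not the PR's actual implementation):

```python
# Hypothetical sketch of the boost_ratio accumulation: with dynamic k,
# num_spec_tokens varies per step, so we accumulate the actual tokens
# emitted and divide by the number of decoding steps taken.

class SpecDecodeBoostTracker:
    """Tracks average tokens emitted per decoding step.

    boost_ratio == 1.0 means spec decode saved nothing (one token per
    step, same as vanilla decoding); higher values mean fewer steps.
    """

    def __init__(self) -> None:
        self.num_steps = 0
        self.num_emitted_tokens = 0

    def on_step(self, num_accepted_spec_tokens: int) -> None:
        # Each step emits the accepted speculative tokens plus the one
        # token sampled from the target model.
        self.num_steps += 1
        self.num_emitted_tokens += num_accepted_spec_tokens + 1

    @property
    def boost_ratio(self) -> float:
        if self.num_steps == 0:
            return 0.0
        return self.num_emitted_tokens / self.num_steps


tracker = SpecDecodeBoostTracker()
# Per-step accepted counts; 0 covers e.g. an n-gram miss with no proposal.
for accepted in [3, 0, 2, 1]:
    tracker.on_step(accepted)
print(tracker.boost_ratio)  # (4 + 1 + 3 + 2) / 4 = 2.5
```

Note how the step with zero accepted tokens still counts toward the denominator, which is exactly why this differs from system efficiency when the speculator produces no proposal.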
@cadedaniel @robertgshaw2-neuralmagic Any comment for the latest PR change? :)
asking @LiuXiaoxuanPKU if she has bandwidth to review the PR. the approach looks good to me, concerns are 1) we should make sure the top-level metrics make sense to users (not just to us as developers), 2) the naming of the metrics collection seems weird
reviewed
cade + i discussing a path fwd
Hi @robertgshaw2-neuralmagic @cadedaniel ,
How is it going with the spec-related metrics? Have we reached a conclusion on how to make it happen? ;) The metric is critical to us as direct feedback reflecting how well the current spec system is doing.
thanks & sorry this slipped. I might have time tomorrow to finish review. cc @LiuXiaoxuanPKU and @comaniac who might have bandwidth.
@cadedaniel I submitted a rebased PR, which keeps the concat logic as before. num_spec is made to aggregate the "k" number.