Add speculative-inference-related metrics to Prometheus.
Also add a new boost_ratio metric that directly shows how much speculative inference helps in saving decoding steps.
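A minimal sketch of what exposing such a metric through `prometheus_client` could look like (the metric name and wiring here are illustrative assumptions, not the PR's actual code):

```python
# Hypothetical sketch: exposing a spec-decode boost ratio as a
# Prometheus gauge. The metric name is illustrative only.
from prometheus_client import Gauge, generate_latest

spec_decode_boost_ratio = Gauge(
    "vllm_spec_decode_boost_ratio",
    "Average tokens emitted per decoding step with speculative decoding.",
)

# Updated periodically from the engine's stats-logging loop.
spec_decode_boost_ratio.set(2.5)

# The gauge then appears in the /metrics scrape payload.
payload = generate_latest().decode()
print("vllm_spec_decode_boost_ratio 2.5" in payload)
```

A gauge fits here because the ratio is a point-in-time value recomputed each logging interval rather than a monotonically increasing count.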
cc @cadedaniel
will take a look Monday. btw, how is this different from the system efficiency metric? (boost ratio == (num_spec_tokens+1) * system efficiency?)
+1
Thanks for the contribution! It would be great to have these metrics flowing through prometheus!
> will take a look Monday. btw, how is this different from the system efficiency metric? (boost ratio == (num_spec_tokens+1) * system efficiency?)
The new boost_ratio gives a more accurate picture of how much the system benefits from speculative inference, since there are cases where the speculator makes no proposal, e.g. no n-gram match, or seqlen + spec exceeding the model's max length.
Furthermore, with the new dynamic spec coming in https://github.com/vllm-project/vllm/issues/4565, k will no longer be constant, so we may need to accumulate the actual tokens emitted and compare against the number of steps.
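The accumulation described above could be sketched as follows (class and counter names are hypothetical, not the PR's actual implementation):

```python
# Hypothetical sketch of the boost_ratio accumulation: with dynamic k,
# num_spec_tokens varies per step, so we accumulate the actual tokens
# emitted and divide by the number of decoding steps taken.

class SpecDecodeBoostTracker:
    """Tracks average tokens emitted per decoding step.

    boost_ratio == 1.0 means spec decode saved nothing (one token per
    step, same as vanilla decoding); higher values mean fewer steps.
    """

    def __init__(self) -> None:
        self.num_steps = 0
        self.num_emitted_tokens = 0

    def on_step(self, num_accepted_spec_tokens: int) -> None:
        # Each step emits the accepted speculative tokens plus the one
        # token sampled from the target model.
        self.num_steps += 1
        self.num_emitted_tokens += num_accepted_spec_tokens + 1

    @property
    def boost_ratio(self) -> float:
        if self.num_steps == 0:
            return 0.0
        return self.num_emitted_tokens / self.num_steps


tracker = SpecDecodeBoostTracker()
# Per-step accepted counts; 0 covers e.g. an n-gram miss with no proposal.
for accepted in [3, 0, 2, 1]:
    tracker.on_step(accepted)
print(tracker.boost_ratio)  # (4 + 1 + 3 + 2) / 4 = 2.5
```

Note how the step with zero accepted tokens still counts toward the denominator, which is exactly why this differs from system efficiency when the speculator produces no proposal.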
@cadedaniel @robertgshaw2-neuralmagic Any comment for the latest PR change? :)
asking @LiuXiaoxuanPKU if she has bandwidth to review the PR. the approach looks good to me, concerns are 1) we should make sure the top-level metrics make sense to users (not just to us as developers), 2) the naming of the metrics collection seems weird
reviewed
cade + i discussing a path fwd
Hi @robertgshaw2-neuralmagic @cadedaniel ,
How is it going with the spec-related metrics? Have we reached a conclusion on how to make it happen? ;) The metric is critical to us as direct feedback reflecting how well the current spec system is doing.
thanks & sorry this slipped. I might have time tomorrow to finish review. cc @LiuXiaoxuanPKU and @comaniac who might have bandwidth.
@cadedaniel I submitted a rebased PR, which keeps the concat logic as before. num_spec is made to aggregate the "k" number.