[Feature]: Additional metrics to enable better autoscaling / load balancing of TGI servers in Kubernetes

Open EandrewJones opened this issue 8 months ago • 11 comments

Feature request

TGI provides some valuable metrics on model performance and load today. However, several metrics are still missing, and their absence makes orchestration and autoscaling in Kubernetes harder.

Here is a list of the metrics that we (K8s serving WG, see below) have identified for inclusion in TGI:

| Metric Name | Type | Unit | Implemented by TGI Already |
| --- | --- | --- | --- |
| model_load_time | Counter | Seconds | |
| time_per_output_token_per_batch_size | Histogram | Milliseconds | |
| request_wait_time (total time - time spent on inference) | Histogram | Milliseconds | |
| request_queue_time | Histogram | Milliseconds | ? (tgi_request_queue_duration) |
| max_token_capacity | Counter | Tokens | |
| time_per_prefill_token | Histogram | Milliseconds | |
| total_tokens_in_current_batch | Gauge | Tokens | |
| time_to_first_token | Histogram | Milliseconds | |
| estimated_max_prefill_tokens_per_second | Gauge | Tokens | |
| estimated_max_batch_before_compute_saturation | Gauge | Tokens | |
| request_input_length | Histogram | Tokens | $\checkmark$ (tgi_request_input_length) |
| request_output_length | Histogram | Tokens | $\checkmark$ (tgi_request_generated_tokens) |
| request_with_evicted_tokens | Counter | Count | |
| total_evicted_tokens | Counter | Tokens | |
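
To make a few of the latency definitions concrete, here is a minimal Python sketch of how they could be derived from per-request timestamps. The `RequestTiming` class, its field names, and the choice to measure `time_to_first_token` from request arrival are assumptions made for illustration only; they do not describe TGI's internals or a settled definition from the proposal.

```python
from dataclasses import dataclass


@dataclass
class RequestTiming:
    """Hypothetical per-request timestamps in seconds (not TGI's data model)."""
    arrived: float      # request received by the server
    scheduled: float    # request left the queue and joined a batch
    first_token: float  # first output token produced
    finished: float     # last output token produced
    inference_s: float  # time actually spent on inference for this request


def derive_metrics_ms(t: RequestTiming) -> dict[str, float]:
    """Derive a few of the proposed latency metrics, in milliseconds."""
    total_s = t.finished - t.arrived
    return {
        "request_queue_time": (t.scheduled - t.arrived) * 1000.0,
        # Defined in the table as total time minus time spent on inference.
        "request_wait_time": (total_s - t.inference_s) * 1000.0,
        # Assumption: measured from arrival, so it includes queueing.
        "time_to_first_token": (t.first_token - t.arrived) * 1000.0,
    }


timing = RequestTiming(
    arrived=0.0, scheduled=0.25, first_token=0.5, finished=2.0, inference_s=1.5
)
metrics = derive_metrics_ms(timing)
```

Each derived value would then be recorded as one observation in the corresponding histogram.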

Additional Context

I believe TGI already uses OTel. OTel is in the process of adding support for LLM metrics, which TGI may be able to piggyback on for some of the above. For reference, see OTel's LLM Semantic Convention WG (please request access if you are not able to view it).

cc @Narsil @drbh

Motivation

If added, these metrics would make it easier for orchestrators like Kubernetes to autoscale TGI servers and distribute load across them more efficiently. We have a proposal in the Kubernetes Serving WG to add these metrics to popular model servers, and we want to add them to TGI as well.
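
As a rough illustration of the autoscaling use case: the Kubernetes Horizontal Pod Autoscaler computes the desired replica count from the ratio of an observed metric to its target. The sketch below assumes a hypothetical setup in which average `request_queue_time` is exposed to the autoscaler as a custom metric with a 100 ms target. Both the choice of metric and the target are illustrative, not part of the proposal, and the real controller also applies a tolerance and stabilization rules that are omitted here.

```python
import math


def desired_replicas(current_replicas: int, current_value: float, target_value: float) -> int:
    """HPA-style scaling rule: ceil(currentReplicas * currentValue / targetValue)."""
    if target_value <= 0:
        raise ValueError("target_value must be positive")
    # Never scale below one replica in this simplified sketch.
    return max(1, math.ceil(current_replicas * current_value / target_value))


# Hypothetical numbers: 4 TGI pods, 300 ms average queue time, 100 ms target.
scale_up = desired_replicas(4, current_value=300.0, target_value=100.0)
# Same deployment when the queue time drops to 50 ms.
scale_down = desired_replicas(4, current_value=50.0, target_value=100.0)
```

With these inputs the rule asks for 12 replicas under load and 2 once the queue drains.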

The proposal, which lists the metrics we want to add and the reasoning behind them, is in this Google doc: https://docs.google.com/document/d/1SpSp1E6moa4HSrJnS4x3NpLuj88sMXr2tbofKlzTZpk/edit?usp=sharing&resourcekey=0-ob5dR-AJxLQ5SvPlA4rdsg (please request access if you are not able to view it).

Your contribution

I am happy to shepherd this work from the K8s WG side. I can contribute code as my bandwidth permits and where it makes sense. That said, I am not yet very familiar with the TGI code base, so it would be great to have one or more champions from the TGI contributor side as well.

EandrewJones, May 29 '24 18:05