text-generation-inference
[Feature]: Additional metrics to enable better autoscaling / load balancing of TGI servers in Kubernetes
Feature request
TGI provides some valuable metrics on model performance and load today. However, there are still a number of missing metrics, the absence of which poses a challenge for orchestration and autoscaling in Kubernetes.
Here is a list of the metrics that we (K8s Serving WG, see below) have identified for inclusion in TGI:
Metric Name | Type | Unit | Implemented by TGI Already
---|---|---|---
model_load_time | Counter | Seconds |
time_per_output_token_per_batch_size | Histogram | Milliseconds |
request_wait_time (total time minus time spent on inference) | Histogram | Milliseconds |
request_queue_time | Histogram | Milliseconds | ? (tgi_request_queue_duration)
max_token_capacity | Counter | Tokens |
time_per_prefill_token | Histogram | Milliseconds |
total_tokens_in_current_batch | Gauge | Tokens |
time_to_first_token | Histogram | Milliseconds |
estimated_max_prefill_tokens_per_second | Gauge | Tokens |
estimated_max_batch_before_compute_saturation | Gauge | Tokens |
request_input_length | Histogram | Tokens | ✓ (tgi_request_input_length)
request_output_length | Histogram | Tokens | ✓ (tgi_request_generated_tokens)
request_with_evicted_tokens | Counter | Count |
total_evicted_tokens | Counter | Tokens |
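To make the timing metrics in the table concrete, here is a minimal sketch (plain Python, not TGI code; the timestamp field names are hypothetical) of how a server could derive several of them from per-request timestamps:

```python
from dataclasses import dataclass

@dataclass
class RequestTimestamps:
    # All timestamps in seconds from a monotonic clock (e.g. time.monotonic()).
    # These field names are illustrative, not TGI internals.
    arrived: float            # request accepted by the router
    scheduled: float          # request left the queue and joined a batch
    first_token: float        # first output token emitted
    finished: float           # last output token emitted
    num_output_tokens: int

def queue_time_ms(t: RequestTimestamps) -> float:
    """request_queue_time: time spent waiting before joining a batch."""
    return (t.scheduled - t.arrived) * 1000

def time_to_first_token_ms(t: RequestTimestamps) -> float:
    """time_to_first_token: arrival to first emitted token (includes queueing + prefill)."""
    return (t.first_token - t.arrived) * 1000

def time_per_output_token_ms(t: RequestTimestamps) -> float:
    """Mean inter-token latency over the decode phase (N tokens -> N-1 gaps)."""
    decode_gaps = max(t.num_output_tokens - 1, 1)
    return (t.finished - t.first_token) * 1000 / decode_gaps

# Example: queued 50 ms, first token at 150 ms, 100 more tokens over 1 s.
ts = RequestTimestamps(arrived=0.0, scheduled=0.050, first_token=0.150,
                       finished=1.150, num_output_tokens=101)
print(queue_time_ms(ts))             # ≈ 50 ms
print(time_to_first_token_ms(ts))    # ≈ 150 ms
print(time_per_output_token_ms(ts))  # ≈ 10 ms/token
```

In practice each value would be recorded into a Prometheus histogram rather than returned, so that an autoscaler can act on percentiles instead of means.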
Additional Context
I believe TGI already uses OTel. OTel is in the process of adding support for LLM metrics, which TGI may be able to piggyback on for some of the above. For reference, see OTel's LLM Semantic Convention WG (please request access if you are not able to view it).
cc @Narsil @drbh
Motivation
If added, these metrics would make it easier for orchestrators like Kubernetes to autoscale TGI servers and distribute load more efficiently. We have a proposal in the Kubernetes Serving WG to add these additional metrics to popular model servers, and we want to add them to TGI as well.
Google doc link to the proposal which has the set of metrics we want to add and the reasoning behind it - https://docs.google.com/document/d/1SpSp1E6moa4HSrJnS4x3NpLuj88sMXr2tbofKlzTZpk/edit?usp=sharing&resourcekey=0-ob5dR-AJxLQ5SvPlA4rdsg. (Please request access if you are not able to view it)
Your contribution
I am happy to shepherd this work from the K8s WG side. I can contribute code as my bandwidth permits and where it makes sense. That said, I am not yet very familiar with the TGI code base, so it would be great to have one or more champions from the TGI contributor side as well.