text-generation-inference
[Feature]: Additional metrics to enable better autoscaling / load balancing of TGI servers in Kubernetes
Feature request
TGI provides some valuable metrics on model performance and load today. However, there are still a number of missing metrics, the absence of which poses a challenge for orchestration and autoscaling in Kubernetes.
Here is a list of the metrics that we (K8s Serving WG, see below) have identified for inclusion in TGI:
Metric Name | Type | Unit | Implemented by TGI Already
---|---|---|---
model_load_time | Counter | Seconds |
time_per_output_token_per_batch_size | Histogram | Milliseconds |
request_wait_time (total time minus time spent on inference) | Histogram | Milliseconds |
request_queue_time | Histogram | Milliseconds | ? (tgi_request_queue_duration)
max_token_capacity | Counter | Tokens |
time_per_prefill_token | Histogram | Milliseconds |
total_tokens_in_current_batch | Gauge | Tokens |
time_to_first_token | Histogram | Milliseconds |
estimated_max_prefill_tokens_per_second | Gauge | Tokens |
estimated_max_batch_before_compute_saturation | Gauge | Tokens |
request_input_length | Histogram | Tokens | ✓ (tgi_request_input_length)
request_output_length | Histogram | Tokens | ✓ (tgi_request_generated_tokens)
request_with_evicted_tokens | Counter | Count |
total_evicted_tokens | Counter | Tokens |
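To make the timing metrics in the table concrete, here is a minimal sketch (plain Python, not TGI code; the timestamp field names are hypothetical) of how a server could derive several of them from per-request timestamps:

```python
from dataclasses import dataclass

@dataclass
class RequestTimestamps:
    # All timestamps in seconds from a monotonic clock (e.g. time.monotonic()).
    # These field names are illustrative, not TGI internals.
    arrived: float            # request accepted by the router
    scheduled: float          # request left the queue and joined a batch
    first_token: float        # first output token emitted
    finished: float           # last output token emitted
    num_output_tokens: int

def queue_time_ms(t: RequestTimestamps) -> float:
    """request_queue_time: time spent waiting before joining a batch."""
    return (t.scheduled - t.arrived) * 1000

def time_to_first_token_ms(t: RequestTimestamps) -> float:
    """time_to_first_token: arrival to first emitted token (includes queueing + prefill)."""
    return (t.first_token - t.arrived) * 1000

def time_per_output_token_ms(t: RequestTimestamps) -> float:
    """Mean inter-token latency over the decode phase (N tokens -> N-1 gaps)."""
    decode_gaps = max(t.num_output_tokens - 1, 1)
    return (t.finished - t.first_token) * 1000 / decode_gaps

# Example: queued 50 ms, first token at 150 ms, 100 more tokens over 1 s.
ts = RequestTimestamps(arrived=0.0, scheduled=0.050, first_token=0.150,
                       finished=1.150, num_output_tokens=101)
print(queue_time_ms(ts))             # ≈ 50 ms
print(time_to_first_token_ms(ts))    # ≈ 150 ms
print(time_per_output_token_ms(ts))  # ≈ 10 ms/token
```

In practice each value would be recorded into a Prometheus histogram rather than returned, so that an autoscaler can act on percentiles instead of means.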
Additional Context
I believe TGI already uses OTel. OTel is in the process of adding support for LLM metrics, which TGI may be able to piggyback on for some of the above. For reference, see OTel's LLM Semantic Convention WG (please request access if you are not able to view it).
cc @Narsil @drbh
Motivation
If added, these metrics would make it easier for orchestrators like Kubernetes to autoscale TGI servers and distribute load more efficiently. We have a proposal in the Kubernetes Serving WG to add these additional metrics to popular model servers, and we want to add them to TGI as well.
Google doc link to the proposal which has the set of metrics we want to add and the reasoning behind it - https://docs.google.com/document/d/1SpSp1E6moa4HSrJnS4x3NpLuj88sMXr2tbofKlzTZpk/edit?usp=sharing&resourcekey=0-ob5dR-AJxLQ5SvPlA4rdsg. (Please request access if you are not able to view it)
Your contribution
I am happy to shepherd this work from the K8s WG side. I can contribute code as my bandwidth permits and where it makes sense. That said, I am not yet very familiar with the TGI code base, so it would be great to have one or more champions from the TGI contributor side as well.