tensorrtllm_backend
Add composite metrics for kubernetes inference gateway metrics protocol
In order to integrate Triton Inference Server (specifically the TensorRT-LLM backend) with the Gateway API Inference Extension, it must adhere to the Gateway's Model Server Protocol. This protocol requires the model server to publish the following Prometheus metrics under a consistent family and set of labels:
- TotalQueuedRequests
- KVCacheUtilization
Currently the TensorRT-LLM backend pipes the following TensorRT-LLM batch manager statistics through as Prometheus metrics:
- Active Request Count
- Scheduled Requests
- Max KV cache blocks
- Used KV cache blocks
These are realized as the following prometheus metrics:
```
nv_trt_llm_request_metrics{model="tensorrt_llm",request_type="active",version="1"}
nv_trt_llm_request_metrics{model="tensorrt_llm",request_type="scheduled",version="1"}
...
nv_trt_llm_kv_cache_block_metrics{kv_cache_block_type="max",model="tensorrt_llm",version="1"}
nv_trt_llm_kv_cache_block_metrics{kv_cache_block_type="used",model="tensorrt_llm",version="1"}
```
These existing metrics are sufficient to compose the Gateway metrics by adding the following new metrics:
- Waiting Requests = Active Request Count - Scheduled Requests
- Fraction used KV cache blocks = Used KV cache blocks / Max KV cache blocks
and add these to the existing prometheus metrics:
```
nv_trt_llm_request_metrics{model="tensorrt_llm",request_type="active",version="1"}
nv_trt_llm_request_metrics{model="tensorrt_llm",request_type="scheduled",version="1"}
nv_trt_llm_request_metrics{model="tensorrt_llm",request_type="waiting",version="1"}
...
nv_trt_llm_kv_cache_block_metrics{kv_cache_block_type="max",model="tensorrt_llm",version="1"}
nv_trt_llm_kv_cache_block_metrics{kv_cache_block_type="used",model="tensorrt_llm",version="1"}
nv_trt_llm_kv_cache_block_metrics{kv_cache_block_type="fraction",model="tensorrt_llm",version="1"}
```
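The arithmetic behind the two composite metrics can be sketched as follows. This is a minimal Python sketch; the function names and sample values are illustrative and not part of the backend:

```python
def waiting_requests(active: int, scheduled: int) -> int:
    """Requests admitted to the batch manager but not yet scheduled."""
    return active - scheduled


def kv_cache_utilization(used_blocks: int, max_blocks: int) -> float:
    """Fraction of KV cache blocks in use (0.0 when no blocks exist)."""
    return used_blocks / max_blocks if max_blocks else 0.0


# Hypothetical batch manager statistics:
print(waiting_requests(12, 8))          # 12 active - 8 scheduled -> 4
print(kv_cache_utilization(250, 1000))  # 250 used / 1000 max -> 0.25
```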
These can then be mapped directly to the metrics in the Gateway protocol, allowing integration with the Gateway's Endpoint Picker for load balancing.
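To illustrate the mapping, the sketch below pulls the two protocol-level quantities out of a Prometheus text-format scrape. The scrape payload and its values are hypothetical; only the metric families and labels come from the listing above:

```python
import re

# Hypothetical scrape of the metrics shown above (values invented for illustration).
SCRAPE = """\
nv_trt_llm_request_metrics{model="tensorrt_llm",request_type="active",version="1"} 12
nv_trt_llm_request_metrics{model="tensorrt_llm",request_type="scheduled",version="1"} 8
nv_trt_llm_request_metrics{model="tensorrt_llm",request_type="waiting",version="1"} 4
nv_trt_llm_kv_cache_block_metrics{kv_cache_block_type="max",model="tensorrt_llm",version="1"} 1000
nv_trt_llm_kv_cache_block_metrics{kv_cache_block_type="used",model="tensorrt_llm",version="1"} 250
nv_trt_llm_kv_cache_block_metrics{kv_cache_block_type="fraction",model="tensorrt_llm",version="1"} 0.25
"""


def sample(text: str, family: str, label: str, value: str) -> float:
    """Pull one sample value out of Prometheus text-format output."""
    pattern = family + r'\{[^}]*' + label + '="' + value + r'"[^}]*\}\s+([0-9.eE+-]+)'
    match = re.search(pattern, text)
    if match is None:
        raise KeyError(f"{family} with {label}={value} not found")
    return float(match.group(1))


# The two quantities the Model Server Protocol asks for:
total_queued_requests = sample(SCRAPE, "nv_trt_llm_request_metrics",
                               "request_type", "waiting")
kv_cache_utilization = sample(SCRAPE, "nv_trt_llm_kv_cache_block_metrics",
                              "kv_cache_block_type", "fraction")
print(total_queued_requests, kv_cache_utilization)
```

With the two composite samples exposed, the Endpoint Picker's scrape needs only this kind of direct lookup rather than any cross-metric arithmetic of its own.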