Histogram Metric for multi-instance tail latency aggregation
Is your feature request related to a problem? Please describe. This issue is similar to the one mentioned here: https://github.com/triton-inference-server/server/issues/7287. I'd like to file an issue for a histogram metric in Triton core. I remember this being mentioned as part of the backlog in the previous issue, but I'd like to have it here for tracking purposes.
Currently it isn't possible to calculate 95th or 99th percentile latencies when we deploy multiple Triton servers to host models. This also makes scaling decisions imprecise.
Describe the solution you'd like Use histogram metrics (queried with histogram_quantile) instead of summaries in the Prometheus metrics exporter: https://prometheus.io/docs/practices/histograms/#:~:text=The%20essential%20difference%20between%20summaries,the%20server%20side%20using%20the
Describe alternatives you've considered I've tried the wrong way of doing it (averaging the P95 across different Triton servers), but that is not the true P95.
I've also tried using the average queue time/latency instead (keeping those very low), but that doesn't help with tail latencies during spikes.
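To make the difference concrete, here is a rough, self-contained Python sketch (made-up latencies and illustrative bucket boundaries only): averaging per-instance P95s drifts away from the true P95, while summing histogram buckets that share the same boundaries across instances and interpolating (roughly what Prometheus' histogram_quantile() does) gives a close estimate.

```python
import random

# Simulate per-instance request latencies (ms) for two Triton servers.
# These numbers are made up purely for illustration.
random.seed(0)
instance_a = [random.gauss(20, 5) for _ in range(10_000)]   # lightly loaded
instance_b = [random.gauss(80, 30) for _ in range(2_000)]   # loaded, spiky

def exact_p95(samples):
    s = sorted(samples)
    return s[int(0.95 * len(s)) - 1]

# "Wrong way": average the per-instance P95s (all a Summary lets you do).
avg_of_p95 = (exact_p95(instance_a) + exact_p95(instance_b)) / 2

# Histogram way: each instance only needs to export cumulative bucket counts
# for the same "le" boundaries; the counts can be summed across instances and
# the quantile estimated from the merged distribution.
buckets = [5, 10, 25, 50, 100, 250, 500, float("inf")]  # shared boundaries

def bucket_counts(samples):
    # Cumulative counts, one per "le" boundary.
    return [sum(1 for x in samples if x <= le) for le in buckets]

merged = [a + b for a, b in zip(bucket_counts(instance_a), bucket_counts(instance_b))]

def quantile_from_buckets(q, boundaries, cumulative_counts):
    total = cumulative_counts[-1]
    rank = q * total
    prev_le, prev_count = 0.0, 0
    for le, count in zip(boundaries, cumulative_counts):
        if count >= rank:
            if le == float("inf"):
                # Quantile falls in the +Inf bucket: report the last finite bound.
                return prev_le
            # Linear interpolation inside the bucket, like histogram_quantile().
            return prev_le + (le - prev_le) * (rank - prev_count) / (count - prev_count)
        prev_le, prev_count = le, count

true_p95 = exact_p95(instance_a + instance_b)
print(f"true P95            : {true_p95:.1f} ms")
print(f"avg of per-node P95 : {avg_of_p95:.1f} ms   (misleading)")
print(f"merged-histogram P95: {quantile_from_buckets(0.95, buckets, merged):.1f} ms")
```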
cc: @rmccorm4
Hi @AshwinAmbal, thanks for adding a ticket for tracking!
CC @yinggeh @harryskim @statiraju for viz
Hi @rmccorm4 @yinggeh @harryskim @statiraju I had a question about the histogram support added in response to this issue. Does the histogram metric shown here: https://github.com/triton-inference-server/server/blob/main/docs/user_guide/metrics.md#histograms apply to end-to-end latencies in traditional (non-LLM) ML systems as well?
If we have a standard one-request/one-response ML system, will the metric nv_inference_first_response_histogram_ms give the end-to-end latency, which can then be aggregated across multiple Triton instances?
TL;DR: for my specific case, can I treat the above metric as the "histogram" version of the summary metric nv_inference_request_summary_us?
Thanks
Hi @AshwinAmbal. I am happy to answer your questions.
Currently, the TTFT histogram metric only appears for decoupled models. If your traditional ML model is decoupled, you should see this metric. If your model is not decoupled, are you able to modify it into a decoupled model that always returns exactly one response? See https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/user_guide/decoupled_models.html
Note: Histogram metrics provide the sum, the count, and the distribution of the data. You cannot infer the exact latency of a single request from the histogram directly. You can only aggregate across multiple Triton instances if their histogram bucket boundaries are the same.
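For illustration, assuming the model (or a thin wrapper around it) uses the Python backend, a minimal sketch of a decoupled model.py that always sends exactly one response could look like the following (INPUT0/OUTPUT0 are placeholder tensor names; the model config would also need model_transaction_policy { decoupled: True }):

```python
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    """Decoupled-mode model that still behaves like 1 request -> 1 response."""

    def execute(self, requests):
        for request in requests:
            # In decoupled mode, responses go through a response sender
            # instead of being returned from execute().
            sender = request.get_response_sender()

            # Placeholder passthrough "inference": echo INPUT0 as OUTPUT0.
            input_tensor = pb_utils.get_input_tensor_by_name(request, "INPUT0")
            output_tensor = pb_utils.Tensor("OUTPUT0", input_tensor.as_numpy())
            response = pb_utils.InferenceResponse(output_tensors=[output_tensor])

            # Send the single response and mark the request complete.
            sender.send(
                response, flags=pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL
            )

        # Decoupled models return None from execute().
        return None
```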
@yinggeh thanks for the speedy response.
I may not be able to use decoupled mode easily because it requires gRPC bidirectional streaming from the client side, which would be a limitation for the different clients that connect to my service.
Btw, my goal is simple: I want to compute P50, P95 and P99 latencies across all my Triton servers running together in K8s.
Currently, each server emits the summary metric nv_inference_request_summary_us, which is not aggregatable across multiple Triton servers (it doesn't emit the underlying distribution and only calculates percentiles locally over the requests received by that individual server).
Similar work was done for the Triton Python backend, as seen in #7287, hence my request to add the same to Triton core so that it can be used in general.
Any idea if this is still being planned to be delivered or if there is another way to achieve the same?
@AshwinAmbal Correct. You cannot directly aggregate percentiles from a Summary metric. To compute percentiles using a Histogram, refer to Prometheus’ histogram_quantile() function.
When I added the nv_inference_first_response_histogram_ms metric to Triton, two new metrics, TTFT and TPOT (the latter not yet available), were intended to be introduced for LLM models. I expanded the coverage to decoupled models and renamed "time to first token" to "time to first response". While this metric was designed for decoupled models, it should also work with non-decoupled models, though the naming might be odd.
If this is urgent, you can try the following modification and rebuild the server. Let me know what you think.
--- a/src/metric_model_reporter.cc
+++ b/src/metric_model_reporter.cc
@@ -308,10 +308,8 @@ MetricModelReporter::InitializeHistograms(
   // Update MetricReporterConfig::metric_map_ for new histograms.
   // Only create response metrics if decoupled model to reduce metric output
   if (config_.latency_histograms_enabled_) {
-    if (config_.is_decoupled_) {
       histogram_families_[kFirstResponseHistogram] =
           &Metrics::FamilyFirstResponseDuration();
-    }
   }

   for (auto& iter : histogram_families_) {
> Any idea if this is still being planned to be delivered or if there is another way to achieve the same?

Can you provide a list of the latency histograms you would like for your deployment?
@yinggeh This is important for auto-scaling Triton servers in Kubernetes based on latency metrics; currently this isn't possible.
Ideally what I would like to see is all the summary metrics supported today also available as histogram type.
In particular, I am interested mainly in nv_inference_request_summary_us as a histogram metric.
I can try the workaround you suggested to see if it works, but it would be nice to have a proper solution, since this seems like a foundational metric to me. WDYT?
hi @AshwinAmbal does --metrics-config histogram_latencies=true work for you to display histograms? It's not displaying any histograms for me...
Any idea why histograms are not available when using --metrics-config histogram_latencies=true? I have an ensemble model with: pre-processing (CPU) -> TRT model (GPU) -> post-processing (CPU)
cc @rmccorm4 @yinggeh @harryskim @statiraju
@geraldstanje I believe there is only one histogram metric available today, nv_inference_first_response_histogram_ms, according to the Triton docs here, and from @yinggeh's description it only works for LLM/decoupled models.
As for how they work, I'll have to redirect you to a Triton maintainer. I haven't started using the histogram metric myself because the one I want isn't available yet.
Gentle bump to @yinggeh for this and my previous query please.
Thanks
@AshwinAmbal I use --metrics-config histogram_latencies=true and don't see any histogram. Do you see nv_inference_first_response_histogram_ms?
What are LLM/decoupled models? I use a ModernBERT model with TensorRT: pre-processing (CPU) -> TRT model (the ModernBERT model running on GPU) -> post-processing (CPU)
Hi @geraldstanje. Please find details on decoupled models here https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/user_guide/decoupled_models.html. In short, decoupled models can send zero or multiple responses for a request.
To enable decoupled mode, you need to configure the model properly with
model_transaction_policy {
  decoupled: True
}
@yinggeh could you explain a bit more about the reasoning for only enabling histogram metrics for decoupled models? It would be useful to have in order to track the P99 end-to-end latency of our Triton server, given that the available metrics only let us see the average latency. Thanks.
Hi @amannedd. Sorry for the late response. At that time, we had received requests to add two histogram metrics for LLM models (TTFT and TPOT). LLM models are by nature decoupled models in Triton.
If you feel that particular histogram metrics for non-decoupled models would be useful in your production, I encourage you to post a feature request at https://github.com/triton-inference-server/server/issues with the expected behavior and output. Thanks.