
What's the query to calculate Triton model latency per request? Is it nv_inference_request_duration_us / nv_inference_exec_count + nv_inference_queue_duration_us?

Open jayakommuru opened this issue 1 year ago • 1 comment

We are benchmarking Triton with different backends, but we are unable to find the right metric to calculate the latency of each request (assume each request has a batch size of b).

  1. Is request latency = rate(nv_inference_request_duration_us[1m]) / rate(nv_inference_exec_count[1m]) + nv_inference_queue_duration_us?
  2. Does nv_inference_request_duration_us include the queuing duration as well? The documentation says it is cumulative. Can anyone confirm?
  3. Are the compute_input and compute_output durations also included in nv_inference_request_duration_us?
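For context, a minimal sketch of the arithmetic behind query 1, assuming (as the Triton metrics documentation describes) that the duration and count metrics are cumulative counters, so a per-request average over a window is the delta in accumulated duration divided by the delta in request count (the same shape as rate(duration)/rate(count) in PromQL). The metric names and the example values below are illustrative, not an authoritative answer to the question:

```python
# Sketch: average per-request latency from two scrapes of cumulative
# Triton counters (e.g. nv_inference_request_duration_us and a request
# counter). Assumes both metrics only ever increase between scrapes.

def avg_request_latency_us(dur_start, dur_end, count_start, count_end):
    """Average end-to-end latency in microseconds per request over a
    scrape window: delta(duration) / delta(request count)."""
    requests = count_end - count_start
    if requests == 0:
        return 0.0  # no requests completed in the window
    return (dur_end - dur_start) / requests

# 500_000 us accumulated across 50 new requests -> 10_000 us per request
print(avg_request_latency_us(1_000_000, 1_500_000, 100, 150))  # 10000.0
```

Note that dividing by nv_inference_exec_count instead would give latency per batched *execution*, not per request, since dynamic batching can fold multiple requests into one execution.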

jayakommuru avatar Oct 11 '24 00:10 jayakommuru

@oandreeva-nv can you help with this ?

jayakommuru avatar Oct 11 '24 00:10 jayakommuru