
What's the query to calculate Triton model latency per request? Is it nv_inference_request_duration_us / nv_inference_exec_count + nv_inference_queue_duration_us?

Open jayakommuru opened this issue 1 year ago • 1 comment

We are benchmarking Triton with different backends, but we are unable to find the right metric to calculate the latency of each request (assume each request has a batch size of b).

  1. Is request latency = rate(nv_inference_request_duration_us[1m]) / rate(nv_inference_exec_count[1m]) + nv_inference_queue_duration_us?
  2. Does nv_inference_request_duration_us include the queuing duration as well? The documentation says it is cumulative. Can anyone confirm?
  3. Are the compute_input and compute_output durations also included in nv_inference_request_duration_us?
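For context, a minimal sketch of the arithmetic behind query 1, assuming (as the Triton metrics documentation describes) that the duration and count metrics are cumulative counters, so a per-request average over a window is the delta in accumulated duration divided by the delta in request count (the same shape as rate(duration)/rate(count) in PromQL). The metric names and the example values below are illustrative, not an authoritative answer to the question:

```python
# Sketch: average per-request latency from two scrapes of cumulative
# Triton counters (e.g. nv_inference_request_duration_us and a request
# counter). Assumes both metrics only ever increase between scrapes.

def avg_request_latency_us(dur_start, dur_end, count_start, count_end):
    """Average end-to-end latency in microseconds per request over a
    scrape window: delta(duration) / delta(request count)."""
    requests = count_end - count_start
    if requests == 0:
        return 0.0  # no requests completed in the window
    return (dur_end - dur_start) / requests

# 500_000 us accumulated across 50 new requests -> 10_000 us per request
print(avg_request_latency_us(1_000_000, 1_500_000, 100, 150))  # 10000.0
```

Note that dividing by nv_inference_exec_count instead would give latency per batched *execution*, not per request, since dynamic batching can fold multiple requests into one execution.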

jayakommuru avatar Oct 11 '24 00:10 jayakommuru

@oandreeva-nv can you help with this ?

jayakommuru avatar Oct 11 '24 00:10 jayakommuru