
Confused about Cumulative Inference Duration vs. PredictionTime

Open Hegelim opened this issue 3 years ago • 3 comments

📚 The doc issue

I am running a model on TorchServe and trying to measure how long inference takes. If I enable logging and view the logs, I can see a metric called PredictionTime (screenshot attached).

However, if I use the Metrics API, I get something called "Cumulative Inference Duration" (screenshot attached).

The values of these two are very different, so I am not sure which one I should use to measure the total inference time for my requests.

By the way, there is also a metric called HandlerTime in the logs (screenshot attached).

What does it mean? Where can I find documentation on what each of these metrics means?

Thanks,

Suggest a potential alternative/fix

No response

Hegelim avatar Jun 20 '22 18:06 Hegelim

Thanks for all the recent feedback @Hegelim. Since you're exploring our metrics story, I'd recommend checking out our RFC #1492 and providing any feedback there. I agree that having to figure out what each metric means from the source code is not ideal.

  • HandlerTime is the time it takes to run the handle() function in the base handler (https://github.com/pytorch/serve/blob/master/ts/torch_handler/base_handler.py#L248). This is roughly the time you would observe your handler take if you ran it outside TorchServe, as in #1538, and more importantly it is the time you control, since it depends on how slow or fast your model and handler code are.
  • PredictionTime is the time it takes the Python backend to run your handler (https://github.com/pytorch/serve/blob/master/ts/service.py#L123), so think of this as the TorchServe overhead from our Python code on top of your handler's own time.
  • Cumulative inference duration is defined here: https://github.com/pytorch/serve/blob/master/frontend/server/src/main/java/org/pytorch/serve/metrics/format/prometheous/PrometheusMetricManager.java#L14-L34. These are cumulative counters summed over requests, incremented here: https://github.com/pytorch/serve/blob/b11ba0fa97ceea7003bfa7c4fd4203f36fee3d34/frontend/server/src/main/java/org/pytorch/serve/job/RestJob.java#L73-L97. You can think of them as the end-to-end latency, including the Java frontend, the Python backend, and your handler code, accumulated across all requests rather than reported per request.
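To make the distinction concrete, here is a rough sketch in plain Python (not TorchServe source; run_model, handle, and backend_predict are made-up stand-ins) of where each timing would be taken:

```python
import time


def run_model(data):
    # stand-in for preprocess -> inference -> postprocess
    time.sleep(0.01)
    return data


def handle(data, context=None):
    # Your handler's handle() body; its wall-clock time is roughly what
    # HandlerTime reports.
    start = time.time()
    output = run_model(data)
    handler_ms = (time.time() - start) * 1000
    print(f"HandlerTime    ~ {handler_ms:.2f} ms")
    return output


def backend_predict(data, context=None):
    # The Python backend wraps the handler call; the wrapper's wall-clock
    # time (handler time plus backend overhead) is roughly what
    # PredictionTime reports.
    start = time.time()
    output = handle(data, context)
    prediction_ms = (time.time() - start) * 1000
    print(f"PredictionTime ~ {prediction_ms:.2f} ms")
    return output


if __name__ == "__main__":
    backend_predict([b"example"])

# The Java frontend then adds each request's full end-to-end latency to a
# cumulative Prometheus counter; "Cumulative Inference Duration" is that
# running total across requests, not a per-request value.
```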

msaroufim avatar Jun 20 '22 19:06 msaroufim

I am still confused, maybe because I am not well versed in Java. If I understand you correctly, Cumulative inference duration should be the largest, right? Because it is the end-to-end latency including everything. However, according to my screenshots above, Cumulative inference duration is much smaller than PredictionTime: the former is around 6 ms, while the latter is around 3 s.

Hegelim avatar Jun 20 '22 21:06 Hegelim

@Hegelim

  • In model_metrics.log, HandlerTime and PredictionTime record the latency of each individual inference request.
  • In the Metrics API, the values (e.g. Cumulative Inference Duration) are aggregate statistics over all inference requests, accumulated in the Prometheus counters over the scrape interval, not per-request numbers.
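As an illustration of how the cumulative counters relate to per-request latency, here is a minimal sketch (my own, not part of TorchServe) that scrapes the metrics endpoint and divides the cumulative latency by the request count to get an average per-request figure. The endpoint URL and counter names (ts_inference_latency_microseconds, ts_inference_requests_total) are assumptions based on the default configuration; verify them against your own /metrics output.

```python
# Minimal sketch: turn the cumulative counters from the Metrics API into an
# average per-request latency. Assumes the default metrics endpoint
# http://localhost:8082/metrics and the counter names below.
import re
import urllib.request


def scrape(url="http://localhost:8082/metrics"):
    text = urllib.request.urlopen(url).read().decode()
    totals = {}
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue
        name_with_labels, _, value = line.rpartition(" ")
        name = re.sub(r"\{.*\}$", "", name_with_labels)  # drop {labels}
        totals[name] = totals.get(name, 0.0) + float(value)  # sum over models
    return totals


if __name__ == "__main__":
    m = scrape()
    total_us = m.get("ts_inference_latency_microseconds", 0.0)
    count = m.get("ts_inference_requests_total", 0.0)
    if count:
        print(f"{int(count)} requests, avg end-to-end latency "
              f"{total_us / count / 1000:.2f} ms")
```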

lxning avatar Jul 08 '22 18:07 lxning