Confused about Cumulative Inference Duration vs. PredictionTime
📚 The doc issue
I am running a model on TorchServe and I am trying to measure how long inference takes.
If I use logging and view the logs, then I can see there is something called PredictionTime:

However, if I use the Metrics API, then I get something called "Cumulative Inference Duration":

The two values are very different, so I am not sure which one I should use to measure the total inference time for my requests.
By the way, there is also something called HandlerTime in the logs:

What does it mean? Where can I find information about what these metrics mean?
Thanks,
Suggest a potential alternative/fix
No response
Thanks for all the recent feedback @Hegelim - since you're exploring our metrics story I'd recommend you check out our RFC #1492 and provide any feedback there. I agree that right now having to figure out what each metric means from the source code is not ideal.
- `HandlerTime` is the time it takes to run the `handle()` function in the base handler (https://github.com/pytorch/serve/blob/master/ts/torch_handler/base_handler.py#L248), so this is the time you would observe your handler take if you ran it without TorchServe (something like #1538). More importantly, this is the time you control, depending on how slow/fast your model is.
- `PredictionTime` is the time it takes the Python backend to run your handler (https://github.com/pytorch/serve/blob/master/ts/service.py#L123), so think of this as TorchServe overhead from our Python code.
- `Cumulative inference duration` is defined here: https://github.com/pytorch/serve/blob/master/frontend/server/src/main/java/org/pytorch/serve/metrics/format/prometheous/PrometheusMetricManager.java#L14-L34. These are cumulative counters over the number of requests and are recorded here: https://github.com/pytorch/serve/blob/b11ba0fa97ceea7003bfa7c4fd4203f36fee3d34/frontend/server/src/main/java/org/pytorch/serve/job/RestJob.java#L73-L97. You can think of these as the end-to-end overhead, including Java, Python, and your handler code.
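
To make the nesting of the first two timers concrete, here is a rough sketch (simplified, not the actual TorchServe code; `run_model` is just a stand-in for your model call):

```python
import time

def run_model(data):
    # Stand-in for your actual model forward pass.
    time.sleep(0.01)
    return data

def handle(data, context=None):
    # HandlerTime: measured inside the handler around
    # preprocess -> inference -> postprocess, i.e. the part you control.
    start = time.time()
    output = run_model(data)
    print(f"HandlerTime ~ {(time.time() - start) * 1000:.2f} ms")
    return output

def predict(data, context=None):
    # PredictionTime: measured by the Python backend around the whole
    # handler call, so it is HandlerTime plus backend overhead.
    start = time.time()
    response = handle(data, context)
    print(f"PredictionTime ~ {(time.time() - start) * 1000:.2f} ms")
    return response

if __name__ == "__main__":
    predict(["example request"])
```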
I am still confused - maybe because I am not very well versed in Java.
So if I understand you correctly, Cumulative inference duration should be the largest, right? Because it is the end-to-end overhead including everything. However, according to my screenshots above, Cumulative inference duration is much smaller than PredictionTime: the former is around 6 ms, the latter around 3 s.
@Hegelim
- in the model_metrics.log: HandlerTime/PredictionTime record each individual inference request's latency.
- in the Metrics API: the values (e.g. Cumulative Inference Duration) are aggregate statistics over all inference requests during the time interval, as calculated by Prometheus.
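
If it helps to see those aggregated counters directly, here is a minimal sketch that pulls them from the Metrics API; it assumes the default metrics address (http://127.0.0.1:8082) and just filters the Prometheus text output for inference-related lines:

```python
import requests

def fetch_inference_metrics(metrics_url="http://127.0.0.1:8082/metrics"):
    # The Metrics API returns Prometheus text format; keep only the
    # inference-related sample lines (cumulative counters over all requests).
    response = requests.get(metrics_url, timeout=5)
    response.raise_for_status()
    return [
        line for line in response.text.splitlines()
        if "inference" in line.lower() and not line.startswith("#")
    ]

if __name__ == "__main__":
    for line in fetch_inference_metrics():
        print(line)
```

Comparing those counters against the per-request HandlerTime/PredictionTime entries in model_metrics.log should make it clearer that one is a running total across requests while the other is a single request's latency.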