ml-commons
[FEATURE] Addition of metrics for local/remote ml model latency
Is your feature request related to a problem? Currently, with _ml/profile we provide the overall latency of the _ml/predict API, including the invocation call to the model, be it local or remote. There aren't any metrics that let customers see a breakdown of the latency for the _ml/predict API.
What solution would you like? We need to add P50, P75, P90, and P99 latency metrics for just the predict call to the ML model, be it local or remote.
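For illustration, here is a minimal sketch (hypothetical, not existing ml-commons code) of how the requested percentiles could be reduced from raw per-call latencies:

```python
# Hypothetical sketch: reducing raw per-call model latencies (in ms) to the
# requested percentile metrics. Not ml-commons code; sample data made up.
import numpy as np

model_call_latencies_ms = [12.5, 30.1, 47.7, 81.5, 90.0]

for p in (50, 75, 90, 99):
    print(f"p{p}: {np.percentile(model_call_latencies_ms, p):.3f} ms")
```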
> There aren't any metrics that let customers see a breakdown of the latency for the _ml/predict API.
Can you elaborate on what level of breakdown you need? The current _profile API provides both end-to-end latency and model-level latency.
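For reference, these stats can be pulled from the profile API; a minimal sketch, assuming a local cluster (endpoint per the ml-commons profile docs):

```python
# Sketch: fetch node-level ML stats from the profile API.
# Assumes a local OpenSearch cluster with ml-commons installed;
# the host/port are placeholders for your setup.
import json
import requests

resp = requests.get("http://localhost:9200/_plugins/_ml/profile")
resp.raise_for_status()
print(json.dumps(resp.json(), indent=2))
```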
This is what _ml/profile contains:
"predict_request_stats" : { # predict request stats on this node
"count" : 2, # total predict requests on this node
"max" : 89.978681, # max latency in milliseconds
"min" : 5.402,
"average" : 47.6903405,
"p50" : 47.6903405,
"p90" : 81.5210129,
"p99" : 89.13291418999998
}
The following things are not clear from this:

- End-to-end latency for the _ml/predict API --> I am assuming p90 reflects this one.
- Latency for calling the SageMaker or Bedrock API registered by the customer.

We need to make sure that the second part is highlighted clearly.
"models": {
"TJDEX44BOcjlx1BH-HyR": {
"model_state": "DEPLOYED",
"predictor": "org.opensearch.ml.engine.algorithms.remote.RemoteModel@6aa598ca",
"target_worker_nodes": [
"fIlV_8tmSpWU90nC_AjFsA",
"fGtuXeLDSXu2VfHou1OciA",
"EAU1Mpq-S9yNwFkPeViVug",
"xALf-OTSTRGC45EGqsgQaw"
],
"worker_nodes": [
"fIlV_8tmSpWU90nC_AjFsA",
"fGtuXeLDSXu2VfHou1OciA",
"EAU1Mpq-S9yNwFkPeViVug",
"xALf-OTSTRGC45EGqsgQaw"
],
"model_inference_stats": {
"count": 1,
"max": 482.676682,
"min": 482.676682,
"average": 482.676682,
"p50": 482.676682,
"p90": 482.676682,
"p99": 482.676682
},
"predict_request_stats": {
"count": 1,
"max": 486.432298,
"min": 486.432298,
"average": 486.432298,
"p50": 486.432298,
"p90": 486.432298,
"p99": 486.432298
}
}
}
```
This is a detailed example of model profiling on a node. As you can see, there are two types of stats: `model_inference_stats` and `predict_request_stats`. `predict_request_stats` is the total latency and `model_inference_stats` is the LLM/model-level latency.
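To make the breakdown concrete, here is a sketch that separates model-level latency from the remaining predict overhead, assuming a local cluster and that per-node stats are nested under a top-level `nodes` key as in the full profile response. Subtracting percentiles is only a rough proxy (percentiles don't subtract exactly), but it shows where the time is going:

```python
# Sketch: split predict latency into model time vs. remaining overhead using
# the two stats blocks shown above. Host and response-shape assumptions are
# noted in the lead-in; this is illustrative, not ml-commons code.
import requests

profile = requests.get("http://localhost:9200/_plugins/_ml/profile").json()

for node_id, node in profile.get("nodes", {}).items():
    for model_id, m in node.get("models", {}).items():
        total = m.get("predict_request_stats")
        model = m.get("model_inference_stats")
        if not (total and model):
            continue
        for p in ("p50", "p90", "p99"):
            overhead = total[p] - model[p]  # rough proxy for non-model time
            print(f"{node_id}/{model_id} {p}: "
                  f"model={model[p]:.3f} ms, overhead~{overhead:.3f} ms")
```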
I think our [documentation](https://opensearch.org/docs/latest/ml-commons-plugin/api/profile/) is not up to date here. I hope this example clears up the confusion. Thanks.
This helps a lot. Can you please make sure the documentation is also updated?