[FEATURE] Addition of metrics for local/remote ml model latency

Open shashank31mar opened this issue 1 year ago • 4 comments

Is your feature request related to a problem? Currently, with `_ml/profile`, we provide the overall latency of the `_ml/predict` API, including the invocation call to the model, be it local or remote. There are no metrics that let customers see a breakdown of the latency for the `_ml/predict` API.

What solution would you like? There is a need to add P50, P75, P90, and P99 latency metrics for just the predict call to the ML model, be it local or remote.
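
For context, a rough sketch of how such percentile metrics could be derived from a window of per-request latency samples (the function, sample values, and window size here are illustrative only, not existing ml-commons code):

```python
# Illustrative only: deriving p50/p75/p90/p99 from recent per-request
# latencies (milliseconds) using linear interpolation between ranks.
def percentile(samples, p):
    xs = sorted(samples)
    k = (len(xs) - 1) * (p / 100.0)
    lo = int(k)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (k - lo)

# Hypothetical latencies recorded for recent predict calls to the model.
latencies = [5.402, 47.69, 81.52, 89.97, 120.3, 482.67]
for p in (50, 75, 90, 99):
    print(f"p{p}: {percentile(latencies, p):.3f} ms")
```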

shashank31mar avatar Mar 08 '24 05:03 shashank31mar

> There are no metrics that let customers see a breakdown of the latency for the `_ml/predict` API.

Can you elaborate on what level of breakdown you need? The current `_profile` API provides both end-to-end latency and model-part latency.
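
For example, the per-model stats can be pulled with the Profile API (a minimal sketch, assuming an unsecured local cluster on the default port; the model ID and host are placeholders):

```python
# Minimal sketch: fetch node-level profile stats for one model.
# Host, security settings, and model ID are placeholders.
import requests

MODEL_ID = "<your-model-id>"
resp = requests.get(
    f"http://localhost:9200/_plugins/_ml/profile/models/{MODEL_ID}",
    timeout=10,
)
resp.raise_for_status()
# Per-node stats such as "predict_request_stats" appear in the response,
# as shown in the examples later in this thread.
print(resp.json())
```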

ylwu-amzn avatar Mar 12 '24 15:03 ylwu-amzn

This is what `_ml/profile` contains:

"predict_request_stats" : { # predict request stats on this node
            "count" : 2, # total predict requests on this node
            "max" : 89.978681, # max latency in milliseconds
            "min" : 5.402,
            "average" : 47.6903405,
            "p50" : 47.6903405,
            "p90" : 81.5210129,
            "p99" : 89.13291418999998
          }

The following things are not clear from this:

  1. End-to-end latency for the `_ml/predict` API --> I am assuming P90 reflects this one.
  2. Latency for calling the SageMaker or Bedrock API registered by the customer (CX).

We need to make sure that the second part is surfaced clearly.

shashank31mar avatar Mar 21 '24 04:03 shashank31mar

"models": {
        "TJDEX44BOcjlx1BH-HyR": {
          "model_state": "DEPLOYED",
          "predictor": "org.opensearch.ml.engine.algorithms.remote.RemoteModel@6aa598ca",
          "target_worker_nodes": [
            "fIlV_8tmSpWU90nC_AjFsA",
            "fGtuXeLDSXu2VfHou1OciA",
            "EAU1Mpq-S9yNwFkPeViVug",
            "xALf-OTSTRGC45EGqsgQaw"
          ],
          "worker_nodes": [
            "fIlV_8tmSpWU90nC_AjFsA",
            "fGtuXeLDSXu2VfHou1OciA",
            "EAU1Mpq-S9yNwFkPeViVug",
            "xALf-OTSTRGC45EGqsgQaw"
          ],
          "model_inference_stats": {
            "count": 1,
            "max": 482.676682,
            "min": 482.676682,
            "average": 482.676682,
            "p50": 482.676682,
            "p90": 482.676682,
            "p99": 482.676682
          },
          "predict_request_stats": {
            "count": 1,
            "max": 486.432298,
            "min": 486.432298,
            "average": 486.432298,
            "p50": 486.432298,
            "p90": 486.432298,
            "p99": 486.432298
          }
        }
      }
      ```
      
This is a detailed example of model profiling on a node. As you can see, there are two types of stats: `model_inference_stats` and `predict_request_stats`. `predict_request_stats` is the total latency of the predict request, and `model_inference_stats` is the LLM/model-level latency.
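
To make the breakdown concrete with the numbers above, the overhead outside the model call on this node is roughly `predict_request_stats.average - model_inference_stats.average` = 486.432298 - 482.676682 ≈ 3.76 ms. A small sketch of extracting that from a response shaped like the example (field names are taken from the example; the helper itself is illustrative):

```python
# Illustrative helper: per-model latency spent outside the actual model call,
# computed from a profile response shaped like the example above.
def model_overhead_ms(models: dict) -> dict:
    return {
        model_id: stats["predict_request_stats"]["average"]
                  - stats["model_inference_stats"]["average"]
        for model_id, stats in models.items()
    }

# With the example values: 486.432298 - 482.676682 ≈ 3.76 ms of time spent
# outside the LLM/remote model invocation on this node.
```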

I think our [documentation](https://opensearch.org/docs/latest/ml-commons-plugin/api/profile/) is not up to date here. I hope this example clears up the confusion. Thanks.

dhrubo-os avatar Mar 21 '24 06:03 dhrubo-os

This helps a lot. Can you please make sure the documentation is also updated?

shashank31mar avatar Mar 21 '24 06:03 shashank31mar