[FEATURE] Addition of metrics for local/remote ml model latency

Open shashank31mar opened this issue 1 year ago • 4 comments

Is your feature request related to a problem? Currently, with `_ml/profile`, we provide the overall latency of the `_ml/predict` API, including the invocation call to the model, be it local or remote. There are no metrics that let customers see a breakdown of the latency for the `_ml/predict` API.

What solution would you like? There is a need to add P50, P75, P90, and P99 latency metrics for just the predict call to the ML model, be it local or remote.
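
For context, a rough sketch of how such percentile metrics could be derived from a window of per-request latency samples (the function, sample values, and window size here are illustrative only, not existing ml-commons code):

```python
# Illustrative only: deriving p50/p75/p90/p99 from recent per-request
# latencies (milliseconds) using linear interpolation between ranks.
def percentile(samples, p):
    xs = sorted(samples)
    k = (len(xs) - 1) * (p / 100.0)
    lo = int(k)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (k - lo)

# Hypothetical latencies recorded for recent predict calls to the model.
latencies = [5.402, 47.69, 81.52, 89.97, 120.3, 482.67]
for p in (50, 75, 90, 99):
    print(f"p{p}: {percentile(latencies, p):.3f} ms")
```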

shashank31mar avatar Mar 08 '24 05:03 shashank31mar

> There are no metrics that let customers see a breakdown of the latency for the `_ml/predict` API.

Can you elaborate on what level of breakdown you need? The current `_profile` API provides both end-to-end latency and model-part latency.
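
For example, the per-model stats can be pulled with the Profile API (a minimal sketch, assuming an unsecured local cluster on the default port; the model ID and host are placeholders):

```python
# Minimal sketch: fetch node-level profile stats for one model.
# Host, security settings, and model ID are placeholders.
import requests

MODEL_ID = "<your-model-id>"
resp = requests.get(
    f"http://localhost:9200/_plugins/_ml/profile/models/{MODEL_ID}",
    timeout=10,
)
resp.raise_for_status()
# Per-node stats such as "predict_request_stats" appear in the response,
# as shown in the examples later in this thread.
print(resp.json())
```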

ylwu-amzn avatar Mar 12 '24 15:03 ylwu-amzn

This is what `_ml/profile` contains:

"predict_request_stats" : { # predict request stats on this node
            "count" : 2, # total predict requests on this node
            "max" : 89.978681, # max latency in milliseconds
            "min" : 5.402,
            "average" : 47.6903405,
            "p50" : 47.6903405,
            "p90" : 81.5210129,
            "p99" : 89.13291418999998
          }

The following things are not clear from this:

  1. End-to-end latency for the `_ml/predict` API --> I am assuming P90 reflects this one.
  2. Latency for calling the SageMaker or Bedrock API registered by the customer (CX).

We need to make sure that the second part is surfaced clearly.

shashank31mar avatar Mar 21 '24 04:03 shashank31mar

"models": {
        "TJDEX44BOcjlx1BH-HyR": {
          "model_state": "DEPLOYED",
          "predictor": "org.opensearch.ml.engine.algorithms.remote.RemoteModel@6aa598ca",
          "target_worker_nodes": [
            "fIlV_8tmSpWU90nC_AjFsA",
            "fGtuXeLDSXu2VfHou1OciA",
            "EAU1Mpq-S9yNwFkPeViVug",
            "xALf-OTSTRGC45EGqsgQaw"
          ],
          "worker_nodes": [
            "fIlV_8tmSpWU90nC_AjFsA",
            "fGtuXeLDSXu2VfHou1OciA",
            "EAU1Mpq-S9yNwFkPeViVug",
            "xALf-OTSTRGC45EGqsgQaw"
          ],
          "model_inference_stats": {
            "count": 1,
            "max": 482.676682,
            "min": 482.676682,
            "average": 482.676682,
            "p50": 482.676682,
            "p90": 482.676682,
            "p99": 482.676682
          },
          "predict_request_stats": {
            "count": 1,
            "max": 486.432298,
            "min": 486.432298,
            "average": 486.432298,
            "p50": 486.432298,
            "p90": 486.432298,
            "p99": 486.432298
          }
        }
      }
      ```
      
This is a detailed example of model profiling on a node. As you can see, there are two types of stats: `model_inference_stats` and `predict_request_stats`. `predict_request_stats` is the total latency of the predict request, and `model_inference_stats` is the LLM/model-level latency.
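
To make the breakdown concrete with the numbers above, the overhead outside the model call on this node is roughly `predict_request_stats.average - model_inference_stats.average` = 486.432298 - 482.676682 ≈ 3.76 ms. A small sketch of extracting that from a response shaped like the example (field names are taken from the example; the helper itself is illustrative):

```python
# Illustrative helper: per-model latency spent outside the actual model call,
# computed from a profile response shaped like the example above.
def model_overhead_ms(models: dict) -> dict:
    return {
        model_id: stats["predict_request_stats"]["average"]
                  - stats["model_inference_stats"]["average"]
        for model_id, stats in models.items()
    }

# With the example values: 486.432298 - 482.676682 ≈ 3.76 ms of time spent
# outside the LLM/remote model invocation on this node.
```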

I think our [documentation](https://opensearch.org/docs/latest/ml-commons-plugin/api/profile/) is not up to date here. I hope this example clears up the confusion. Thanks.

dhrubo-os avatar Mar 21 '24 06:03 dhrubo-os

This helps a lot. Can you please make sure the documentation is also updated?

shashank31mar avatar Mar 21 '24 06:03 shashank31mar