[RFC]: Metrics Refactoring

Open lxning opened this issue 3 years ago • 12 comments

TorchServe currently has two mechanisms for emitting metrics.

  1. Emit metrics to log files in a StatsD-like format (the default).

In this case, both frontend and backend metrics are recorded in log files. However, the log format is not standard StatsD: it omits the metric type (e.g. counter, gauge, timer). Users have to write regexes to parse the logs before they can build dashboards.

  2. Emit Prometheus-formatted metrics.

In this case, TorchServe currently emits only 3 metrics.

  • ts_inference_requests_total
  • ts_inference_latency_microseconds
  • ts_queue_latency_microseconds

Users cannot get model metrics or system metrics via the metrics endpoint.
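To illustrate the regex parsing that the log-based mechanism pushes onto users, here is a minimal sketch. The line shape in the comment and the sample line are assumptions made for illustration, not something this thread confirms; `LINE_RE` and `parse_metric_line` are hypothetical names. Note that the parsed record has no metric type, which is the gap described above.

```python
import re

# Assumed line shape (an illustration, not confirmed by this thread):
#   <Name>.<Unit>:<value>|#<key>:<val>,...|#hostname:<host>,timestamp:<epoch>
LINE_RE = re.compile(
    r"^(?P<name>\w+)\.(?P<unit>\w+):(?P<value>-?\d+(?:\.\d+)?)"
    r"\|#(?P<dims>[^|]*)"
    r"\|#hostname:(?P<host>[^,]+),timestamp:(?P<ts>\d+)$"
)


def parse_metric_line(line):
    """Return a dict for one StatsD-like log line, or None if it does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    # Dimensions are comma-separated key:value pairs.
    dims = dict(
        pair.split(":", 1) for pair in m.group("dims").split(",") if ":" in pair
    )
    return {
        "name": m.group("name"),
        "unit": m.group("unit"),
        "value": float(m.group("value")),
        "dimensions": dims,
        "hostname": m.group("host"),
        "timestamp": int(m.group("ts")),
        # No "type" key: the line carries no counter/gauge/timer information.
    }


# Hand-written sample line, not captured from a real server.
sample = "CPUUtilization.Percent:12.5|#Level:Host|#hostname:example-host,timestamp:1646690000"
parsed = parse_metric_line(sample)
```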

No central place to store metric definitions

Existing TorchServe metric definitions are scattered in many places, which makes it difficult for users to know which metrics are available.

Re-Design

TS_Metrics_Design.pdf

Sub tasks on frontend side

  • [x] #2133
  • [x] #2139
  • [x] #2140
  • [x] #2141

Tasks

  • [ ] https://github.com/pytorch/serve/issues/2747
  • [ ] https://github.com/pytorch/serve/issues/2794
  • [ ] https://github.com/pytorch/serve/issues/2772
  • [ ] https://github.com/pytorch/serve/issues/2795

lxning avatar Mar 07 '22 22:03 lxning

The requirements here should make sure to include:

  • Adding full coverage for frontend and backend metrics with prometheus export
  • Instructions to make it easy for anyone to export their own metrics to prometheus

msaroufim avatar Mar 22 '22 22:03 msaroufim

Yes, I can only see these 3 metrics in Prometheus. I tried to change the model log behavior with a log4j2.xml file, but with no success. How do I get system metrics in Prometheus?

thimabru1010 avatar Mar 31 '22 21:03 thimabru1010

Custom metrics would be great. I'd like to add some histogram metrics for Prometheus to see latency percentiles.

jonhilgart22 avatar May 05 '22 17:05 jonhilgart22

Refactoring the existing metrics implementation is a welcome change. After reading through the proposed design, I recommend against introducing the metrics.yaml file.

In our case, we have model-specific performance metrics that we'd like to log and monitor. The ideal scenario is for each model archive to manage its own metrics. As those archives are loaded / unloaded they would register / unregister their metrics and those changes would be reflected in the metrics exposed by the frontend.

Defining a global metrics schema in metrics.yaml would require us to update that file separately every time a model is loaded or unloaded. In addition, doing so is repetitive since the proposed MetricsCaching.addMetric method already needs us to specify all the properties that would be in metrics.yaml.

sharvil avatar May 10 '22 22:05 sharvil

@msaroufim just wanted to check in about my previous comment. Given that models can be loaded/unloaded dynamically, I'm not sure how a pre-defined metrics.yaml file will work. Can we discuss before the PR is merged?

sharvil avatar Jul 13 '22 21:07 sharvil

@joshuaan7 and @lxning are leading the design and implementation for this refactor. It sounds like you're suggesting to archive metric definitions along with the model to have model specific metrics. It's a request in line with being a multi model serving framework if you view metrics as model specific but I can also understand if there's a case to be made that metrics are a framework level configuration. We can for convenience provide a global predefined metrics.yaml. I think your ask makes sense but I'll leave it to @joshuaan7 and @lxning to decide if, when and how to stage this work.

msaroufim avatar Jul 13 '22 21:07 msaroufim

My comments in this thread revolve around two suggestions which are independent of each other, but on the topic of this metrics refactor.

1. Model-specific metrics

We have model-specific metrics that we'd like to log and monitor. In our ideal world, TorchServe would provide a mechanism for each model to log and publish those metrics just like it does for global metrics (e.g. ts_inference_latency_microseconds). Doing so would reduce code duplication and present a consistent mechanism to handle metrics.

Concretely, we have models with multiple preprocessing and inference steps. We need to capture timing metrics for each of those steps so we can identify bottlenecks and potential regressions as we update our models. TorchServe can't, in general, know what those individual preprocessing/inference steps are so the metrics collection must happen within the model's handler.
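As a rough sketch of the per-step timing described above, a handler could wrap each step in a small timing helper and later hand the collected durations to whatever metrics API is available. `StepTimer` and the step names are hypothetical and are not part of TorchServe; the "inference" body is a stand-in for a real model call.

```python
import time
from contextlib import contextmanager


class StepTimer:
    """Hypothetical helper: accumulates wall-clock time per named step."""

    def __init__(self):
        self.durations_ms = {}

    @contextmanager
    def step(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the elapsed time even if the step raises.
            elapsed_ms = (time.perf_counter() - start) * 1000.0
            self.durations_ms[name] = self.durations_ms.get(name, 0.0) + elapsed_ms


timer = StepTimer()
with timer.step("preprocess"):
    tokens = [word.lower() for word in "Hello TorchServe".split()]
with timer.step("inference"):
    result = len(tokens)  # stand-in for a real model call

# timer.durations_ms now maps each step name to its duration in milliseconds.
```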

2. Schema-free metrics

Defining an explicit schema in metrics.yaml introduces an additional step a developer must take in addition to instrumenting the code to capture metrics. But this schema is already known to TorchServe at runtime since the instrumented code must also provide this information. Moreover, it's error prone in case there's a mismatch between the schema declared in metrics.yaml and what's implemented in code.

Instead of having an explicit schema, I suggest either a) having TorchServe expose the schema it's currently using through the metrics endpoint, e.g. http://127.0.0.1:8082/metrics?schema; or b) writing out a metrics.yaml file periodically so users can see what metrics are available by inspecting that file. In either case, TorchServe does the work of producing the schema instead of a developer writing it by hand.

sharvil avatar Jul 14 '22 01:07 sharvil

@sharvil Let me try to answer your questions.

  • Existing TorchServe allows users to define custom model-specific metrics through custom-metrics-api. All metrics from the backend (i.e. model metrics) are written to model_metrics.log. I assume this is what you are using now. The problem with this solution is that the metrics have no well-defined type, so they cannot be correctly populated into the Prometheus metrics framework. For backward compatibility, TorchServe will still support metrics from custom-metrics-api and write them to model_metrics.log in the current style.

  • The proposed MetricsCaching.addMetric method is used for loading all model metrics defined in the YAML file during backend initialization. In other words, users do not need to call it again. Users can access MetricsCaching directly to get the loaded metrics defined in the YAML file. Eventually, all the model metrics defined in the YAML file will be populated into the Prometheus metrics framework by the frontend. Users can access them via 127.0.0.1:8082/metrics

  • Regarding "Schema-free metrics", you mentioned two requirements.

  1. Metrics schema endpoint: the Prometheus API will be able to provide the metrics info (e.g. example) once this [RFC] is implemented.

  2. Static metrics (i.e. a metrics config YAML file) vs. dynamic metrics: all metrics, including those from the TorchServe frontend (i.e. in the existing ts_metrics.log) and the model metrics from the backend (i.e. in the existing model_metrics.log), need to be cached and populated into Prometheus by the frontend. Considering performance and simplicity, we prefer static metrics caching in the frontend. That's one of the reasons we introduced the metrics YAML file.
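For context on the schema point: the Prometheus text exposition format carries `# HELP` and `# TYPE` comment lines next to the samples, so a client can recover a basic schema from the endpoint output. The sketch below assumes such output; the sample text is hand-written (the help string and the `model_name` label are illustrative, not captured from TorchServe), and `schema_from_exposition` is a hypothetical helper.

```python
def schema_from_exposition(text):
    """Map metric name -> {"help": ..., "type": ...} from # HELP / # TYPE lines."""
    schema = {}
    for line in text.splitlines():
        # "# TYPE <name> <type>" or "# HELP <name> <free text>"
        parts = line.split(None, 3)
        if len(parts) >= 4 and parts[0] == "#" and parts[1] in ("HELP", "TYPE"):
            entry = schema.setdefault(parts[2], {})
            entry[parts[1].lower()] = parts[3]
    return schema


# Hand-written sample in the Prometheus text exposition format.
sample = """\
# HELP ts_inference_requests_total Total number of inference requests.
# TYPE ts_inference_requests_total counter
ts_inference_requests_total{model_name="example"} 3.0
"""

schema = schema_from_exposition(sample)
```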

lxning avatar Jul 15 '22 00:07 lxning

Thanks for the detailed response.

Existing TorchServe allows users to define custom model-specific metrics through custom-metrics-api. All metrics from the backend (i.e. model metrics) are written to model_metrics.log. I assume this is what you are using now. The problem with this solution is that the metrics have no well-defined type, so they cannot be correctly populated into the Prometheus metrics framework. For backward compatibility, TorchServe will still support metrics from custom-metrics-api and write them to model_metrics.log in the current style.

I think we're saying the same thing here. The existing custom metrics API doesn't go far enough because it doesn't publish to Prometheus. We'd like the metrics API refactor to allow for custom metrics that get published to Prometheus.

The proposed MetricsCaching.addMetric method is used for loading all model metrics defined in the YAML file during backend initialization. In other words, users do not need to call it again. Users can access MetricsCaching directly to get the loaded metrics defined in the YAML file. Eventually, all the model metrics defined in the YAML file will be populated into the Prometheus metrics framework by the frontend. Users can access them via 127.0.0.1:8082/metrics

Yes, I understand the proposed API design. I'm suggesting an alternative design to support the following scenario:

  • each model handler can log model-specific metrics
  • those model-specific metrics are published to Prometheus (by the frontend, which collects metrics from each backend)
  • TorchServe already has a mechanism to dynamically load/unload models without restarting the server, so the metrics logging and collection needs to be flexible enough to also handle dynamic model loading/unloading (i.e. not just at backend initialization time)
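One way to picture this scenario is a registry whose contents follow model load and unload events. The sketch below is purely illustrative: `MetricRegistry`, its methods, and the model and metric names are hypothetical, and it does not reflect the proposed MetricsCaching design or any TorchServe API.

```python
import threading


class MetricRegistry:
    """Hypothetical frontend-side registry of metric definitions, keyed by model."""

    def __init__(self):
        self._lock = threading.Lock()
        self._by_model = {}

    def register_model(self, model_name, metric_defs):
        # metric_defs maps metric name -> metric type, e.g. {"preprocess_ms": "histogram"}
        with self._lock:
            self._by_model[model_name] = dict(metric_defs)

    def unregister_model(self, model_name):
        # Unloading a model removes its metrics; unknown models are ignored.
        with self._lock:
            self._by_model.pop(model_name, None)

    def exposed_metrics(self):
        # Sorted (model, metric, type) triples that would currently be published.
        with self._lock:
            return sorted(
                (model, name, mtype)
                for model, defs in self._by_model.items()
                for name, mtype in defs.items()
            )


registry = MetricRegistry()
registry.register_model("model_a", {"preprocess_ms": "histogram"})
registry.register_model("model_b", {"tokens_total": "counter"})
after_load = registry.exposed_metrics()

registry.unregister_model("model_a")
after_unload = registry.exposed_metrics()
```

The lock stands in for whatever synchronization a real frontend would need, since load and unload requests can arrive while metrics are being scraped.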

So maybe we should take a step back and answer two questions:

  1. Is the scenario I described above one that the TorchServe team is interested in enabling?
  2. If so, how does the proposed API support that scenario?

@lxning, what do you think?

sharvil avatar Jul 18 '22 02:07 sharvil

If the model is configured with multiple workers, is there a way to see worker-specific metrics/information in the model logs? If not, can this be added?

duk0011 avatar Aug 11 '22 23:08 duk0011

@duk0011 workers are dynamically generated. For example, the frontend will create a new worker if an existing worker dies. So in general, it is not very useful to monitor per-worker metrics, especially with elastic worker thread pooling in a future implementation. That's why TorchServe does not provide worker as a default dimension in the existing model metrics.

However, it is not a big deal to support worker as a metrics dimension. We can add the worker id (i.e. port number) to the context so that users can fetch it in the handler and emit it as a dimension value.

lxning avatar Oct 28 '22 22:10 lxning

@sharvil it is doable to support dynamic metrics configuration, but it will have a performance impact on the Model Server. So TorchServe will support static metrics in phase 1 and dynamic metrics in a later phase.

lxning avatar Oct 28 '22 22:10 lxning