machine-controller-manager

Enhance MCM metrics

Open unmarshall opened this issue 1 year ago • 0 comments

How to categorize this issue?

/area control-plane /area monitoring /kind enhancement /priority 3

What would you like to be added:

Today MCM exposes metrics that have a few shortcomings:

  • Metrics do not follow the best practices and recommendations from Prometheus (refer to this and this). We need to revisit the metrics and the labels used on them.
  • Contextual information is missing from the metrics, which prevents correlating metrics captured across different mcm and mcm-provider functions and Provider API calls.

While we recommend revisiting all the metrics, we also have concrete improvements for two recently introduced metrics:

Provider API metrics:

APIRequestDuration: For this metric we propose adding labels that capture the following:

  • Provider API operation that is invoked. Today we use service to capture this, but we should consider renaming it.
  • Driver operation under which the provider API is invoked.
  • Machine name for which this API request is made.
  • MCM reconciliation ID or run ID of the machine reconciler. The idea is to introduce a unique identifier for every reconcile run and pass it around to correlate logs and metrics.
  • MCM reconciliation flow name. We could merge this with the run ID by choosing a naming convention that covers both.

DriverAPIRequestDuration: For this metric we propose adding labels that capture the following:

  • Driver operation under which the provider API is invoked.
  • Machine name for which this API request is made.
  • MCM reconciliation ID or run ID of the machine reconciler. The idea is to introduce a unique identifier for every reconcile run and pass it around to correlate logs and metrics.
  • MCM reconciliation flow name. We could merge this with the run ID by choosing a naming convention that covers both.
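The run ID idea could look roughly like the following. This is a minimal sketch using only the Go standard library, not MCM code: the RunID type, the "<flow>/<token>" naming convention, the helper functions, and the label names (driver_operation, provider_operation, machine, run_id) are all assumptions made for illustration. The values "create", "CreateMachine", "RunInstances", and "machine-0" are example inputs only.

```go
package main

import (
	"context"
	"crypto/rand"
	"encoding/hex"
	"fmt"
)

// RunID identifies a single reconcile run (hypothetical type, not part of MCM).
type RunID struct {
	Flow  string // reconciliation flow name, e.g. "create" (illustrative value)
	Token string // random token, unique per run
}

// String merges the flow name and the token into one value.
// The "<flow>/<token>" convention is an assumption.
func (r RunID) String() string { return r.Flow + "/" + r.Token }

// NewRunID generates a run ID with a 16-character hex token.
func NewRunID(flow string) (RunID, error) {
	buf := make([]byte, 8)
	if _, err := rand.Read(buf); err != nil {
		return RunID{}, err
	}
	return RunID{Flow: flow, Token: hex.EncodeToString(buf)}, nil
}

// runIDKey is an unexported context key type, so other packages cannot collide with it.
type runIDKey struct{}

// WithRunID stores the run ID in the context passed down the reconcile call chain.
func WithRunID(ctx context.Context, id RunID) context.Context {
	return context.WithValue(ctx, runIDKey{}, id)
}

// RunIDFrom returns the run ID stored in ctx, if there is one.
func RunIDFrom(ctx context.Context) (RunID, bool) {
	id, ok := ctx.Value(runIDKey{}).(RunID)
	return id, ok
}

// driverLabels builds the label values proposed for DriverAPIRequestDuration.
// The label names are assumptions.
func driverLabels(ctx context.Context, driverOp, machine string) map[string]string {
	labels := map[string]string{
		"driver_operation": driverOp,
		"machine":          machine,
	}
	if id, ok := RunIDFrom(ctx); ok {
		labels["run_id"] = id.String()
	}
	return labels
}

// providerLabels adds the provider API operation, as proposed for APIRequestDuration.
func providerLabels(ctx context.Context, providerOp, driverOp, machine string) map[string]string {
	labels := driverLabels(ctx, driverOp, machine)
	labels["provider_operation"] = providerOp
	return labels
}

// reconcileMachine simulates one reconcile run: it creates the run ID once,
// attaches it to the context, and returns the labels a provider call would record.
func reconcileMachine(flow, machine string) (map[string]string, error) {
	id, err := NewRunID(flow)
	if err != nil {
		return nil, err
	}
	ctx := WithRunID(context.Background(), id)
	return providerLabels(ctx, "RunInstances", "CreateMachine", machine), nil
}

func main() {
	labels, err := reconcileMachine("create", "machine-0")
	if err != nil {
		panic(err)
	}
	fmt.Println(labels)
}
```

Carrying the ID in the context means the driver and provider code do not need an extra parameter, and the same value can be attached to log lines written during the run.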

Provider implementations:

  • [ ] AWS https://github.com/gardener/machine-controller-manager-provider-aws/pull/153
  • [x] Azure https://github.com/gardener/machine-controller-manager-provider-azure/pull/105
  • [ ] GCP
  • [ ] OpenStack
  • [ ] Alicloud

Why is this needed:

This allows us to observe metrics at different levels:

  • Driver API methods level
  • Machine level
  • Provider API level
  • Reconcile Flow level

unmarshall, Nov 17 '23 10:11