machine-controller-manager
Enhance MCM metrics
How to categorize this issue?
/area control-plane /area monitoring /kind enhancement /priority 3
What would you like to be added:
Today MCM exposes metrics that have a few shortcomings:
- The metrics do not follow the best practices/recommendations from Prometheus (Refer to this and this). We need to revisit the metrics and the labels that are used on them.
- Contextual information is missing from the metrics, which prevents us from correlating metrics captured across different mcm and mcm-provider functions/Provider-API calls.
While we recommend revisiting all the metrics, we also have some concrete improvements for 2 metrics that were recently introduced:
Provider API metrics:
APIRequestDuration: For this metric we propose to add additional labels which capture the following:
- Provider API Operation that is invoked. Today we use `service` to capture that, but we should consider renaming this label.
- Driver Operation under which the provider API is invoked.
- Machine name for which this API request is made
- MCM reconciliation ID or run ID of the machine reconciler. The idea is to introduce a unique identifier for every reconcile run and pass it around to correlate logs and metrics.
- MCM reconciliation flow Name - we could merge this along with run ID as well by choosing a naming convention that has both.
DriverAPIRequestDuration: For this metric we propose to add additional labels which capture the following:
- Driver Operation under which the provider API is invoked.
- Machine name for which this API request is made
- MCM reconciliation ID or run ID of the machine reconciler. The idea is to introduce a unique identifier for every reconcile run and pass it around to correlate logs and metrics.
- MCM reconciliation flow Name - we could merge this along with run ID as well by choosing a naming convention that has both.
Provider Implementations:
- [ ] AWS https://github.com/gardener/machine-controller-manager-provider-aws/pull/153
- [x] Azure https://github.com/gardener/machine-controller-manager-provider-azure/pull/105
- [ ] GCP
- [ ] Openstack
- [ ] Alicloud
Why is this needed:
This allows us to observe metrics at different levels:
- Driver API methods level
- Machine level
- Provider API level
- Reconcile Flow level