Story

As a provider, I want timely alerts raised from the metrics so that I can make informed decisions.

Motivation

MCM exposes a number of metrics, such as the number of API calls to the cloud provider and the freeze status of MCM (#189).
Defining alerts based on these metrics will help Ops react in a timely manner whenever action is required.
Challenges with the Azure cloud provider during deletion of machines (#200).

Acceptance Criteria

[ ] Define alerts for the above situations to take required action
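
One way such an alert could be derived from the existing request counters is a failure-ratio check between two scrapes. The sketch below is illustrative only: the function names, the 10% threshold, and the sample values are assumptions rather than anything agreed in this issue, and counter resets are ignored for brevity.

```python
def failure_ratio(total_start, total_end, failed_start, failed_end):
    """Fraction of cloud API requests that failed between two counter samples."""
    requests = total_end - total_start
    failures = failed_end - failed_start
    if requests <= 0:
        # No traffic in the window, so there is nothing to alert on.
        return 0.0
    return failures / requests


def should_alert(ratio, threshold=0.10):
    """Hypothetical rule: alert when more than `threshold` of requests failed."""
    return ratio > threshold


# Example: 100 requests in the window, 20 of them failed.
ratio = failure_ratio(total_start=100, total_end=200, failed_start=5, failed_end=25)
print(ratio, should_alert(ratio))
```

In a real setup this logic would more likely live in a Prometheus alerting rule than in application code; the Python form is only meant to make the arithmetic explicit.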

Definition of Done

[ ] Knowledge is distributed: Have you spread your knowledge in pair programming/code review?
[ ] Unit tests are provided: Have you written automated unit tests?
[ ] Integration tests are provided: Have you written automated integration tests?
[ ] Minimum API exposure: If you have added/changed public API, was it really necessary/is it minimal?
[ ] Operations guide: Have you updated the operations guide about ops-relevant changes?
[ ] User documentation: Have you updated the READMEs/docs/how-tos about user-relevant changes?

Possible metrics to add (rough work)

We could provide metrics on the number of machines in each status, so that filtering on status is possible (if not already exposed).
Metrics on the time taken for a machine to join could be added; this would show the overall average joining time on any provider.
When MCM scaled up or down, and when CA did.
Metrics that could help with typical DoD issues, such as a node not joining.
How long each resource (VM, disk) took to be created, especially on Azure.
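
As a rough illustration of the join-time idea above, the average could be computed per provider from a creation timestamp and a node-joined timestamp. This is a minimal sketch: the tuple layout, the function name, and the sample data are hypothetical and do not reflect MCM's actual data model.

```python
from collections import defaultdict
from datetime import datetime


def average_join_seconds(machines):
    """Average seconds from machine creation to node join, grouped by provider.

    `machines` is an iterable of (provider, created_at, joined_at) tuples;
    joined_at is None for machines that have not joined yet.
    """
    durations = defaultdict(list)
    for provider, created_at, joined_at in machines:
        if joined_at is None:
            continue  # still pending, not counted in the average
        durations[provider].append((joined_at - created_at).total_seconds())
    return {p: sum(v) / len(v) for p, v in durations.items()}


# Hypothetical sample data.
sample = [
    ("azure", datetime(2019, 1, 24, 10, 0, 0), datetime(2019, 1, 24, 10, 2, 0)),
    ("azure", datetime(2019, 1, 24, 11, 0, 0), datetime(2019, 1, 24, 11, 3, 0)),
    ("aws", datetime(2019, 1, 24, 12, 0, 0), datetime(2019, 1, 24, 12, 1, 0)),
    ("aws", datetime(2019, 1, 24, 13, 0, 0), None),
]
print(average_join_seconds(sample))
```

A histogram would probably be a better fit than a plain average once this is exposed as a metric, since it also shows slow outliers.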

Jan 24 '19 05:01 PadmaB

For now, I have at least exposed a few crucial metrics to the Gardener Prometheus. See https://github.com/gardener/gardener/pull/948.

However, we still need to enhance the metrics so that they always return a value instead of a blank result (for example mcm_cloud_api_requests_failed_total, mcm_cloud_api_requests_total, and mcm_machine_deployment_failed_machines). This should be done for all metrics before we try to create a dashboard and raise alerts. See https://github.com/gardener/gardener/pull/948#issuecomment-485757761
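
The "always return a value" requirement can be pictured as emitting a zero sample for every known label value, even before the first event is observed. The sketch below renders lines in the Prometheus text exposition format; the `provider` label and the list of providers are assumptions made for illustration, not the labels MCM actually uses.

```python
def render_counter(name, label_name, known_values, observed):
    """Render one counter with a sample for every known label value.

    Values missing from `observed` are emitted as 0, so queries and alerts
    never see an absent series.
    """
    lines = [f"# TYPE {name} counter"]
    for value in sorted(known_values):
        count = observed.get(value, 0)
        lines.append(f'{name}{{{label_name}="{value}"}} {count}')
    return "\n".join(lines)


# Hypothetical label values; only "azure" has recorded failures so far.
text = render_counter(
    "mcm_cloud_api_requests_failed_total",
    "provider",
    ["aws", "azure"],
    {"azure": 3},
)
print(text)
```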

Apr 24 '19 05:04 prashanth26

/touch /priority critical

Oct 29 '20 13:10 prashanth26

Changing the roadmap classification as this ticket speaks of "ops" and MCM metrics. This is more internal than end-user facing, although one can argue that MCM appears in our exposed monitoring. If you don't agree, please change it back, @hardikdr. It was just a gut feeling that this is maybe more relevant internally than externally.

Nov 10 '20 19:11 vlerenc

Sure, sounds good. The major part of it is for internal usage, and only an aspect is for end-users where we want to offer better observability for the worker-machines.

Nov 11 '20 03:11 hardikdr

Adding feedback from https://github.com/gardener/machine-controller-manager/issues/549, https://github.com/gardener/machine-controller-manager/issues/528

[ ] Add a metric to let end-users know why MCM has replaced a node.
[ ] Expose the OS name and OS version of the machines to Prometheus.
[ ] Expose the provider-ID as a metric for each node.
[ ] Revisit metrics for other objects such as machineClasses and machineDeployments - https://github.com/gardener/machine-controller-manager/issues/443.
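
Descriptive attributes such as OS name, OS version, and provider-ID are commonly exposed as an "info"-style series: a gauge with a constant value of 1 whose labels carry the attributes. The sketch below shows that pattern only; the metric name `mcm_machine_info`, the label names, and the sample machine are hypothetical.

```python
def escape_label_value(value):
    """Escape backslash, double quote, and newline for the Prometheus text format."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def machine_info_line(machine):
    """Render one info-style sample (constant value 1) for a machine."""
    labels = {
        "node": machine["node"],
        "provider_id": machine["provider_id"],
        "os_name": machine["os_name"],
        "os_version": machine["os_version"],
    }
    rendered = ",".join(f'{k}="{escape_label_value(v)}"' for k, v in labels.items())
    return f"mcm_machine_info{{{rendered}}} 1"


# Hypothetical machine record.
line = machine_info_line({
    "node": "node-1",
    "provider_id": "aws:///eu-west-1a/i-0abc",
    "os_name": "gardenlinux",
    "os_version": "1.0",
})
print(line)
```

Note that a per-node label such as the provider-ID creates one series per machine, so the cardinality cost is worth checking before adopting this.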

Mar 30 '21 05:03 prashanth26

We need to introduce metrics for the following cases:

[ ] Machine drain: especially metrics related to pod eviction/deletion, number of times drain was evicted, and number of times the health check failed.
[ ] We also need to fix the failed machines metric, which repeatedly reports machines with lastOperation as failed (#456). It can also be confused with machines whose phase is Failed, so the name should be changed from failed_machines to something like failed_last_operation_machines. We need an alternate metric for users.
[ ] We need metrics for provider/driver API calls made by MCM (#483).
[ ] Machine phase metrics to accompany the proposed node condition metrics.
[ ] Add metrics for when the VM status is fine but the Node is unhealthy.
[ ] Add metrics to determine how many times the AWS Auto Recovery feature kicked in for our VMs, and whether such auto-recoveries occurred within our health check timeout. Related to the decision for: https://github.com/gardener/machine-controller-manager-provider-aws/issues/94
[ ] Rename the stale_machines_total metric to stale_machines_removed_total, https://github.com/gardener/machine-controller-manager/pull/808#discussion_r1218369226
[ ] Add documentation like that in etcd-druid and etcd-br.
[ ] Update requests_failed_total and requests_total in the different mcm-providers; they are currently exposed without being updated.
[ ] Update the new metrics introduced by PR https://github.com/gardener/machine-controller-manager/pull/842 . Make sure to also update g-extension, as it deploys the Prometheus scrape config.
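
For the failed-machines item in the list above, the two notions of "failed" could be counted separately so that neither is mistaken for the other. This is a sketch under assumptions: the dictionary keys and the `failed_phase_machines` name are hypothetical, and only `failed_last_operation_machines` comes from the proposal itself.

```python
def count_failures(machines):
    """Count machines by phase and by last-operation state independently."""
    phase_failed = sum(1 for m in machines if m.get("phase") == "Failed")
    last_op_failed = sum(
        1 for m in machines if m.get("last_operation_state") == "Failed"
    )
    return {
        "failed_phase_machines": phase_failed,
        "failed_last_operation_machines": last_op_failed,
    }


# Hypothetical machines: one running machine whose last operation failed,
# and one machine that is in the Failed phase.
counts = count_failures([
    {"phase": "Running", "last_operation_state": "Failed"},
    {"phase": "Failed", "last_operation_state": "Failed"},
    {"phase": "Running", "last_operation_state": "Successful"},
])
print(counts)
```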

Feb 20 '23 11:02 elankath

@elankath You have mentioned internal references in the public. Please check.

Feb 22 '23 10:02 gardener-robot


Improve Monitoring/Alerting/Metrics