Improve Monitoring/Alerting/Metrics

Open PadmaB opened this issue 6 years ago • 7 comments

Story

As a provider, I want timely alerts raised from the metrics so that I can make informed decisions.

Motivation

  • MCM exposes a number of metrics, such as the number of API calls to the cloud provider and the freeze status of MCM (#189)
  • Defining alerts based on these metrics will help Ops react in a timely manner whenever action is required
  • Challenges with Azure Cloud Provider during deletion of machines #200

Acceptance Criteria

  • [ ] Define alerts for the above situations to take required action

Definition of Done

  • [ ] Knowledge is distributed: Have you spread your knowledge in pair programming/code review?
  • [ ] Unit tests are provided: Have you written automated unit tests?
  • [ ] Integration tests are provided: Have you written automated integration tests?
  • [ ] Minimum API exposure: If you have added/changed public API, was it really necessary/is it minimal?
  • [ ] Operations guide: Have you updated the operations guide about ops-relevant changes?
  • [ ] User documentation: Have you updated the READMEs/docs/how-tos about user-relevant changes?

Possible metrics to add (rough work)

  • We could provide metrics on the number of machines in each status, so that filtering on status is possible (if not already exposed).
  • Metrics on the time a machine takes to join could be added; this would show the overall average joining time on any provider.
  • When MCM did a scale-up or scale-down, and when CA did.
  • Metrics that could help with typical DoD issues, like a node not joining.
  • How long each resource (VM, disk) took to get created, especially in Azure.
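The join-time idea above can be sketched roughly as follows. This is only an illustration in Python using the standard library; the `MachineRecord` type, its fields, and the sample timestamps are hypothetical and do not reflect MCM's actual Machine object or any existing metric.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List, Optional


@dataclass
class MachineRecord:
    # All fields are hypothetical; MCM's real Machine object looks different.
    provider: str
    created_at: float           # Unix time the machine was requested
    joined_at: Optional[float]  # Unix time the node registered, None if it never joined


def join_time_summary(records: List[MachineRecord]) -> Dict[str, dict]:
    """Per provider: average join time in seconds and count of machines that never joined."""
    durations: Dict[str, List[float]] = {}
    not_joined: Dict[str, int] = {}
    for rec in records:
        durations.setdefault(rec.provider, [])
        not_joined.setdefault(rec.provider, 0)
        if rec.joined_at is None:
            not_joined[rec.provider] += 1
        else:
            durations[rec.provider].append(rec.joined_at - rec.created_at)
    return {
        provider: {
            # None when no machine of this provider has joined yet.
            "avg_join_seconds": mean(vals) if vals else None,
            "not_joined": not_joined[provider],
        }
        for provider, vals in durations.items()
    }


# Made-up sample data for illustration only.
records = [
    MachineRecord("azure", 1000.0, 1300.0),
    MachineRecord("azure", 2000.0, 2500.0),
    MachineRecord("azure", 3000.0, None),
    MachineRecord("aws", 1000.0, 1120.0),
]
summary = join_time_summary(records)
```

In a real exporter the durations would more likely feed a histogram than a plain average, so that slow outliers stay visible; the average is used here only because it is what the bullet above mentions.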

PadmaB avatar Jan 24 '19 05:01 PadmaB

I have tried to at least expose a few crucial metrics into the Gardener Prometheus for now. Refer - https://github.com/gardener/gardener/pull/948.

However, before trying to create a dashboard and raise alerts, we will need to enhance the metrics so that all of them always return a value rather than a blank (as currently happens with mcm_cloud_api_requests_failed_total, mcm_cloud_api_requests_total, and mcm_machine_deployment_failed_machines). Refer - https://github.com/gardener/gardener/pull/948#issuecomment-485757761
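One possible reading of "always return values" is that every expected series is emitted with an explicit 0 until the first event occurs, so dashboards and alert expressions never hit a missing series. A minimal sketch of that idea, in Python with only the standard library and rendering the Prometheus text format by hand. The metric names come from the comment above; the label names (`provider`, `service`) and their values are assumptions made for illustration, not MCM's actual labels.

```python
from itertools import product

# Metric names are from the comment above; labels and values are hypothetical.
COUNTERS = ("mcm_cloud_api_requests_total", "mcm_cloud_api_requests_failed_total")
PROVIDERS = ("aws", "azure")
SERVICES = ("virtual_machine", "disk")


class ZeroInitCounters:
    """Counters whose label combinations are all pre-registered with value 0."""

    def __init__(self):
        self._values = {}
        for name, provider, service in product(COUNTERS, PROVIDERS, SERVICES):
            self._values[(name, provider, service)] = 0

    def inc(self, name, provider, service, amount=1):
        # Raises KeyError for a combination that was not pre-registered.
        self._values[(name, provider, service)] += amount

    def render(self):
        """Return all series, including untouched ones, in Prometheus text format."""
        lines = []
        for (name, provider, service), value in sorted(self._values.items()):
            lines.append(f'{name}{{provider="{provider}",service="{service}"}} {value}')
        return "\n".join(lines) + "\n"


counters = ZeroInitCounters()
counters.inc("mcm_cloud_api_requests_total", "azure", "disk")
exposition = counters.render()
```

The trade-off is that the set of label combinations has to be known up front; combinations that only appear at runtime would still start out missing.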

prashanth26 avatar Apr 24 '19 05:04 prashanth26

/touch /priority critical

prashanth26 avatar Oct 29 '20 13:10 prashanth26

Changing the roadmap classification as this ticket speaks of "ops" and MCM metrics. This is more internal than end user facing, although one can argue that MCM appears in our exposed monitoring. If you don't agree, please change back @hardikdr . It was just a gut feeling that this is maybe more relevant internally than externally.

vlerenc avatar Nov 10 '20 19:11 vlerenc

Sure, sounds good. The major part of it is for internal usage, and only an aspect is for end-users where we want to offer better observability for the worker-machines.

hardikdr avatar Nov 11 '20 03:11 hardikdr

Adding feedback from https://github.com/gardener/machine-controller-manager/issues/549, https://github.com/gardener/machine-controller-manager/issues/528

  • [ ] Add a metric to let end-users know why MCM has replaced a node.
  • [ ] Expose the OS name and OS version of the machines to Prometheus.
  • [ ] Expose the provider-ID as a metric for each node.
  • [ ] Revisit metrics for other objects such as machineClasses and machineDeployments - https://github.com/gardener/machine-controller-manager/issues/443.

prashanth26 avatar Mar 30 '21 05:03 prashanth26

We need to introduce metrics for following cases:

  • [ ] Machine drain: especially metrics related to pod eviction/deletion, number of times drain was evicted, number of times the health check failed.
  • [ ] We also need to fix the failed machines metric, which repeatedly reports machines with lastOperation as failed (#456). It can also be confused with machines whose phase is Failed, so the name should be changed from failed_machines to something like failed_last_operation_machines. We need an alternate metric for users.
  • [ ] We need metrics for provider/driver API calls made by MCM. #483
  • [ ] Machine Phase Metrics to accompany proposed Node Condition metrics
  • [ ] Add metrics for when the VM status is fine but the Node is unhealthy.
  • [ ] Add metrics to determine how many times the AWS Auto Recovery feature kicked in for our VMs, and whether such auto-recoveries occurred within our health check timeout. Related to the decision for: https://github.com/gardener/machine-controller-manager-provider-aws/issues/94
  • [ ] Update the stale_machines_total metric name to stale_machines_removed_total: https://github.com/gardener/machine-controller-manager/pull/808#discussion_r1218369226
  • [ ] Add documentation like that in etcd-druid and etcd-br
  • [ ] Update requests_failed_total and requests_total in the different mcm-providers; they are currently exposed without being updated.
  • [ ] Update the new metrics introduced by PR https://github.com/gardener/machine-controller-manager/pull/842. Make sure to also update g-extension, as it deploys the Prometheus scrape config.
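To illustrate why the rename from failed_machines to failed_last_operation_machines matters, here is a rough Python sketch that counts machines by phase separately from machines whose last operation failed. The record layout, the field names, and the `machines_by_phase` metric name are hypothetical; only `failed_last_operation_machines` is taken from the list above.

```python
from collections import Counter

# Hypothetical, simplified machine records; field names are illustrative only.
machines = [
    {"name": "m-1", "phase": "Running", "last_operation_state": "Successful"},
    {"name": "m-2", "phase": "Running", "last_operation_state": "Failed"},
    {"name": "m-3", "phase": "Failed", "last_operation_state": "Failed"},
    {"name": "m-4", "phase": "Pending", "last_operation_state": "Processing"},
]


def machine_gauges(machines):
    """Build one gauge per phase plus a separate gauge for failed last operations."""
    by_phase = Counter(m["phase"] for m in machines)
    failed_last_op = sum(1 for m in machines if m["last_operation_state"] == "Failed")
    gauges = {f'machines_by_phase{{phase="{p}"}}': n for p, n in sorted(by_phase.items())}
    gauges["failed_last_operation_machines"] = failed_last_op
    return gauges


gauges = machine_gauges(machines)
```

In this made-up sample, two machines have a failed last operation but only one is in phase Failed, which is the kind of mismatch a single metric named failed_machines would hide.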

elankath avatar Feb 20 '23 11:02 elankath

@elankath You have mentioned internal references in the public. Please check.

gardener-robot avatar Feb 22 '23 10:02 gardener-robot