cluster-api-provider-vsphere Exposed Prometheus metrics for unreachable workload clusters


/kind feature

Describe the solution you'd like:

When a management cluster cannot reach a workload cluster, it may be necessary to pause reconciliation of that cluster until connectivity is restored, so that CAPI does not keep trying to reconcile something it cannot reach. There is an issue (#5394) that outlines some gaps in the documentation around this.

One problem operators will face is knowing when this state occurs. A simple way to detect it is a metric that tracks workload clusters. Exposing the total number of workload clusters, the number of paused clusters, and the number of unreachable clusters to Prometheus would make it possible to alert on a change in the unreachable cluster count. Operators could then implement automation or SOPs to check and pause clusters that cannot be reconciled for any reason.
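To make the request concrete, here is a rough sketch of the three gauges described above, rendered in the Prometheus text exposition format. It is only an illustration and uses the Go standard library: the `clusterState` type, the `reachable` probe result, and the `capi_workload_clusters*` metric names are all hypothetical and do not exist in CAPI or CAPV today. A real controller would more likely register gauges with its existing metrics registry instead of formatting the text by hand.

```go
package main

import (
	"fmt"
	"strings"
)

// clusterState is a hypothetical summary of one workload cluster
// as seen from the management cluster.
type clusterState struct {
	name      string
	paused    bool // assumed to mirror the Cluster's spec.paused field
	reachable bool // assumed result of some connectivity probe
}

// clusterCounts holds the three values the issue asks to expose.
type clusterCounts struct {
	total       int
	paused      int
	unreachable int
}

// countClusters tallies total, paused, and unreachable clusters.
func countClusters(clusters []clusterState) clusterCounts {
	var c clusterCounts
	for _, cl := range clusters {
		c.total++
		if cl.paused {
			c.paused++
		}
		if !cl.reachable {
			c.unreachable++
		}
	}
	return c
}

// renderMetrics formats the counts as Prometheus gauges.
// The metric names are placeholders chosen for this sketch.
func renderMetrics(c clusterCounts) string {
	var b strings.Builder
	gauge := func(name, help string, value int) {
		fmt.Fprintf(&b, "# HELP %s %s\n", name, help)
		fmt.Fprintf(&b, "# TYPE %s gauge\n", name)
		fmt.Fprintf(&b, "%s %d\n", name, value)
	}
	gauge("capi_workload_clusters", "Number of workload clusters.", c.total)
	gauge("capi_workload_clusters_paused", "Number of paused workload clusters.", c.paused)
	gauge("capi_workload_clusters_unreachable", "Number of unreachable workload clusters.", c.unreachable)
	return b.String()
}

func main() {
	clusters := []clusterState{
		{name: "prod-a", paused: false, reachable: true},
		{name: "prod-b", paused: true, reachable: false},
		{name: "dev-a", paused: false, reachable: false},
	}
	fmt.Print(renderMetrics(countClusters(clusters)))
}
```

With gauges like these, an alert could fire whenever the unreachable gauge is above zero while the paused gauge has not changed, which is the point where an operator or automation would step in and pause the affected cluster.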

Anything else you would like to add:

Environment:

Oct 27 '21 09:10 perithompson

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue or PR as fresh with /remove-lifecycle stale
Mark this issue or PR as rotten with /lifecycle rotten
Close this issue or PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

Jan 25 '22 10:01 k8s-triage-robot

@perithompson This seems like something that will sit in the CAPI repo. I am happy to keep this one around if CAPV would need to do something specifically, but I think the entire change will rest directly in CAPI.

Jan 28 '22 22:01 srm09

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue or PR as fresh with /remove-lifecycle rotten
Close this issue or PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

Feb 27 '22 23:02 k8s-triage-robot

/remove-lifecycle rotten
/lifecycle frozen

Mar 01 '22 01:03 srm09

This may be resolved, or partially resolved, by #2061.

Aug 17 '23 17:08 chrischdi

I think this should be closed in favor of a solution on the CAPI side - there's a related issue here: https://github.com/kubernetes-sigs/cluster-api/issues/5510

Aug 17 '23 18:08 killianmuldoon

+1

/close

Aug 17 '23 18:08 sbueringer

@sbueringer: Closing this issue.

In response to this:

+1

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Aug 17 '23 18:08 k8s-ci-robot
