cluster-api-provider-vsphere

Exposed Prometheus metrics for unreachable workload clusters

Open perithompson opened this issue 4 years ago • 4 comments

/kind feature

Describe the solution you'd like

When a management cluster cannot reach a workload cluster, it may be necessary to pause reconciliation of that cluster until connectivity is restored, so that CAPI does not keep trying to reconcile something it cannot reach. There is an issue (#5394) that outlines some gaps in the documentation around this.

One problem operators will face is knowing when this state occurs. A simple way to detect it is a set of metrics for workload clusters. Exposing the count of workload clusters, the count of paused clusters, and the count of unreachable clusters to Prometheus would allow alerting on a change in the unreachable count. Operators could then implement automation or SOPs to check and pause clusters that cannot reconcile for any reason.
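To make the request concrete, here is a minimal sketch of the three gauges rendered in the Prometheus text exposition format. It is not how CAPI or CAPV implements metrics. The `WorkloadCluster` type, the `reachable` flag (standing in for some connectivity probe), and the `capi_workload_clusters*` metric names are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class WorkloadCluster:
    """Hypothetical summary of one workload cluster (not a CAPI type)."""
    name: str
    paused: bool      # assumed to mirror whether reconciliation is paused
    reachable: bool   # assumed result of a connectivity probe


def count_clusters(clusters: Iterable[WorkloadCluster]) -> Dict[str, int]:
    """Return the total, paused, and unreachable cluster counts."""
    items = list(clusters)
    return {
        "total": len(items),
        "paused": sum(1 for c in items if c.paused),
        "unreachable": sum(1 for c in items if not c.reachable),
    }


# (metric name, key into the counts dict, help text); names are hypothetical.
METRICS = (
    ("capi_workload_clusters", "total", "Number of workload clusters."),
    ("capi_workload_clusters_paused", "paused", "Number of paused workload clusters."),
    ("capi_workload_clusters_unreachable", "unreachable", "Number of unreachable workload clusters."),
)


def render_metrics(clusters: Iterable[WorkloadCluster]) -> str:
    """Render the counts as gauges in the Prometheus text exposition format."""
    counts = count_clusters(clusters)
    lines = []
    for metric, key, help_text in METRICS:
        lines.append(f"# HELP {metric} {help_text}")
        lines.append(f"# TYPE {metric} gauge")
        lines.append(f"{metric} {counts[key]}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    fleet = [
        WorkloadCluster("prod-a", paused=False, reachable=True),
        WorkloadCluster("prod-b", paused=True, reachable=False),
        WorkloadCluster("edge-c", paused=False, reachable=False),
    ]
    print(render_metrics(fleet), end="")
```

With gauges like these, an alert could fire on an expression such as `capi_workload_clusters_unreachable > 0`, again assuming that hypothetical metric name.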

Anything else you would like to add:

Environment:

perithompson avatar Oct 27 '21 09:10 perithompson

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Jan 25 '22 10:01 k8s-triage-robot

@perithompson This seems like something that will sit in the CAPI repo. I am happy to keep this one around if CAPV would need to do something specifically, but I think the entire change will rest directly in CAPI.

srm09 avatar Jan 28 '22 22:01 srm09

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Feb 27 '22 23:02 k8s-triage-robot

/remove-lifecycle rotten
/lifecycle frozen

srm09 avatar Mar 01 '22 01:03 srm09

This may be resolved, or partially resolved, by #2061.

chrischdi avatar Aug 17 '23 17:08 chrischdi

I think this should be closed in favor of a solution on the CAPI side - there's a related issue here: https://github.com/kubernetes-sigs/cluster-api/issues/5510

killianmuldoon avatar Aug 17 '23 18:08 killianmuldoon

+1

/close

sbueringer avatar Aug 17 '23 18:08 sbueringer

@sbueringer: Closing this issue.

In response to this:

+1

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Aug 17 '23 18:08 k8s-ci-robot