cluster-api-provider-openstack

Configurable API server load balancer monitor

MPV opened this issue 2 years ago • 12 comments

/kind feature

Describe the solution you'd like

Currently, the OpenStack load balancer health monitor that CAPO creates for the API server has hardcoded settings, as seen in: https://github.com/kubernetes-sigs/cluster-api-provider-openstack/blob/944b2658fb11e8212aab43e5bbf112b15a8bb75a/pkg/cloud/services/loadbalancer/loadbalancer.go#L223-L230
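For reference, the linked lines boil down to roughly the following (a sketch only, not the exact CAPO code; the gophercloud call, the monitor type and the variable names are my assumptions):

```go
package loadbalancer

import (
	"github.com/gophercloud/gophercloud"
	"github.com/gophercloud/gophercloud/openstack/loadbalancer/v2/monitors"
)

// createAPIServerMonitor approximates how the health monitor is created today:
// the timing values are fixed and cannot be set via the cluster spec.
func createAPIServerMonitor(client *gophercloud.ServiceClient, name, poolID string) (*monitors.Monitor, error) {
	return monitors.Create(client, monitors.CreateOpts{
		Name:       name,
		PoolID:     poolID,
		Type:       "TCP", // assumed; the relevant part is the fixed timings below
		Delay:      30,    // seconds between health checks
		Timeout:    5,     // seconds to wait for a check to succeed
		MaxRetries: 3,     // failed checks before DOWN, passed checks before UP again
	}).Extract()
}
```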

In other words, Delay: 30, Timeout: 5, MaxRetries: 3 means:

After 105 seconds (= 3 * (30 + 5)) of downtime, an API server pool member will be marked as down, and once it is back up again, another 3 checks (MaxRetries) are needed before it is added back into the pool, i.e. 105 more seconds.

...which in turn means that if all API server members become unavailable at the same time for 1.5 minutes, the total downtime will be at least 3.5 minutes (2 * 3 * (30 + 5) = 210 seconds).

This may sound odd, but it is the documented behavior of max-retries, which applies both to taking members down and to bringing them back up, as per https://docs.openstack.org/octavia/queens/user/guides/basic-cookbook.html#heath-monitor-options (quoted below):

max-retries: Number of subsequent health checks a given back-end server must fail before it is considered down, or that a failed back-end server must pass to be considered up again.
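To make the arithmetic above concrete, here is a minimal worked example using the same simplifying assumption as the description (each check cycle costs Delay + Timeout seconds):

```go
package main

import "fmt"

func main() {
	delay, timeout, maxRetries := 30, 5, 3 // current hardcoded values

	markDown := maxRetries * (delay + timeout) // time until a member is marked DOWN: 105s
	markUp := maxRetries * (delay + timeout)   // time until it is marked UP again: 105s

	// Worst-case extra downtime on top of the member's own outage: 210s, i.e. ~3.5 minutes.
	fmt.Printf("extra downtime ≈ %ds (%.1f min)\n", markDown+markUp, float64(markDown+markUp)/60)
}
```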

Anything else you would like to add:

  1. We had a brief outage related to this when all our control-plane nodes were accidentally running on the same OpenStack Nova/hypervisor host, which had network issues/downtime.
    • We'll soon try out the hard/soft anti-affinity policies (see the sketch just below this list), which will decrease the risk of this kind of failure, but faster recovery for API server LB pool members might still help.
  2. I haven't yet looked at if/how other CAPI providers solve this.
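As an aside, here is a minimal sketch (an assumption for illustration, not CAPO configuration) of the anti-affinity mitigation mentioned above: creating an OpenStack server group with the soft-anti-affinity policy via gophercloud.

```go
package infra

import (
	"github.com/gophercloud/gophercloud"
	"github.com/gophercloud/gophercloud/openstack/compute/v2/extensions/servergroups"
)

// createControlPlaneServerGroup creates a server group that Nova uses to spread
// its members across hypervisors on a best-effort basis. The "soft-anti-affinity"
// policy requires compute API microversion >= 2.15.
func createControlPlaneServerGroup(compute *gophercloud.ServiceClient) (*servergroups.ServerGroup, error) {
	compute.Microversion = "2.15"
	return servergroups.Create(compute, servergroups.CreateOpts{
		Name:     "control-plane-anti-affinity", // hypothetical name
		Policies: []string{"soft-anti-affinity"},
	}).Extract()
}
```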

MPV avatar Apr 26 '22 14:04 MPV

We'll soon try out the hard/soft anti-affinity policies

So it seems we should recommend anti-affinity anyhow (I know OCP might do this, but CAPI might not?)

...which in turn means that if all API server members become unavailable at the same time for 1.5 minutes, the total downtime will be at least 3.5 minutes (2 * 3 * (30 + 5) = 210 seconds).

So your request is to consider decreasing max_retries so that a member will at least be considered up/running again once we detect it's healthy? So that the downtime will be smaller than with the current settings?

jichenjc avatar Apr 27 '22 02:04 jichenjc

...which in turn means that if all API server members become unavailable at the same time for 1.5 minutes, the total downtime will be at least 3.5 minutes (2 * 3 * (30 + 5) = 210 seconds).

So your request is to consider decreasing max_retries so that a member will at least be considered up/running again once we detect it's healthy? So that the downtime will be smaller than with the current settings?

Yes, and possibly also discussing what sane defaults would be.

On our side we're only running a few clusters using CAPO. I suspect there may be others with more experience who could chime in with related thoughts/experiences?

One related challenge is that CAPI/CAPO queries the API server for cluster details in order to manage the cluster, so downtime of the Kube API LB makes it harder for CAPI/CAPO to repair the cluster.

MPV avatar Apr 27 '22 18:04 MPV

/good-first-issue

seanschneeweiss avatar Jun 15 '22 13:06 seanschneeweiss

@seanschneeweiss: This request has been marked as suitable for new contributors.

Guidelines

Please ensure that the issue body includes answers to the following questions:

  • Why are we solving this issue?
  • To address this issue, are there any code changes? If there are code changes, what needs to be done in the code and what places can the assignee treat as reference points?
  • Does this issue have zero to low barrier of entry?
  • How can the assignee reach out to you for help?

For more details on the requirements of such an issue, please see here and ensure that they are met.

If this request no longer meets these requirements, the label can be removed by commenting with the /remove-good-first-issue command.

In response to this:

/good-first-issue

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Jun 15 '22 13:06 k8s-ci-robot

IMO it would be great to align the names in the spec as closely as possible with the ones that already exist in CPI.

bavarianbidi avatar Jun 24 '22 05:06 bavarianbidi

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Sep 22 '22 05:09 k8s-triage-robot

/remove-lifecycle stale

bavarianbidi avatar Sep 22 '22 06:09 bavarianbidi

So just to get a ballpark estimate of the scope here.

Looking at https://github.com/kubernetes/cloud-provider-openstack/blob/master/docs/openstack-cloud-controller-manager/using-openstack-cloud-controller-manager.md#load-balancer it seems we need to add:

  • create-monitor
  • monitor-delay
  • monitor-max-retries
  • monitor-timeout

To the lb type: https://github.com/kubernetes-sigs/cluster-api-provider-openstack/blob/main/api/v1alpha6/types.go#L230-L237

But these would also need to be added to https://github.com/kubernetes-sigs/cluster-api-provider-openstack/blob/main/api/v1alpha6/types.go#L309-L316 to be able to explicitly set the configuration for the APIServerLoadBalancer. Then of course we need to ensure there is test coverage, that the data is passed to the correct place, etc. Anything I missed? Will it require anything from CAPI?
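To make the scope concrete, a rough illustration only (all field names here are hypothetical, not the merged API) of mapping those options onto the CAPO types could look something like:

```go
package v1alpha6 // hypothetical placement; existing fields of the real type are elided

type APIServerLoadBalancer struct {
	// ... existing fields ...

	// Optional health monitor settings; zero values could mean "keep the
	// current hardcoded defaults" (Delay=30, Timeout=5, MaxRetries=3).
	MonitorDelay      int `json:"monitorDelay,omitempty"`
	MonitorTimeout    int `json:"monitorTimeout,omitempty"`
	MonitorMaxRetries int `json:"monitorMaxRetries,omitempty"`
}
```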

oscr avatar Oct 10 '22 11:10 oscr

Sounds like a good plan. I am wondering whether we could use a KV list instead of individual properties, e.g. []properties, so that we can easily extend it later? You could provide 2 or 3 or 0 values and we would honor them?

jichenjc avatar Oct 12 '22 03:10 jichenjc

@jichenjc What do you think about an optional struct, similar to how SubnetFilter is defined here: https://github.com/kubernetes-sigs/cluster-api-provider-openstack/blob/02a31730ef1ae7962f1cc36c090b624025a2f5bc/api/v1alpha6/types.go#L88 Then it can be extended in the future, and users can ignore it unless they want to set it.
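A hypothetical sketch of that optional-struct approach (names are illustrative only); leaving it nil would keep today's hardcoded defaults, so existing clusters are unaffected:

```go
package v1alpha6 // hypothetical placement; existing fields of the real type are elided

// APIServerLoadBalancerMonitor is a hypothetical optional block, in the spirit
// of how SubnetFilter is an optional struct users only set when they need it.
type APIServerLoadBalancerMonitor struct {
	Delay      *int `json:"delay,omitempty"`
	Timeout    *int `json:"timeout,omitempty"`
	MaxRetries *int `json:"maxRetries,omitempty"`
}

type APIServerLoadBalancer struct {
	// ... existing fields ...

	// Monitor is optional; when nil, the current hardcoded values are used.
	Monitor *APIServerLoadBalancerMonitor `json:"monitor,omitempty"`
}
```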

oscr avatar Oct 12 '22 08:10 oscr

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Feb 08 '23 08:02 k8s-triage-robot

Related (for reference):

  • https://github.com/kubernetes-sigs/cluster-api-provider-openstack/issues/1374
    • https://github.com/kubernetes-sigs/cluster-api-provider-openstack/pull/1375

MPV avatar Nov 15 '23 09:11 MPV