cluster-api-provider-openstack
Configurable API server load balancer monitor
/kind feature
Describe the solution you'd like
Currently the OpenStack loadbalancer monitor for the API server created by CAPO has hardcoded settings, as seen in: https://github.com/kubernetes-sigs/cluster-api-provider-openstack/blob/944b2658fb11e8212aab43e5bbf112b15a8bb75a/pkg/cloud/services/loadbalancer/loadbalancer.go#L223-L230
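For reference, a paraphrased sketch (not a verbatim copy of the linked code) of what those hardcoded values amount to when the monitor is created via gophercloud's Octavia v2 monitors API:

```go
// Paraphrased sketch of today's monitor creation with the three hardcoded values.
package loadbalancer

import (
	"github.com/gophercloud/gophercloud"
	"github.com/gophercloud/gophercloud/openstack/loadbalancer/v2/monitors"
)

func createAPIServerMonitor(client *gophercloud.ServiceClient, name, poolID string) (*monitors.Monitor, error) {
	opts := monitors.CreateOpts{
		Name:       name,
		PoolID:     poolID,
		Type:       "TCP",
		Delay:      30, // seconds between health checks (hardcoded)
		Timeout:    5,  // seconds to wait for a check response (hardcoded)
		MaxRetries: 3,  // failed (or passed) checks before a member changes state (hardcoded)
	}
	return monitors.Create(client, opts).Extract()
}
```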
In other words, Delay: 30, Timeout: 5, MaxRetries: 3 means:
After 105 (= 3*(30+5)) seconds of downtime, API server pool members will be marked as down, and when a member is back up again, another 3 attempts/MaxRetries are needed before it is added back into the pool, i.e. 105 more seconds.
...which in turn means that if all API server members become unavailable at the same time for 1.5 minutes, the total downtime will be at least 3.5 minutes (2*3*(30+5) seconds).
This may sound weird, but it's the official behavior of max-retries to apply to taking members both down and up, as per https://docs.openstack.org/octavia/queens/user/guides/basic-cookbook.html#heath-monitor-options (quoted below):
max-retries: Number of subsequent health checks a given back-end server must fail before it is considered down, or that a failed back-end server must pass to be considered up again.
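To make the arithmetic explicit, here is a rough sketch of the worst case described above (a simplification, not an exact model of Octavia's health-monitor state machine):

```go
// Back-of-the-envelope worst case with the current hardcoded settings.
package main

import "fmt"

func main() {
	delay, timeout, maxRetries := 30, 5, 3

	// Marking a member DOWN: maxRetries consecutive failed checks,
	// each taking up to delay + timeout seconds.
	markDown := maxRetries * (delay + timeout) // 105 s

	// Per the Octavia docs quoted above, max-retries also applies to
	// bringing a member back UP, so recovery costs roughly the same again.
	markUp := maxRetries * (delay + timeout) // 105 s

	fmt.Printf("down after ~%ds, back up after ~%ds more, total ~%ds (~3.5 min)\n",
		markDown, markUp, markDown+markUp)
}
```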
Anything else you would like to add:
- We had a brief outage related to this when all our control-plane nodes were accidentally running on the same OpenStack Nova/hypervisor host, which had network issues/downtime.
- We'll soon try out the hard/soft anti-affinity policies, which will decrease the risk of this kind of failure, but faster recovery for API server LB pool members might still help overall.
- I haven't yet looked at if/how other CAPI providers solve this.
We'll soon try out the hard/soft anti-affinity policies
so it seems we should recommend anti-affinity anyhow (I know OCP might do this, but CAPI might not?)
...which in turn means that if all API server members become unavailable at the same time for 1.5 minutes, the total downtime will be at least 3.5 minutes (2*3*(30+5) seconds).
So your request is to consider decreasing max_retries so that a member is at least considered up and running as soon as we detect it's healthy again, i.e. so that the downtime is smaller than with the current settings?
Yes, and possibly discussing what sane defaults would mean.
From our side we're only running a few clusters using CAPO. I suspect there may be others with more experience who could chime in with related thoughts/experience?
One related challenge with CAPI/CAPO is that it queries the API server for cluster details in order to manage the cluster, so downtime of the Kube API LB makes it harder for CAPI/CAPO to fix the cluster.
/good-first-issue
@seanschneeweiss: This request has been marked as suitable for new contributors.
Guidelines
Please ensure that the issue body includes answers to the following questions:
- Why are we solving this issue?
- To address this issue, are there any code changes? If there are code changes, what needs to be done in the code and what places can the assignee treat as reference points?
- Does this issue have zero to low barrier of entry?
- How can the assignee reach out to you for help?
For more details on the requirements of such an issue, please see here and ensure that they are met.
If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-good-first-issue
command.
In response to this:
/good-first-issue
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
IMO it would be great to align the names in the spec as closely as possible with the already existing ones in CPI.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
So just to get a ballpark estimate of the scope here.
Looking at https://github.com/kubernetes/cloud-provider-openstack/blob/master/docs/openstack-cloud-controller-manager/using-openstack-cloud-controller-manager.md#load-balancer it seems we need to add:
- create-monitor
- monitor-delay
- monitor-max-retries
- monitor-timeout
To the lb type: https://github.com/kubernetes-sigs/cluster-api-provider-openstack/blob/main/api/v1alpha6/types.go#L230-L237
But it would also need to be added to: https://github.com/kubernetes-sigs/cluster-api-provider-openstack/blob/main/api/v1alpha6/types.go#L309-L316 to be able to explicitly set the configuration for the APIServerLoadBalancer. Then of course we need to ensure there is test coverage, that the data is passed to the correct place, etc. Anything I missed? Will it require anything from CAPI?
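To illustrate the "data is passed to the correct place" part, a hypothetical sketch of how configured values could fall back to today's defaults when unset (the delay/timeout/maxRetries parameters are made-up names, not the actual API):

```go
// Hypothetical wiring sketch only: the optional spec values do not exist yet.
package loadbalancer

import "github.com/gophercloud/gophercloud/openstack/loadbalancer/v2/monitors"

func monitorCreateOpts(name, poolID string, delay, timeout, maxRetries *int) monitors.CreateOpts {
	// intOrDefault returns *v when set, otherwise the current hardcoded default.
	intOrDefault := func(v *int, def int) int {
		if v != nil {
			return *v
		}
		return def
	}
	return monitors.CreateOpts{
		Name:       name,
		PoolID:     poolID,
		Type:       "TCP",
		Delay:      intOrDefault(delay, 30),
		Timeout:    intOrDefault(timeout, 5),
		MaxRetries: intOrDefault(maxRetries, 3),
	}
}
```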
Sounds like a good plan. I am wondering whether we could use a KV map instead of individual properties, e.g. []properties, so that we can easily extend it? You could provide 2, 3 or 0 values and we honor whatever is set?
@jichenjc What do you think about an optional struct, similar to how SubnetFilter is defined here: https://github.com/kubernetes-sigs/cluster-api-provider-openstack/blob/02a31730ef1ae7962f1cc36c090b624025a2f5bc/api/v1alpha6/types.go#L88 That way it can be extended in the future and users can ignore it unless they want to set it.
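Something like this is what the optional-struct approach could look like (purely illustrative; the type and field names are made up, not the merged API design):

```go
// Hypothetical sketch of the optional-struct idea, analogous to SubnetFilter.
package v1alpha6

// APIServerLoadBalancerMonitor groups the Octavia health monitor knobs.
type APIServerLoadBalancerMonitor struct {
	Delay      *int `json:"delay,omitempty"`      // seconds between checks
	Timeout    *int `json:"timeout,omitempty"`    // seconds before a check times out
	MaxRetries *int `json:"maxRetries,omitempty"` // checks before a member changes state
}

// APIServerLoadBalancer shows only the hypothetical new field; existing
// fields are elided here.
type APIServerLoadBalancer struct {
	// Monitor is optional; when nil, CAPO keeps the current defaults.
	Monitor *APIServerLoadBalancerMonitor `json:"monitor,omitempty"`
}
```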
The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.
This bot triages PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the PR is closed
You can:
- Mark this PR as fresh with /remove-lifecycle stale
- Close this PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
Related (for reference):
- https://github.com/kubernetes-sigs/cluster-api-provider-openstack/issues/1374
- https://github.com/kubernetes-sigs/cluster-api-provider-openstack/pull/1375