
additionalPorts combined with allowedCidrs causes a reconcile loop and LB provisioning never completes

Open huxcrux opened this issue 2 years ago • 17 comments

/kind bug

What steps did you take and what happened: When creating a cluster that defines both one or more additionalPorts and at least one CIDR under allowedCidrs, the LB never becomes fully functional and the cluster is never created.

It seems like there is a missing check for whether the LB has left the PENDING_UPDATE state, causing the API to respond with:

capo-system/capo-controller-manager-7c78d95c95-7qphw[manager]: I0926 13:49:42.720981 1 recorder.go:104] "events: Failed to create listener k8s-clusterapi-cluster-default-hux-lab1-kubeapi-31991: Expected HTTP response code [201 202] when accessing [POST https://<OPS-URL>:9876/v2.0/lbaas/listeners], but got 409 instead\n{"faultcode": "Client", "faultstring": "Load Balancer 7d690fcb-ea67-4462-a49e-dda806c8f792 is immutable and cannot be updated.", "debuginfo": null}" type="Warning" object={Kind:OpenStackCluster Namespace:default Name:hux-lab1 UID:d973908d-ccd0-4f5b-9f9f-d628143d021a APIVersion:infrastructure.cluster.x-k8s.io/v1alpha7 ResourceVersion:33567788 FieldPath:} reason="Failedcreatelistener"

I have also tested additionalPorts and allowedCidrs one by one and each works as intended; the problem only occurs when both are used at the same time.

Also worth noting: the allowedCidrs option is applied to the LB, and I have verified that the LB's security groups contain the correct rules, so this appears to be a race condition during provisioning.

What did you expect to happen: The flow seems to be:

  1. Create LB
  2. Create Listener
  3. If allowedCidrs is defined update the listener with the allowedCidrs list.
  4. Create the 2nd listener (this is done without checking whether the LB has left the PENDING_UPDATE state, causing an API failure; CAPO then allocates a new FIP for the LB and tries reconciling the listeners again, but since the LB already has a FIP attached it continues to loop in this state)
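The missing gate described in step 4 would, in rough terms, look like the poll loop below. This is a self-contained Go sketch with a mocked status getter, not CAPO's actual code; the real provider would read the LB's provisioning_status from the Octavia API between each listener operation:

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// getStatus stands in for an Octavia "show load balancer" call; the real
// provider would fetch the LB's provisioning_status from the API.
type getStatus func() (string, error)

// waitForLBActive polls until the LB's provisioning_status is ACTIVE again.
// Octavia moves the LB ACTIVE -> PENDING_UPDATE -> ACTIVE on every mutation
// (including listener changes), and any request made while it is still
// PENDING_UPDATE fails with 409 "immutable".
func waitForLBActive(get getStatus, interval time.Duration, attempts int) error {
	for i := 0; i < attempts; i++ {
		status, err := get()
		if err != nil {
			return err
		}
		switch status {
		case "ACTIVE":
			return nil
		case "ERROR":
			return errors.New("load balancer entered ERROR state")
		}
		time.Sleep(interval)
	}
	return errors.New("timed out waiting for load balancer to become ACTIVE")
}

func main() {
	// Simulate an LB that stays PENDING_UPDATE for the first two polls.
	calls := 0
	get := func() (string, error) {
		calls++
		if calls < 3 {
			return "PENDING_UPDATE", nil
		}
		return "ACTIVE", nil
	}
	fmt.Println(waitForLBActive(get, time.Millisecond, 10))
}
```

Calling a gate like this before every create/update of a listener (step 2 and again before step 4) would serialize the mutations instead of racing them.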

Anything else you would like to add: A gist with complete output: https://gist.github.com/huxcrux/7c288f1c0b045de67eac1beaf3211e6e I redacted IPs and OpenStack API URIs. Every time FIP attachment fails, a new FIP has been allocated.

Environment:

  • Cluster API Provider OpenStack version (Or git rev-parse HEAD if manually built): 0.8.0
  • Cluster-API version: 1.5.2
  • OpenStack version: Ussuri
  • Minikube/KIND version: Kind 0.20.0
  • Kubernetes version (use kubectl version): 1.28.1
  • OS (e.g. from /etc/os-release): Ubuntu 20.04

huxcrux avatar Sep 27 '23 08:09 huxcrux

Thanks. Does it succeed eventually when it retries?

mdbooth avatar Sep 27 '23 10:09 mdbooth

We're definitely leaking floating IPs every time this fails, and in a way which looks like it might prevent the cluster ever coming up. I think that's the most serious issue here.

mdbooth avatar Sep 27 '23 10:09 mdbooth

We need to robustify this entire method against incremental failures. Unfortunately I don't have time to work on this myself right now, but if you are able to work on it or can find somebody else I can help.

My initial thoughts:

openStackCluster.Status.APIServerLoadBalancer is populated in one shot at the end of the method. It could/should be populated incrementally as the lb and floating IP are created.

Additionally (due to potential failure to write status update) we should take some measure to check if the FIP was previously created when creating a new FIP.
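A minimal sketch of that lookup-before-create idea. Everything here (the floatingIP struct, getOrCreateFIP, the description tag) is hypothetical and only illustrates the shape of the check; the real implementation would query Neutron for existing floating IPs matching the cluster's tag or description:

```go
package main

import "fmt"

// floatingIP is a minimal stand-in for an OpenStack floating IP record.
type floatingIP struct {
	IP          string
	Description string
}

// getOrCreateFIP looks for an existing floating IP carrying the cluster's
// description tag before allocating a new one. Without this check, a failed
// status write after allocation means the next reconcile allocates another
// FIP, leaking one per failure.
func getOrCreateFIP(existing []floatingIP, desc string, create func() floatingIP) floatingIP {
	for _, fip := range existing {
		if fip.Description == desc {
			return fip // reuse the FIP allocated by a previous reconcile
		}
	}
	return create()
}

func main() {
	existing := []floatingIP{
		{IP: "198.51.100.7", Description: "k8s-clusterapi-cluster-default-hux-lab1"},
	}
	fip := getOrCreateFIP(existing, "k8s-clusterapi-cluster-default-hux-lab1", func() floatingIP {
		return floatingIP{IP: "203.0.113.9", Description: "k8s-clusterapi-cluster-default-hux-lab1"}
	})
	fmt.Println(fip.IP)
}
```

Combined with writing status incrementally, this makes the FIP step idempotent even when the status update itself fails.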

mdbooth avatar Sep 27 '23 11:09 mdbooth

Thanks. Does it succeed eventually when it retries?

No, after the first failure it never recovers without me manually modifying the LB.

huxcrux avatar Sep 27 '23 12:09 huxcrux

We need to robustify this entire method against incremental failures. Unfortunately I don't have time to work on this myself right now, but if you are able to work on it or can find somebody else I can help.

My initial thoughts:

openStackCluster.Status.APIServerLoadBalancer is populated in one shot at the end of the method. It could/should be populated incrementally as the lb and floating IP are created.

Additionally (due to potential failure to write status update) we should take some measure to check if the FIP was previously created when creating a new FIP.

Sounds like a good way forward. Sadly I do not have much time during the upcoming week or so; if things change I'll post an update.

huxcrux avatar Sep 27 '23 12:09 huxcrux

Possibly related issue that is also about leaking floating IPs: https://github.com/kubernetes-sigs/cluster-api-provider-openstack/issues/1632

seanschneeweiss avatar Oct 20 '23 08:10 seanschneeweiss

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Jan 30 '24 18:01 k8s-triage-robot

Is there any chance this was fixed by https://github.com/kubernetes-sigs/cluster-api-provider-openstack/pull/1829?

mdbooth avatar Jan 30 '24 20:01 mdbooth

Is there any chance this was fixed by #1829?

I think it's related. I will verify this tomorrow :)

huxcrux avatar Jan 31 '24 15:01 huxcrux

Is there any chance this was fixed by #1829?

Sadly this does not seem to fix the problem. I think the code change in the PR might contain a bug where the patch fails because the status object does not exist.

EDIT: After more testing I think this is the same problem that https://github.com/kubernetes-sigs/cluster-api-provider-openstack/pull/1842 is trying to resolve. If I manually set ready to false, this code seems to work.

I0201 10:05:29.343878       1 recorder.go:104] "events: Failed to create listener k8s-clusterapi-cluster-default-hux-lab1-kubeapi-31991: Expected HTTP response code [201 202] when accessing [POST https://ops.elastx.cloud:9876/v2.0/lbaas/listeners], but got 409 instead\n{\"faultcode\": \"Client\", \"faultstring\": \"Load Balancer cb8e830a-6cc8-4a0f-b854-270905bc43d1 is immutable and cannot be updated.\", \"debuginfo\": null}" type="Warning" object={"kind":"OpenStackCluster","namespace":"default","name":"hux-lab1","uid":"f9602d51-b6a4-4a38-be29-ca7dff70cd01","apiVersion":"infrastructure.cluster.x-k8s.io/v1alpha8","resourceVersion":"19401"} reason="Failedcreatelistener"
E0201 10:05:29.354438       1 controller.go:329] "Reconciler error" err=<
	[failed to reconcile load balancer: Expected HTTP response code [201 202] when accessing [POST https://ops.elastx.cloud:9876/v2.0/lbaas/listeners], but got 409 instead
	{"faultcode": "Client", "faultstring": "Load Balancer cb8e830a-6cc8-4a0f-b854-270905bc43d1 is immutable and cannot be updated.", "debuginfo": null}, error patching OpenStackCluster default/hux-lab1: OpenStackCluster.infrastructure.cluster.x-k8s.io "hux-lab1" is invalid: status.ready: Required value]
 > controller="openstackcluster" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="OpenStackCluster" OpenStackCluster="default/hux-lab1" namespace="default" name="hux-lab1" reconcileID="edb06052-e2a4-4485-9bd9-01a30caa8e47"

I have identified two problems with the loadbalancer and allowedCIDRs logic.

  1. The first is the check that waits for the LB to become active again. I do not know if it is too quick or something, but it seems to pass before the changes are applied. The code responsible for the check can be found here: https://github.com/kubernetes-sigs/cluster-api-provider-openstack/blob/5cc483bfc6eae8a8b8a67b32e9b7af0bafa473ca/pkg/cloud/services/loadbalancer/loadbalancer.go#L294-L297
  2. The list compare seems to be broken: even if the list of IPs matches, a new update is triggered because the list is not sorted. The code responsible for this can be found here: https://github.com/kubernetes-sigs/cluster-api-provider-openstack/blob/5cc483bfc6eae8a8b8a67b32e9b7af0bafa473ca/pkg/cloud/services/loadbalancer/loadbalancer.go#L282

As a workaround I tried putting a sleep just after the code linked under the first issue; 10 seconds seems to be just enough for everything to work. However, if the second problem is resolved I think the first one will resolve itself after a new reconcile, though this requires an additional reconcile per listener because the allowedCIDR patch is applied to each LB listener.
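The second problem amounts to comparing the two CIDR lists as sets rather than as ordered slices. A stdlib-only Go sketch of an order-insensitive compare (cidrsEqual is a hypothetical helper, not the function in loadbalancer.go):

```go
package main

import (
	"fmt"
	"sort"
)

// cidrsEqual reports whether two allowed-CIDR lists contain the same entries,
// ignoring order. Octavia may return the list in a different order than it was
// submitted, so a naive element-by-element compare sees a spurious difference
// and triggers an LB-mutating update on every reconcile.
func cidrsEqual(desired, actual []string) bool {
	if len(desired) != len(actual) {
		return false
	}
	// Sort copies so the caller's slices are left untouched.
	d := append([]string(nil), desired...)
	a := append([]string(nil), actual...)
	sort.Strings(d)
	sort.Strings(a)
	for i := range d {
		if d[i] != a[i] {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(cidrsEqual(
		[]string{"10.0.0.0/8", "192.168.0.0/16"},
		[]string{"192.168.0.0/16", "10.0.0.0/8"},
	))
}
```

With a comparison like this, an unchanged listener is skipped entirely, so no PENDING_UPDATE cycle is started and the first problem is never hit for that listener.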

huxcrux avatar Feb 01 '24 13:02 huxcrux

/remove-lifecycle stale

huxcrux avatar Feb 01 '24 16:02 huxcrux

I have identified two problems with the loadbalancer and allowedCIDRs logic.

  1. The first is the check that waits for the LB to become active again. I do not know if it is too quick or something, but it seems to pass before the changes are applied. The code responsible for the check can be found here: https://github.com/kubernetes-sigs/cluster-api-provider-openstack/blob/5cc483bfc6eae8a8b8a67b32e9b7af0bafa473ca/pkg/cloud/services/loadbalancer/loadbalancer.go#L294-L297

Is it possible that after updating the listener the loadbalancer becomes temporarily not ACTIVE?

cc @dulek

  2. The list compare seems to be broken: even if the list of IPs matches, a new update is triggered because the list is not sorted. The code responsible for this can be found here: https://github.com/kubernetes-sigs/cluster-api-provider-openstack/blob/5cc483bfc6eae8a8b8a67b32e9b7af0bafa473ca/pkg/cloud/services/loadbalancer/loadbalancer.go#L282

As a workaround I tried putting a sleep just after the code linked under the first issue; 10 seconds seems to be just enough for everything to work. However, if the second problem is resolved I think the first one will resolve itself after a new reconcile, though this requires an additional reconcile per listener because the allowedCIDR patch is applied to each LB listener.

mdbooth avatar Feb 02 '24 14:02 mdbooth

I have identified two problems with the loadbalancer and allowedCIDRs logic.

  1. The first is the check that waits for the LB to become active again. I do not know if it is too quick or something, but it seems to pass before the changes are applied. The code responsible for the check can be found here: https://github.com/kubernetes-sigs/cluster-api-provider-openstack/blob/5cc483bfc6eae8a8b8a67b32e9b7af0bafa473ca/pkg/cloud/services/loadbalancer/loadbalancer.go#L294-L297

Is it possible that after updating the listener the loadbalancer becomes temporarily not ACTIVE?

cc @dulek

The LB will go ACTIVE -> PENDING_UPDATE -> ACTIVE when AllowedCIDRs are modified. You can generally expect that any action done on the LB or its children requires you to wait for the LB to be ACTIVE again.

Also, looking at that linked code fragment: you should never wait for the listener to become ACTIVE again, but rather for the whole LB. I think this might be causing the problems. Never wait for anything other than the LB; state management for the underlying resources is unreliable, and only the LB status counts when waiting for readiness.

Yes, Octavia API is really tough for users.

cc @huxcrux

dulek avatar Feb 08 '24 16:02 dulek

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar May 08 '24 17:05 k8s-triage-robot

/remove-lifecycle stale

EmilienM avatar May 08 '24 17:05 EmilienM

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Aug 06 '24 17:08 k8s-triage-robot

/remove-lifecycle stale

EmilienM avatar Aug 06 '24 18:08 EmilienM

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Nov 04 '24 18:11 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Dec 04 '24 19:12 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-triage-robot avatar Jan 03 '25 19:01 k8s-triage-robot

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot avatar Jan 03 '25 19:01 k8s-ci-robot