cluster-api-provider-openstack

Cluster deletion fails when nodes have no security groups applied

mkjpryor opened this issue 2 years ago • 8 comments

/kind bug

What steps did you take and what happened:

When a cluster is provisioned in such a way that none of the ports have security groups applied, then when the cluster is deleted the security groups are deleted and the cluster resource is removed while there are still nodes running. As a result, those nodes cannot be deleted because they can no longer find their infra cluster.

For example, this can happen when a cluster is provisioned where the only ports are SR-IOV ports on a VLAN. However, it is possible to deploy a cluster on any network without applying security groups to the ports by setting securityGroups: [] on the network definition for a machine.

What did you expect to happen:

For the openstackcluster to wait for all machines to be deleted before deleting.

Anything else you would like to add:

It seems like the successful deletion of the security groups is being used as a gate on whether the cluster can be deleted. A simple fix might be to add the openstackcluster as an ownerReference on the openstackmachine instances?
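For illustration only (this is not the provider's actual code), a minimal sketch of that suggestion using controller-runtime's controllerutil helpers might look like the following; the reconciler shape and the v1alpha4 API import are assumptions:

```go
// Hypothetical sketch: add the OpenStackCluster as an ownerReference on an
// OpenStackMachine so the machine object is tied to the infrastructure cluster.
// The reconciler fields and API version below are assumptions for illustration.
package controllers

import (
	"context"

	"k8s.io/apimachinery/pkg/runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"

	infrav1 "sigs.k8s.io/cluster-api-provider-openstack/api/v1alpha4"
)

type ownerRefSetter struct {
	client.Client
	Scheme *runtime.Scheme
}

// setClusterOwnerRef records the OpenStackCluster as an owner of the
// OpenStackMachine and persists the updated machine object.
func (r *ownerRefSetter) setClusterOwnerRef(ctx context.Context, cluster *infrav1.OpenStackCluster, machine *infrav1.OpenStackMachine) error {
	if err := controllerutil.SetOwnerReference(cluster, machine, r.Scheme); err != nil {
		return err
	}
	return r.Update(ctx, machine)
}
```

Note that a plain owner reference mainly drives garbage collection of the dependents; making the owner's deletion actually wait would additionally need blockOwnerDeletion with foreground cascading deletion, or an explicit check like the one sketched later in this thread.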

Environment:

  • Cluster API Provider OpenStack version (Or git rev-parse HEAD if manually built): 0.5.2
  • Cluster-API version: 1.1.0
  • OpenStack version: Train
  • Minikube/KIND version: N/A
  • Kubernetes version (use kubectl version): 1.22.6
  • OS (e.g. from /etc/os-release): Ubuntu 20.04

mkjpryor commented Feb 17 '22 16:02

For the openstackcluster to wait for all machines to be deleted before deleting.

I can understand this request, but I'm not sure I understand the security group issue here.

then when the cluster is deleted the security groups are deleted and the cluster resource is removed while there are still nodes running

Are you suggesting that the security group deletion got blocked, or that it blocked machine deletion? I don't fully understand the logic. Do you have any logs that could help with this bug report? Thanks a lot.

jichenjc commented Feb 18 '22 03:02

Are you suggesting that the security group deletion got blocked, or that it blocked machine deletion? I don't fully understand the logic. Do you have any logs that could help with this bug report? Thanks a lot.

@jichenjc It is the opposite of this actually.

The deletion of the openstackcluster and the corresponding openstackmachines proceed concurrently.

For the openstackcluster, this process entails removing the load balancer, then removing the security groups, at which point the finalizer is removed and the resource disappears from Kubernetes. Normally this blocks at removing the security groups until there are no openstackmachines left, because the machines' ports are still using the security groups, so everything works out.

However, in the case where port security is not enabled on the ports, e.g. when using SR-IOV for maximum performance, the openstackcluster is able to delete the security groups before the nodes are deleted, and the resource is removed from Kubernetes before the openstackmachines have finished deleting. This causes the openstackmachine controller to fail with messages about the infrastructure cluster not being available.

What we need is a mechanism to keep the openstackcluster around until all the openstackmachines have been deleted in all cases.
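Purely as an illustration of such a mechanism (not the provider's actual code), the cluster delete path could refuse to remove its finalizer while machines remain. The helper below is a sketch, assuming the standard cluster.x-k8s.io/cluster-name label is set on the OpenStackMachines and a v1alpha4-era API package:

```go
// Hypothetical sketch: requeue the OpenStackCluster deletion while any
// OpenStackMachines labelled for this cluster still exist, so the finalizer
// (and the resource itself) stays around until the machines are gone.
// Names, label usage and wiring are assumptions for illustration.
package controllers

import (
	"context"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"

	infrav1 "sigs.k8s.io/cluster-api-provider-openstack/api/v1alpha4"
)

// machinesRemaining reports whether any OpenStackMachines for the named cluster
// still exist, and returns a requeue result to use while they do.
func machinesRemaining(ctx context.Context, c client.Client, namespace, clusterName string) (bool, ctrl.Result, error) {
	machines := &infrav1.OpenStackMachineList{}
	if err := c.List(ctx, machines,
		client.InNamespace(namespace),
		client.MatchingLabels{"cluster.x-k8s.io/cluster-name": clusterName},
	); err != nil {
		return false, ctrl.Result{}, err
	}
	if len(machines.Items) > 0 {
		// Machines left: do not delete security groups or remove the finalizer yet.
		return true, ctrl.Result{RequeueAfter: 15 * time.Second}, nil
	}
	// No machines left: safe to proceed with security group and cluster cleanup.
	return false, ctrl.Result{}, nil
}
```

The label key is the one cluster-api applies to objects it manages for a cluster; the requeue interval is arbitrary.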

Does that make more sense?

mkjpryor commented Apr 12 '22 10:04

@mkjpryor could you please explain how you are triggering the deletion?

The cluster-api book recommends deleting all resources via the cluster object: https://cluster-api.sigs.k8s.io/user/quick-start.html#clean-up to "ensure a proper cleanup of your infrastructure".

IMPORTANT: In order to ensure a proper cleanup of your infrastructure you must always delete the cluster object. Deleting the entire cluster template with kubectl delete -f capi-quickstart.yaml might lead to pending resources to be cleaned up manually.

chrischdi commented Apr 12 '22 13:04

The deletion of the openstackcluster and the corresponding openstackmachines proceed concurrently.

Thanks for the detailed info. So if I understand correctly, it's not really related to security groups; they are just the trigger. It's fundamentally a timing issue, and the security group deletion wait time only mitigates the problem by accident.

So please share the method you use for the deletion as a reference. I also think the following seems unreasonable, as the deletion of machines should be part of the cluster deletion process. Compared to creation, deletion is not that time consuming, so we might accept serial actions for the key parts (e.g. LB first, then machines, then others, and the cluster itself last).

The deletion of the openstackcluster and the corresponding openstackmachines proceed concurrently.

jichenjc commented Apr 13 '22 01:04

@mkjpryor Is this still an issue for you?

apricote commented Jun 15 '22 13:06

This is still a problem. We have a workaround that works for our specific use case, but it uses mechanisms outside of CAPI, so it is not really a fix.

mkjpryor commented Jun 15 '22 14:06

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented Sep 13 '22 14:09

/remove-lifecycle stale

jichenjc commented Sep 13 '22 23:09

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented Dec 13 '22 00:12

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot commented Jan 12 '23 01:01

@mkjpryor Even with the hint https://github.com/kubernetes-sigs/cluster-api-provider-openstack/issues/1143#issuecomment-1096704535 mentioned by @chrischdi, are you still facing the issue? I only ask in case you missed it.

tobiasgiese commented Jan 14 '23 12:01

I think this can be closed as I don't have the issue when things are deleted in the correct order.

mkjpryor commented Jan 16 '23 08:01