cluster-api-provider-openstack
Cluster deletion fails when nodes have no security groups applied
/kind bug
What steps did you take and what happened:
When a cluster is provisioned such that none of the ports have security groups applied, deleting the cluster causes the security groups to be deleted and the cluster resource to be removed while there are still nodes running. Those nodes are then unable to delete, as they cannot find their infra cluster.
For example, this can happen when a cluster is provisioned where the only ports are SR-IOV ports on a VLAN. However, it is possible to deploy a cluster on any network without applying security groups to the ports by setting `securityGroups: []` on the network definition for a machine.
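A minimal sketch of such a machine network definition (field names and surrounding structure are illustrative and may differ between CAPO API versions):

```yaml
# Hypothetical machine network definition: the empty
# securityGroups list means no security groups are applied
# to the machine's port, so deleting the cluster's security
# groups is never blocked by this port.
networks:
  - uuid: 00000000-0000-0000-0000-000000000000  # placeholder network UUID
    securityGroups: []
```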
What did you expect to happen:
For the `openstackcluster` to wait for all machines to be deleted before deleting.
Anything else you would like to add:
It seems like the successful deletion of the security groups is being used as a gate for whether the cluster can be deleted. A simple fix could be to add the `openstackcluster` as an `ownerReference` of the `openstackmachine` instances?
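If that suggestion were taken, the metadata on each `openstackmachine` might look roughly like this (a hedged sketch: the names, UID, and apiVersion are placeholders, and `blockOwnerDeletion` only gates owner removal under foreground cascading deletion):

```yaml
# Hypothetical OpenStackMachine metadata with the
# OpenStackCluster as an owner, tying their lifecycles
# together for garbage collection.
apiVersion: infrastructure.cluster.x-k8s.io/v1alpha4
kind: OpenStackMachine
metadata:
  name: example-machine
  ownerReferences:
    - apiVersion: infrastructure.cluster.x-k8s.io/v1alpha4
      kind: OpenStackCluster
      name: example-cluster
      uid: 00000000-0000-0000-0000-000000000000  # placeholder UID
      blockOwnerDeletion: true
```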
Environment:
- Cluster API Provider OpenStack version (or `git rev-parse HEAD` if manually built): 0.5.2
- Cluster-API version: 1.1.0
- OpenStack version: Train
- Minikube/KIND version: N/A
- Kubernetes version (use `kubectl version`): 1.22.6
- OS (e.g. from `/etc/os-release`): Ubuntu 20.04
> For the `openstackcluster` to wait for all machines to be deleted before deleting.
I can understand this request, but I'm not sure I understand the security group issue here.
> …then when the cluster is deleted the security groups are deleted and the cluster is removed while there are still nodes running
Are you suggesting that the security group deletion got blocked, or that it blocked the machine deletion? I don't fully understand the logic. Do you have any logs that could help this bug report? Thanks a lot.
@jichenjc It is the opposite of this actually.
The deletion of the `openstackcluster` and the corresponding `openstackmachine`s proceed concurrently.
For the `openstackcluster`, this process entails removing the load balancer, then removing the security groups, at which point the finalizer is removed and the resource disappears from Kubernetes. Normally, this blocks at removing the security groups until there are no `openstackmachine`s left, because the ports for the machines are using the security groups, and everything is fine.
However, in the case where port security is not enabled on the ports, e.g. when using SR-IOV for maximum performance, the `openstackcluster` is able to delete the security groups before the nodes are deleted, and the resource is removed from Kubernetes before the `openstackmachine`s have finished deleting. This causes the `openstackmachine` controller to fail with messages about the infrastructure cluster not being available.
What we need is a mechanism to keep the `openstackcluster` around until all the `openstackmachine`s have been deleted, in all cases.
Does that make more sense?
@mkjpryor could you please explain how you are triggering the deletion?
The cluster-api book recommends deleting all resources via the `cluster` object, to "ensure a proper cleanup of your infrastructure": https://cluster-api.sigs.k8s.io/user/quick-start.html#clean-up
> IMPORTANT: In order to ensure a proper cleanup of your infrastructure you must always delete the cluster object. Deleting the entire cluster template with `kubectl delete -f capi-quickstart.yaml` might lead to pending resources to be cleaned up manually.
> The deletion of the `openstackcluster` and the corresponding `openstackmachine`s proceed concurrently.
Thanks for the detailed info. So if I understand correctly, it's not really related to security groups; that's just the trigger. It's a timing issue either way, and the security group deletion wait time only mitigates the problem by accident.

So please share the method you use for the deletion as a reference. And I think the following seems unreasonable, as the deletion of machines should be part of the cluster deletion process. Compared to creation, deletion is not that time consuming, so we might accept serial actions for the key parts (e.g. LB first, then machines, then others, and at last the cluster itself).
> The deletion of the `openstackcluster` and the corresponding `openstackmachine`s proceed concurrently.
@mkjpryor Is this still an issue for you?
This is still a problem. We have a workaround that works for our specific use case, but it uses mechanisms outside of CAPI, so it is not really a fix.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:
- Mark this issue or PR as fresh with `/remove-lifecycle stale`
- Mark this issue or PR as rotten with `/lifecycle rotten`
- Close this issue or PR with `/close`
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale
/remove-lifecycle stale
/lifecycle stale
/lifecycle rotten
@mkjpryor And even with the hint mentioned in https://github.com/kubernetes-sigs/cluster-api-provider-openstack/issues/1143#issuecomment-1096704535 from @chrischdi, are you still facing the issue? I only ask in case you missed it.
I think this can be closed as I don't have the issue when things are deleted in the correct order.