Ensure reconcileLoadBalancedEndpoint Function Gets Called Despite Setting of ControlPlaneEndpoint
/kind bug
What steps did you take and what happened: The current functionality of CAPV bypasses reconcileLoadBalancedEndpoint if the ControlPlaneEndpoint is already set on either the Cluster or the VSphereCluster. However, undesirable scenarios may arise, such as the accidental deletion of the VirtualMachineService backing the cluster's control plane service, which leaves the ControlPlaneEndpoint unreachable.
What did you expect to happen: When reconciling the VSphereCluster, CAPV should execute reconcileLoadBalancedEndpoint so that it can re-create the VirtualMachineService if necessary.
Anything else you would like to add: Related code link: https://github.com/kubernetes-sigs/cluster-api-provider-vsphere/blob/main/controllers/vmware/vspherecluster_reconciler.go#L203
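For illustration, here is a minimal Go sketch of the short-circuit being described, with the proposed change alongside. The types and function bodies below are stand-ins for the real CAPV structures, not the actual code behind the link above; only the control flow mirrors the reported behavior.

```go
package vmware

// Illustrative stand-ins for the real CAPV types.
type endpoint struct {
	Host string
	Port int32
}

func (e endpoint) IsZero() bool { return e.Host == "" && e.Port == 0 }

type clusterContext struct {
	clusterEndpoint        endpoint // Cluster.Spec.ControlPlaneEndpoint
	vsphereClusterEndpoint endpoint // VSphereCluster.Spec.ControlPlaneEndpoint
}

// Current behavior: once either endpoint is set, the load-balanced endpoint
// reconciliation is skipped, so a deleted VirtualMachineService is never
// re-created.
func reconcileEndpointCurrent(ctx clusterContext) error {
	if !ctx.clusterEndpoint.IsZero() || !ctx.vsphereClusterEndpoint.IsZero() {
		return nil
	}
	return reconcileLoadBalancedEndpoint(ctx)
}

// Proposed behavior: always run the reconciliation and make it idempotent,
// re-creating the VirtualMachineService when it is missing.
func reconcileEndpointProposed(ctx clusterContext) error {
	return reconcileLoadBalancedEndpoint(ctx)
}

// Placeholder for the real logic: ensure the VirtualMachineService exists,
// then derive the control plane endpoint from its load balancer IP.
func reconcileLoadBalancedEndpoint(ctx clusterContext) error {
	return nil
}
```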
Environment:
- Cluster-api-provider-vsphere version: v1.4.2
- Kubernetes version (use `kubectl version`): v1.26.4
- OS (e.g. from `/etc/os-release`): Photon 3.0
Sounds good! Feel free to open a PR
Good finding! For the record, the current workaround to recover the whole cluster:
- Manually create a VirtualMachineService (sketched below) and get the new accessible LoadBalancer IP address.
- Update VSphereCluster.Spec.ControlPlaneEndpoint with that IP address.
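For reference, a hedged sketch of what that replacement VirtualMachineService could look like when built with the vm-operator v1alpha1 Go API. Every concrete value here (name, namespace, selector labels, ports) is an assumption for illustration and would have to match what CAPV originally created for the cluster; a user would typically apply the equivalent YAML instead.

```go
package main

import (
	vmopv1 "github.com/vmware-tanzu/vm-operator-api/api/v1alpha1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// buildControlPlaneVMService sketches the replacement object. All concrete
// values are hypothetical and must be taken from the original cluster's spec.
func buildControlPlaneVMService() *vmopv1.VirtualMachineService {
	return &vmopv1.VirtualMachineService{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "my-cluster-control-plane-service", // hypothetical
			Namespace: "my-namespace",                     // hypothetical
		},
		Spec: vmopv1.VirtualMachineServiceSpec{
			Type: vmopv1.VirtualMachineServiceTypeLoadBalancer,
			Ports: []vmopv1.VirtualMachineServicePort{{
				Name:       "apiserver",
				Protocol:   "TCP",
				Port:       6443, // API server port; verify against the cluster
				TargetPort: 6443,
			}},
			// Must match the labels on the control plane VirtualMachines.
			Selector: map[string]string{
				"cluster.x-k8s.io/cluster-name": "my-cluster", // hypothetical
			},
		},
	}
}

func main() {
	// Apply via a controller-runtime client, or write the equivalent YAML.
	_ = buildControlPlaneVMService()
}
```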
Wait, doesn't this change the control plane endpoint? Does this actually recover the cluster? Aren't all our kubelets using this endpoint?
Sorry, please ignore this... it works in my local env because the LoadBalancer IP doesn't change...
After discussion with @zhanggbj, we feel it would be better to block the deletion instead of re-creating.
Yup makes sense. I'm wondering how exactly :)
As this is a CRD of another controller, not sure how we can/should block the deletion
(we should probably talk to the vm-operator folks, IIRC that the CRD is from there)
The tricky part is how to distinguish between an intended and an accidental delete.
A different approach would be to make this issue easier to discover, e.g. by surfacing a condition on the VSphereCluster object saying that the VirtualMachineService no longer exists, and for this first iteration still require fixing it manually.
From what I understand, the vm-operator isn't currently set up to handle VirtualMachineService deletions in its webhook: https://github.com/vmware-tanzu/vm-operator/blob/main/webhooks/virtualmachineservice/v1alpha2/validation/virtualmachineservice_validator.go#L103 Since CAPV is already in charge of creating the VirtualMachineService for the TKG control plane, what if we also let CAPV guard against accidental deletions using a webhook registered through a ValidatingWebhookConfiguration? (A rough sketch follows below.)
Also, @chrischdi’s idea about using a condition inside VSphereCluster objects could work well in parallel, giving us an extra layer of visibility into this issue.
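To make the webhook idea concrete, here is a rough controller-runtime sketch of a CAPV-side handler that denies DELETE requests for a VirtualMachineService while its cluster is still alive. The type name and the ownership lookup are hypothetical (no such component exists in CAPV), and a real implementation would still have to distinguish intended from accidental deletes.

```go
package webhooks

import (
	"context"
	"fmt"
	"net/http"

	admissionv1 "k8s.io/api/admission/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/webhook/admission"
)

// VMServiceDeletionBlocker is a hypothetical CAPV-side validating webhook.
type VMServiceDeletionBlocker struct {
	Client client.Client
}

func (v *VMServiceDeletionBlocker) Handle(ctx context.Context, req admission.Request) admission.Response {
	// Only DELETE requests are of interest here.
	if req.Operation != admissionv1.Delete {
		return admission.Allowed("")
	}
	// Hypothetical lookup: is there still a live VSphereCluster that owns
	// this VirtualMachineService?
	owned, err := v.ownedByLiveCluster(ctx, req.Namespace, req.Name)
	if err != nil {
		return admission.Errored(http.StatusInternalServerError, err)
	}
	if owned {
		return admission.Denied(fmt.Sprintf(
			"VirtualMachineService %s/%s backs a control plane endpoint; delete the Cluster instead",
			req.Namespace, req.Name))
	}
	return admission.Allowed("")
}

// ownedByLiveCluster is a stub; a real implementation would resolve the
// owning VSphereCluster (e.g. via owner references) and check whether it
// has a deletion timestamp.
func (v *VMServiceDeletionBlocker) ownedByLiveCluster(ctx context.Context, namespace, name string) (bool, error) {
	return true, nil
}
```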
> The tricky part is how to distinguish between an intended and an accidental delete.

(independent of where we implement the webhook)
Considering that the cluster has a dependency on the VirtualMachineService, and that the VSphereCluster creates the VirtualMachineService but it's vm-operator who manages it, I'm thinking on the CAPV side:
- Add a finalizer to the VirtualMachineService when the VSphereCluster creates it.
- Remove the finalizer when deleting the VSphereCluster.
- Reach out to the vm-operator team to inform them about this finalizer.
Regarding the ValidatingWebhook, I think it's more appropriate for the controller that manages the CR to implement one. In this case that's vm-operator, but the validation needs cluster context to decide whether to prevent deletion, and vm-operator is a separate controller, so I think it's more appropriate to add the finalizer on the CAPV side.
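A minimal sketch of that finalizer flow using controller-runtime's controllerutil helpers; the finalizer name is hypothetical and would need agreement with the vm-operator team.

```go
package vmware

import (
	"context"

	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
)

// Hypothetical finalizer name; CAPV defines no such finalizer today.
const vmServiceFinalizer = "vspherecluster.vmware.infrastructure.cluster.x-k8s.io/vmservice-protection"

// addProtection attaches the finalizer when CAPV creates the
// VirtualMachineService; AddFinalizer reports whether an update is needed.
func addProtection(ctx context.Context, c client.Client, vmService client.Object) error {
	if controllerutil.AddFinalizer(vmService, vmServiceFinalizer) {
		return c.Update(ctx, vmService)
	}
	return nil
}

// removeProtection drops the finalizer while the owning VSphereCluster is
// being deleted, letting the VirtualMachineService be removed normally.
func removeProtection(ctx context.Context, c client.Client, vmService client.Object) error {
	if controllerutil.RemoveFinalizer(vmService, vmServiceFinalizer) {
		return c.Update(ctx, vmService)
	}
	return nil
}
```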
I think the idea of finalizers is to signal "oh, there is some cleanup work to do for me here", not to block other controllers from doing their deletion.
Yes, I agree, that should indeed be the intention of the finalizer. However, I think it could be beneficial in this scenario to prevent the deletion of the VirtualMachineService until the VSphereCluster is deleted.
Another option is adding a ValidationWebhook within CAPV to reject delete requests for VirtualMachineService, as previously discussed.
I don't have strong opinions on either of these options. It appears to me that both options are trying to interfere with a CR managed by a separate controller, and both would require coordination with the vm-operator team.
Just adding a finaliser could still lead to users deleting the VirtualMachineService. They can still use `kubectl delete --force --grace-period=0` to delete the object, or `kubectl delete` + remove the finaliser.
Also: blocking the final deletion of the object with a finaliser may just lead to "the object is still around", but the VirtualMachineService may already be deprovisioned, or we may have other side effects.
I think the scope for CAPV for this issue ("VirtualMachineService is gone" without CAPV having deleted it) should be to handle all scenarios well:
- If the intent is to delete the Cluster: the user should still be able to delete it (triggered via `kubectl delete cluster`, if that's not already the case).
- If the intent is to keep the cluster: signal the issue (sketched below) and let the user fix it manually (because there is no automatic way to solve this?!)
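For the "signal the issue" path, a hedged sketch using Cluster API's conditions utilities; the condition type and reason names here are illustrative, not identifiers CAPV currently defines.

```go
package vmware

import (
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/cluster-api/util/conditions"
)

// Hypothetical names: CAPV may define similar condition types already, but
// these exact identifiers are illustrative only.
const (
	loadBalancerReadyCondition      clusterv1.ConditionType = "LoadBalancerReady"
	virtualMachineServiceGoneReason                         = "VirtualMachineServiceNotFound"
)

// markEndpointBroken surfaces the missing VirtualMachineService on the
// VSphereCluster (anything implementing conditions.Setter) instead of
// leaving the endpoint silently unreachable.
func markEndpointBroken(obj conditions.Setter) {
	conditions.MarkFalse(obj,
		loadBalancerReadyCondition,
		virtualMachineServiceGoneReason,
		clusterv1.ConditionSeverityError,
		"VirtualMachineService backing the control plane endpoint no longer exists; manual intervention required")
}
```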
> Also: blocking the final deletion of the object with a finaliser may just lead to "the object is still around", but the VirtualMachineService may already be deprovisioned or we may have other side effects.
Yup that's the point I was trying to make. We can keep the object around, but vm-operator will still go through the deletion. And it would be a strange dependency if the vm-operator skips deletion if our finalizer is present
If vm-operator itself provides some sort of hook to block deletion, that's a different discussion (e.g. like CAPI has a before cluster delete hook)
Now I get it. The key point is that adding a finalizer cannot prevent the deletion of the underlying resources by vm-operator, so it is not a viable option to safeguard the VirtualMachineService from unintended deletions.
About the other option mentioned above, if I understand it correctly, it leverages a pre-deletion hook approach: by adding some annotation, the vm-operator would recognize it and pause the reconcile/delete until CAPV removes the annotation. Obviously this needs discussion with the vm-operator team.
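A tiny sketch of what that contract could look like on the vm-operator side; the annotation key is entirely hypothetical and would have to be agreed between the two projects.

```go
package hooks

import "sigs.k8s.io/controller-runtime/pkg/client"

// Hypothetical annotation key; vm-operator would need to honor it during
// deletion for this scheme to work.
const deletionHoldAnnotation = "capv.infrastructure.cluster.x-k8s.io/deletion-hold"

// shouldHoldDeletion sketches the vm-operator-side check: while CAPV keeps
// the annotation on the VirtualMachineService, deprovisioning is paused;
// removing the annotation lets deletion proceed.
func shouldHoldDeletion(vmService client.Object) bool {
	_, held := vmService.GetAnnotations()[deletionHoldAnnotation]
	return held
}
```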
Yup, that would be the difference.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:
- Mark this issue as fresh with `/remove-lifecycle stale`
- Close this issue with `/close`
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:
- Mark this issue as fresh with `/remove-lifecycle rotten`
- Close this issue with `/close`
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues according to the following rules:
- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:
- Reopen this issue with `/reopen`
- Mark this issue as fresh with `/remove-lifecycle rotten`
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.