Ensure reconcileLoadBalancedEndpoint Function Gets Called Despite Setting of ControlPlaneEndpoint
/kind bug
What steps did you take and what happened: The current functionality of CAPV bypasses reconcileLoadBalancedEndpoint if the ControlPlaneEndpoint is already set on either the Cluster or the VSphereCluster. However, undesirable scenarios may arise, such as the accidental deletion of the VirtualMachineService backing the cluster's control plane service, which leaves the ControlPlaneEndpoint unreachable.
What did you expect to happen: When reconciling the VSphereCluster, CAPV should execute reconcileLoadBalancedEndpoint so that it can re-create the VirtualMachineService if necessary.
Anything else you would like to add: Related code link: https://github.com/kubernetes-sigs/cluster-api-provider-vsphere/blob/main/controllers/vmware/vspherecluster_reconciler.go#L203
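For illustration, here is a minimal Go sketch of the short-circuit being described, with the proposed change alongside. The types and function bodies below are stand-ins for the real CAPV structures, not the actual code behind the link above; only the control flow mirrors the reported behavior.

```go
package vmware

// Illustrative stand-ins for the real CAPV types.
type endpoint struct {
	Host string
	Port int32
}

func (e endpoint) IsZero() bool { return e.Host == "" && e.Port == 0 }

type clusterContext struct {
	clusterEndpoint        endpoint // Cluster.Spec.ControlPlaneEndpoint
	vsphereClusterEndpoint endpoint // VSphereCluster.Spec.ControlPlaneEndpoint
}

// Current behavior: once either endpoint is set, the load-balanced endpoint
// reconciliation is skipped, so a deleted VirtualMachineService is never
// re-created.
func reconcileEndpointCurrent(ctx clusterContext) error {
	if !ctx.clusterEndpoint.IsZero() || !ctx.vsphereClusterEndpoint.IsZero() {
		return nil
	}
	return reconcileLoadBalancedEndpoint(ctx)
}

// Proposed behavior: always run the reconciliation and make it idempotent,
// re-creating the VirtualMachineService when it is missing.
func reconcileEndpointProposed(ctx clusterContext) error {
	return reconcileLoadBalancedEndpoint(ctx)
}

// Placeholder for the real logic: ensure the VirtualMachineService exists,
// then derive the control plane endpoint from its load balancer IP.
func reconcileLoadBalancedEndpoint(ctx clusterContext) error {
	return nil
}
```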
Environment:
- Cluster-api-provider-vsphere version: v1.4.2
- Kubernetes version (use `kubectl version`): v1.26.4
- OS (e.g. from `/etc/os-release`): Photon 3.0
Sounds good! Feel free to open a PR
Good finding! For the record, the current workaround to recover the whole cluster:
- Manually create a VirtualMachineService (sketched below) and get the new accessible LoadBalancer IP address.
- Update VSphereCluster.Spec.ControlPlaneEndpoint with that IP address.
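For reference, a hedged sketch of what that replacement VirtualMachineService could look like when built with the vm-operator v1alpha1 Go API. Every concrete value here (name, namespace, selector labels, ports) is an assumption for illustration and would have to match what CAPV originally created for the cluster; a user would typically apply the equivalent YAML instead.

```go
package main

import (
	vmopv1 "github.com/vmware-tanzu/vm-operator-api/api/v1alpha1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// buildControlPlaneVMService sketches the replacement object. All concrete
// values are hypothetical and must be taken from the original cluster's spec.
func buildControlPlaneVMService() *vmopv1.VirtualMachineService {
	return &vmopv1.VirtualMachineService{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "my-cluster-control-plane-service", // hypothetical
			Namespace: "my-namespace",                     // hypothetical
		},
		Spec: vmopv1.VirtualMachineServiceSpec{
			Type: vmopv1.VirtualMachineServiceTypeLoadBalancer,
			Ports: []vmopv1.VirtualMachineServicePort{{
				Name:       "apiserver",
				Protocol:   "TCP",
				Port:       6443, // API server port; verify against the cluster
				TargetPort: 6443,
			}},
			// Must match the labels on the control plane VirtualMachines.
			Selector: map[string]string{
				"cluster.x-k8s.io/cluster-name": "my-cluster", // hypothetical
			},
		},
	}
}

func main() {
	// Apply via a controller-runtime client, or write the equivalent YAML.
	_ = buildControlPlaneVMService()
}
```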
Wait, doesn't this change the control plane endpoint? Does this actually recover the cluster? Aren't all our kubelets using this endpoint?
Sorry, please ignore this... it works in my local env because the LoadBalancer IP doesn't change...
After discussion with @zhanggbj, we feel it would be better to block the deletion instead of re-creating.
Yup makes sense. I'm wondering how exactly :)
As this is a CRD of another controller, not sure how we can/should block the deletion
(we should probably talk to the vm-operator folks, IIRC that the CRD is from there)
The tricky part is how to distinguish between an intended and an accidental delete.
A different approach would be to make this issue easier to discover, e.g. by surfacing a condition on the VSphereCluster object saying that the VirtualMachineService no longer exists, and for this first iteration still require fixing it manually.
From what I understand, the vm-operator isn't currently set up to handle VirtualMachineService deletions in its webhook: https://github.com/vmware-tanzu/vm-operator/blob/main/webhooks/virtualmachineservice/v1alpha2/validation/virtualmachineservice_validator.go#L103 Since CAPV is already in charge of creating the VirtualMachineService for the TKG control plane, what if we also let CAPV guard against accidental deletions using a webhook registered through a ValidatingWebhookConfiguration? (A rough sketch follows below.)
Also, @chrischdi’s idea about using a condition inside VSphereCluster objects could work well in parallel, giving us an extra layer of visibility into this issue.
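To make the webhook idea concrete, here is a rough controller-runtime sketch of a CAPV-side handler that denies DELETE requests for a VirtualMachineService while its cluster is still alive. The type name and the ownership lookup are hypothetical (no such component exists in CAPV), and a real implementation would still have to distinguish intended from accidental deletes.

```go
package webhooks

import (
	"context"
	"fmt"
	"net/http"

	admissionv1 "k8s.io/api/admission/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/webhook/admission"
)

// VMServiceDeletionBlocker is a hypothetical CAPV-side validating webhook.
type VMServiceDeletionBlocker struct {
	Client client.Client
}

func (v *VMServiceDeletionBlocker) Handle(ctx context.Context, req admission.Request) admission.Response {
	// Only DELETE requests are of interest here.
	if req.Operation != admissionv1.Delete {
		return admission.Allowed("")
	}
	// Hypothetical lookup: is there still a live VSphereCluster that owns
	// this VirtualMachineService?
	owned, err := v.ownedByLiveCluster(ctx, req.Namespace, req.Name)
	if err != nil {
		return admission.Errored(http.StatusInternalServerError, err)
	}
	if owned {
		return admission.Denied(fmt.Sprintf(
			"VirtualMachineService %s/%s backs a control plane endpoint; delete the Cluster instead",
			req.Namespace, req.Name))
	}
	return admission.Allowed("")
}

// ownedByLiveCluster is a stub; a real implementation would resolve the
// owning VSphereCluster (e.g. via owner references) and check whether it
// has a deletion timestamp.
func (v *VMServiceDeletionBlocker) ownedByLiveCluster(ctx context.Context, namespace, name string) (bool, error) {
	return true, nil
}
```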
> The tricky part is how to distinguish between an intended and an accidental delete.

(independent of where we implement the webhook)
Considering that the cluster has a dependency on the VirtualMachineService, and that the VSphereCluster creates the VirtualMachineService but it's vm-operator who manages it, I'm thinking on the CAPV side:
- Add a finalizer to the VirtualMachineService when the VSphereCluster creates it.
- Remove the finalizer when deleting the VSphereCluster.
- Reach out to the vm-operator team to inform them about this finalizer.
Regarding the ValidatingWebhook, I think it's more appropriate for the controller that manages the CR to implement one. In this case that's vm-operator, but the validation needs cluster context to decide whether to prevent deletion, and vm-operator is a separate controller, so I think it's more appropriate to add the finalizer on the CAPV side.
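A minimal sketch of that finalizer flow using controller-runtime's controllerutil helpers; the finalizer name is hypothetical and would need agreement with the vm-operator team.

```go
package vmware

import (
	"context"

	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
)

// Hypothetical finalizer name; CAPV defines no such finalizer today.
const vmServiceFinalizer = "vspherecluster.vmware.infrastructure.cluster.x-k8s.io/vmservice-protection"

// addProtection attaches the finalizer when CAPV creates the
// VirtualMachineService; AddFinalizer reports whether an update is needed.
func addProtection(ctx context.Context, c client.Client, vmService client.Object) error {
	if controllerutil.AddFinalizer(vmService, vmServiceFinalizer) {
		return c.Update(ctx, vmService)
	}
	return nil
}

// removeProtection drops the finalizer while the owning VSphereCluster is
// being deleted, letting the VirtualMachineService be removed normally.
func removeProtection(ctx context.Context, c client.Client, vmService client.Object) error {
	if controllerutil.RemoveFinalizer(vmService, vmServiceFinalizer) {
		return c.Update(ctx, vmService)
	}
	return nil
}
```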
I think the idea of finalizers is to signal "oh, there is some cleanup work to do for me here", not to block other controllers from doing their deletion.
Yes, I agree, that should indeed be the intention of the finalizer. However, I think it could be beneficial in this scenario to prevent the deletion of the VirtualMachineService until the VSphereCluster is deleted.
Another option is adding a ValidationWebhook within CAPV to reject delete requests for VirtualMachineService, as previously discussed.
I don't have strong opinions on either of these options. It appears to me that both options are trying to interfere with a CR managed by a separate controller, and both would require coordination with the vm-operator team.
Just adding a finaliser could still lead to users deleting the VirtualMachineService. They can still use `kubectl delete --force --grace-period=0` to delete the object, or `kubectl delete` + remove the finaliser.
Also: blocking the final deletion of the object with a finaliser may just lead to "the object is still around", but the VirtualMachineService may already be deprovisioned, or we may have other side effects.
I think the scope for CAPV for this issue ("VirtualMachineService is gone" without CAPV having deleted it) should be to handle all scenarios well:
- If the intent is to delete the Cluster: the user should still be able to delete it (triggered via `kubectl delete cluster`, if that's not already the case).
- If the intent is to keep the cluster: signal the issue (sketched below) and let the user fix it manually (because there is no automatic way to solve this?!)
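For the "signal the issue" path, a hedged sketch using Cluster API's conditions utilities; the condition type and reason names here are illustrative, not identifiers CAPV currently defines.

```go
package vmware

import (
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/cluster-api/util/conditions"
)

// Hypothetical names: CAPV may define similar condition types already, but
// these exact identifiers are illustrative only.
const (
	loadBalancerReadyCondition      clusterv1.ConditionType = "LoadBalancerReady"
	virtualMachineServiceGoneReason                         = "VirtualMachineServiceNotFound"
)

// markEndpointBroken surfaces the missing VirtualMachineService on the
// VSphereCluster (anything implementing conditions.Setter) instead of
// leaving the endpoint silently unreachable.
func markEndpointBroken(obj conditions.Setter) {
	conditions.MarkFalse(obj,
		loadBalancerReadyCondition,
		virtualMachineServiceGoneReason,
		clusterv1.ConditionSeverityError,
		"VirtualMachineService backing the control plane endpoint no longer exists; manual intervention required")
}
```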
> Also: blocking the final deletion of the object with a finaliser may just lead to "the object is still around", but the VirtualMachineService may already be deprovisioned or we may have other side effects.
Yup that's the point I was trying to make. We can keep the object around, but vm-operator will still go through the deletion. And it would be a strange dependency if the vm-operator skips deletion if our finalizer is present
If vm-operator itself provides some sort of hook to block deletion, that's a different discussion (e.g. like CAPI has a before cluster delete hook)
Now I get it. The key point is that adding a finalizer cannot prevent the deletion of the underlying resources by vm-operator, so it is not a viable option to safeguard the VirtualMachineService from unintended deletions.
About the other option mentioned above, if I understand it correctly, it leverages a pre-deletion hook approach: by adding some annotation, the vm-operator would recognize it and pause the reconcile/delete until CAPV removes the annotation. Obviously this needs discussion with the vm-operator team.
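A tiny sketch of what that contract could look like on the vm-operator side; the annotation key is entirely hypothetical and would have to be agreed between the two projects.

```go
package hooks

import "sigs.k8s.io/controller-runtime/pkg/client"

// Hypothetical annotation key; vm-operator would need to honor it during
// deletion for this scheme to work.
const deletionHoldAnnotation = "capv.infrastructure.cluster.x-k8s.io/deletion-hold"

// shouldHoldDeletion sketches the vm-operator-side check: while CAPV keeps
// the annotation on the VirtualMachineService, deprovisioning is paused;
// removing the annotation lets deletion proceed.
func shouldHoldDeletion(vmService client.Object) bool {
	_, held := vmService.GetAnnotations()[deletionHoldAnnotation]
	return held
}
```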
Yup, that would be the difference.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:
- Mark this issue as fresh with `/remove-lifecycle stale`
- Close this issue with `/close`
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:
- Mark this issue as fresh with `/remove-lifecycle rotten`
- Close this issue with `/close`
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues according to the following rules:
- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:
- Reopen this issue with `/reopen`
- Mark this issue as fresh with `/remove-lifecycle rotten`
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.