
CAPV should be able to reconcile when `The vm was removed from infra`

Open · joshrosso opened this issue 3 years ago · 9 comments

/kind feature

Description

When a vspheremachine object no longer has a corresponding VM in the infrastructure, it gets into a state where capv-manager reports `The vm was removed from infra`. As far as I can tell, this creates a deadlock: attempting to delete the Machine object leaves it stuck in a Deleting state.

For example, consider a Machine : wkload-ale-md-0-9747bd4d5-lrfbs

Machine : wkload-ale-md-0-9747bd4d5-lrfbs has an infrastructure ref to vspheremachine : wkload-ale-worker-zw8m5.

Observations:

  • Machine : wkload-ale-md-0-9747bd4d5-lrfbs is stuck in a Deleting state.
  • vspheremachine: wkload-ale-worker-zw8m5 has a failure message.
    Failure Message:         Unable to find VM by BIOS UUID 420c97cc-c09c-4684-45c1-ba607dac0b5a. The vm was removed from infra
    Failure Reason:          UpdateError
    
  • capv-manager reports that it is skipping reconciliation (a sketch of this check follows this list):
    I0202 14:16:46.217370       1 vspheremachine_controller.go:313] capv-controller-manager/vspheremachine-controller/default/wkload-ale-worker-zw8m5 "msg"="Error state detected, skipping reconciliation"  
    
  • After 24 hours, no progress is made.
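
For context, the controller appears to treat a populated failureReason/failureMessage as a terminal state and short-circuits reconciliation, which is what makes the condition sticky. A minimal, self-contained sketch of that behavior (stand-in types, not the actual CAPV source):

    package main

    import "fmt"

    // Stand-in types for illustration only; the real failure fields live on the
    // VSphereMachine status in the provider's API package.
    type VSphereMachineStatus struct {
        FailureReason  *string
        FailureMessage *string
    }

    type VSphereMachine struct {
        Status VSphereMachineStatus
    }

    // reconcile mirrors the short-circuit that produces the
    // "Error state detected, skipping reconciliation" log line: once either
    // failure field is set, the object is treated as terminal and is never
    // reconciled again.
    func reconcile(m *VSphereMachine) {
        if m.Status.FailureReason != nil || m.Status.FailureMessage != nil {
            fmt.Println("Error state detected, skipping reconciliation")
            return
        }
        // ...normal VM reconciliation would continue here...
    }

    func main() {
        reason := "UpdateError"
        msg := "Unable to find VM by BIOS UUID 420c97cc-c09c-4684-45c1-ba607dac0b5a. The vm was removed from infra"
        reconcile(&VSphereMachine{Status: VSphereMachineStatus{
            FailureReason:  &reason,
            FailureMessage: &msg,
        }})
    }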

Workaround

  1. Delete the vspheremachine object (the machine object will remain stuck!)

    kubectl delete vspheremachine wkload-ale-worker-zw8m5
    
  2. Remove the finalizer from the corresponding machine

    kubectl patch machine/wkload-ale-md-0-9747bd4d5-lrfbs -p '{"metadata":{"finalizers":[]}}' --type=merge
    
  3. Wait. In ~1 minute, the machine should finally be removed.

Anything else you would like to add:

I was helping a user in the #tanzu-community-edition channel on Kubernetes Slack. That project uses cluster-api-provider-vsphere under the hood.

cc @alescuderi

Environment:

  • Cluster-api-provider-vsphere version:
  • Kubernetes version: (use kubectl version):
  • OS (e.g. from /etc/os-release):

joshrosso avatar Feb 02 '22 15:02 joshrosso

@joshrosso I might be missing the historical context, but in the past we have never allowed direct modification of the underlying VM in vCenter. This is a special case of VM modification in which the VM might have been deleted or moved around.

From the user's point of view, if the VM is gone, there is no point in keeping the objects around. We could replace it with a new VM, delete the CAPV object, or perform some other operation. Let me think about what the resolution should be.

cc: @yastij @timmycarr looking for inputs here.

srm09 avatar Feb 03 '22 20:02 srm09

/kind api-change

srm09 avatar Feb 03 '22 20:02 srm09

@yastij @timmycarr what would be a better resolution to handle this cleanup? Should we create new VMs, or let the VMs be deleted on their own?

scdubey avatar Feb 18 '22 20:02 scdubey

/assign @scdubey
Can you think of a way we could reconcile deletion of VSphereMachine objects when the VM is missing or no longer exists?

srm09 avatar Feb 18 '22 23:02 srm09
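
For illustration, one possible direction, shown as a sketch with stand-in types and placeholder names rather than the actual CAPV code (assuming the usual finalizer pattern): during deletion, treat a VM that cannot be found as already gone and drop the finalizer, so the owning Machine can finish deleting instead of deadlocking.

    package main

    import (
        "errors"
        "fmt"
    )

    // errVMNotFound stands in for whatever "unable to find VM by BIOS UUID"
    // error the vCenter lookup surfaces; the name is illustrative.
    var errVMNotFound = errors.New("vm not found")

    type vsphereMachine struct {
        Finalizers []string
        BiosUUID   string
    }

    // findVM is a placeholder for the vCenter lookup by BIOS UUID.
    func findVM(uuid string) error {
        return errVMNotFound
    }

    // reconcileDelete sketches the proposed behavior: if the VM is already gone
    // from the infrastructure, treat deletion as complete and drop the
    // finalizer instead of recording a terminal failure.
    func reconcileDelete(m *vsphereMachine) error {
        if err := findVM(m.BiosUUID); err != nil {
            if errors.Is(err, errVMNotFound) {
                fmt.Println("VM not found in infra, treating as already deleted")
                m.Finalizers = nil // nothing left to clean up in vCenter
                return nil
            }
            return err
        }
        // ...otherwise power off and destroy the VM, then remove the finalizer...
        return nil
    }

    func main() {
        m := &vsphereMachine{
            Finalizers: []string{"example-vspheremachine-finalizer"}, // illustrative name
            BiosUUID:   "420c97cc-c09c-4684-45c1-ba607dac0b5a",
        }
        _ = reconcileDelete(m)
        fmt.Println("remaining finalizers:", m.Finalizers)
    }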

I'll try to figure it out. It would be interesting if we had something similar to MachineHealthCheck to reconcile this.

scdubey avatar Feb 18 '22 23:02 scdubey

Is there a way we could leverage MHC to get out of this state? Worth looking into that?

srm09 avatar Feb 23 '22 23:02 srm09
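
For reference, my understanding is that a MachineHealthCheck already considers a Machine with failureReason/failureMessage set to be unhealthy, so an MHC covering these workers might remediate them automatically. A sketch of such an object built with the Cluster API v1beta1 Go types; the names and the label selector are assumptions for this example and should match the labels your MachineDeployment actually applies:

    package main

    import (
        "fmt"
        "time"

        corev1 "k8s.io/api/core/v1"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
    )

    func main() {
        mhc := &clusterv1.MachineHealthCheck{
            ObjectMeta: metav1.ObjectMeta{
                Name:      "wkload-ale-md-0-mhc", // illustrative name
                Namespace: "default",
            },
            Spec: clusterv1.MachineHealthCheckSpec{
                ClusterName: "wkload-ale",
                Selector: metav1.LabelSelector{
                    MatchLabels: map[string]string{
                        "cluster.x-k8s.io/deployment-name": "wkload-ale-md-0",
                    },
                },
                // Node-based checks; Machines whose failureReason/failureMessage
                // is set should also be considered unhealthy and remediated.
                UnhealthyConditions: []clusterv1.UnhealthyCondition{
                    {Type: corev1.NodeReady, Status: corev1.ConditionFalse, Timeout: metav1.Duration{Duration: 5 * time.Minute}},
                    {Type: corev1.NodeReady, Status: corev1.ConditionUnknown, Timeout: metav1.Duration{Duration: 5 * time.Minute}},
                },
                NodeStartupTimeout: &metav1.Duration{Duration: 10 * time.Minute},
            },
        }
        fmt.Printf("%+v\n", mhc.Spec)
    }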

/unassign

scdubey avatar May 08 '22 23:05 scdubey

/assign @aartij17

srm09 avatar May 09 '22 00:05 srm09

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Oct 03 '22 22:10 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Nov 11 '22 22:11 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-triage-robot avatar Dec 19 '22 20:12 k8s-triage-robot

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Dec 19 '22 20:12 k8s-ci-robot