
CAPV should be able to reconcile when `The vm was removed from infra`

Open · joshrosso opened this issue 3 years ago · 9 comments

/kind feature

Description

When a vspheremachine object no longer has a corresponding VM in the infrastructure, it gets into a state where capv-manager reports `The vm was removed from infra`. As far as I can tell, this creates a deadlock: attempting to delete the Machine object leaves it stuck in a Deleting state.

For example, consider a Machine : wkload-ale-md-0-9747bd4d5-lrfbs

Machine : wkload-ale-md-0-9747bd4d5-lrfbs has an infrastructure ref to vspheremachine : wkload-ale-worker-zw8m5.

Observations:

  • Machine : wkload-ale-md-0-9747bd4d5-lrfbs is stuck in a Deleting state.
  • vspheremachine: wkload-ale-worker-zw8m5 has a failure message.
    Failure Message:         Unable to find VM by BIOS UUID 420c97cc-c09c-4684-45c1-ba607dac0b5a. The vm was removed from infra
    Failure Reason:          UpdateError
    
  • capv-manager reports that it is skipping reconciliation (a sketch of this check follows this list):
    I0202 14:16:46.217370       1 vspheremachine_controller.go:313] capv-controller-manager/vspheremachine-controller/default/wkload-ale-worker-zw8m5 "msg"="Error state detected, skipping reconciliation"  
    
  • After 24 hours, no progress is made.
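
For context, the controller appears to treat a populated failureReason/failureMessage as a terminal state and short-circuits reconciliation, which is what makes the condition sticky. A minimal, self-contained sketch of that behavior (stand-in types, not the actual CAPV source):

    package main

    import "fmt"

    // Stand-in types for illustration only; the real failure fields live on the
    // VSphereMachine status in the provider's API package.
    type VSphereMachineStatus struct {
        FailureReason  *string
        FailureMessage *string
    }

    type VSphereMachine struct {
        Status VSphereMachineStatus
    }

    // reconcile mirrors the short-circuit that produces the
    // "Error state detected, skipping reconciliation" log line: once either
    // failure field is set, the object is treated as terminal and is never
    // reconciled again.
    func reconcile(m *VSphereMachine) {
        if m.Status.FailureReason != nil || m.Status.FailureMessage != nil {
            fmt.Println("Error state detected, skipping reconciliation")
            return
        }
        // ...normal VM reconciliation would continue here...
    }

    func main() {
        reason := "UpdateError"
        msg := "Unable to find VM by BIOS UUID 420c97cc-c09c-4684-45c1-ba607dac0b5a. The vm was removed from infra"
        reconcile(&VSphereMachine{Status: VSphereMachineStatus{
            FailureReason:  &reason,
            FailureMessage: &msg,
        }})
    }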

Workaround

  1. Delete the vspheremachine object (the machine object will remain stuck!)

    kubectl delete vspheremachine wkload-ale-worker-zw8m5
    
  2. Remove the finalizer from the corresponding machine

    kubectl patch machine/wkload-ale-md-0-9747bd4d5-lrfbs -p '{"metadata":{"finalizers":[]}}' --type=merge
    
  3. Wait. In ~1 minute, the machine should finally be removed.

Anything else you would like to add:

I was helping a user in the #tanzu-community-edition channel on Kubernetes Slack. That project uses cluster-api-provider-vsphere under the hood.

cc @alescuderi

Environment:

  • Cluster-api-provider-vsphere version:
  • Kubernetes version: (use kubectl version):
  • OS (e.g. from /etc/os-release):

joshrosso avatar Feb 02 '22 15:02 joshrosso

@joshrosso I might be missing the historical context, but in the past we have never allowed direct modification of the underlying VM in vCenter. This is a special case of VM modification in which the VM might have been deleted or moved around.

From the user's point of view, if the VM is gone, there is no point in keeping the objects around. We could replace it with a new VM, delete the CAPV object, or perform some other operation. Let me think about what the resolution should be.

cc: @yastij @timmycarr looking for inputs here.

srm09 avatar Feb 03 '22 20:02 srm09

/kind api-change

srm09 avatar Feb 03 '22 20:02 srm09

@yastij @timmycarr what would be a better resolution to handle this cleanup? Should we create new VMs, or let the VMs be deleted on their own?

scdubey avatar Feb 18 '22 20:02 scdubey

/assign @scdubey
Can you think of a way we could reconcile deletion of VSphereMachine objects when the VM is missing or no longer exists?

srm09 avatar Feb 18 '22 23:02 srm09
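
For illustration, one possible direction, shown as a sketch with stand-in types and placeholder names rather than the actual CAPV code (assuming the usual finalizer pattern): during deletion, treat a VM that cannot be found as already gone and drop the finalizer, so the owning Machine can finish deleting instead of deadlocking.

    package main

    import (
        "errors"
        "fmt"
    )

    // errVMNotFound stands in for whatever "unable to find VM by BIOS UUID"
    // error the vCenter lookup surfaces; the name is illustrative.
    var errVMNotFound = errors.New("vm not found")

    type vsphereMachine struct {
        Finalizers []string
        BiosUUID   string
    }

    // findVM is a placeholder for the vCenter lookup by BIOS UUID.
    func findVM(uuid string) error {
        return errVMNotFound
    }

    // reconcileDelete sketches the proposed behavior: if the VM is already gone
    // from the infrastructure, treat deletion as complete and drop the
    // finalizer instead of recording a terminal failure.
    func reconcileDelete(m *vsphereMachine) error {
        if err := findVM(m.BiosUUID); err != nil {
            if errors.Is(err, errVMNotFound) {
                fmt.Println("VM not found in infra, treating as already deleted")
                m.Finalizers = nil // nothing left to clean up in vCenter
                return nil
            }
            return err
        }
        // ...otherwise power off and destroy the VM, then remove the finalizer...
        return nil
    }

    func main() {
        m := &vsphereMachine{
            Finalizers: []string{"example-vspheremachine-finalizer"}, // illustrative name
            BiosUUID:   "420c97cc-c09c-4684-45c1-ba607dac0b5a",
        }
        _ = reconcileDelete(m)
        fmt.Println("remaining finalizers:", m.Finalizers)
    }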

I'll try to figure it out. It would be interesting if we had something similar to MachineHealthCheck to reconcile this.

scdubey avatar Feb 18 '22 23:02 scdubey

Is there a way we could leverage MHC to get out of this state? Worth looking into that?

srm09 avatar Feb 23 '22 23:02 srm09
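
For reference, my understanding is that a MachineHealthCheck already considers a Machine with failureReason/failureMessage set to be unhealthy, so an MHC covering these workers might remediate them automatically. A sketch of such an object built with the Cluster API v1beta1 Go types; the names and the label selector are assumptions for this example and should match the labels your MachineDeployment actually applies:

    package main

    import (
        "fmt"
        "time"

        corev1 "k8s.io/api/core/v1"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
    )

    func main() {
        mhc := &clusterv1.MachineHealthCheck{
            ObjectMeta: metav1.ObjectMeta{
                Name:      "wkload-ale-md-0-mhc", // illustrative name
                Namespace: "default",
            },
            Spec: clusterv1.MachineHealthCheckSpec{
                ClusterName: "wkload-ale",
                Selector: metav1.LabelSelector{
                    MatchLabels: map[string]string{
                        "cluster.x-k8s.io/deployment-name": "wkload-ale-md-0",
                    },
                },
                // Node-based checks; Machines whose failureReason/failureMessage
                // is set should also be considered unhealthy and remediated.
                UnhealthyConditions: []clusterv1.UnhealthyCondition{
                    {Type: corev1.NodeReady, Status: corev1.ConditionFalse, Timeout: metav1.Duration{Duration: 5 * time.Minute}},
                    {Type: corev1.NodeReady, Status: corev1.ConditionUnknown, Timeout: metav1.Duration{Duration: 5 * time.Minute}},
                },
                NodeStartupTimeout: &metav1.Duration{Duration: 10 * time.Minute},
            },
        }
        fmt.Printf("%+v\n", mhc.Spec)
    }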

/unassign

scdubey avatar May 08 '22 23:05 scdubey

/assign @aartij17

srm09 avatar May 09 '22 00:05 srm09

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Oct 03 '22 22:10 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Nov 11 '22 22:11 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-triage-robot avatar Dec 19 '22 20:12 k8s-triage-robot

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Dec 19 '22 20:12 k8s-ci-robot