
[BUG->RFE] Issues Encountered During Graceful Reboot of vSphere Windows Nodes

Open HarrisonWAffel opened this issue 2 years ago • 7 comments

Rancher Server Setup

  • Rancher version: n/a
  • Installation option (Docker install/Helm Chart): n/a
    • If Helm Chart, Kubernetes Cluster and version (RKE1, RKE2, k3s, EKS, etc): RKE2 with Windows Worker
  • Proxy/Cert Details:

Information about the Cluster

  • Kubernetes version: reported on v1.24.4, partially fixed on newer versions
  • Cluster Type (Local/Downstream): Downstream, custom or vSphere node provisioned

User Information

  • What is the role of the user logged in? (Admin/Cluster Owner/Cluster Member/Project Owner/Project Member/Custom)
    • If custom, define the set of permissions: n/a

Describe the bug

When restarting Windows nodes in vSphere using the graceful 'Restart guest OS' button, the Rancher UI may report a failed plan and the node may appear unavailable. In some cases the node recovers, but the failed plan keeps the node stuck in a failure state in the Rancher UI. This issue does not occur when running restart-computer, or when restarting the node without using vmtools.

To Reproduce

  • Provision a cluster on vSphere with 1 Linux cp+etcd node and 1 Windows worker node
  • In the vSphere UI, reboot the Windows worker node using the 'Restart guest OS' option
  • Occasionally, the Rancher UI will report a failure to apply a plan, and the node will be unavailable

Result

The node reports a failed plan and is not usable or responsive from the Rancher UI. In certain cases the node may recover and be operable, but Rancher will still mark it as 'failed' due to the initial plan failure.

Expected Result

The worker node comes back online after the graceful reboot and is usable.

Additional context

This issue was partially addressed through the work done for https://github.com/rancher/rancher/issues/39658; however, the solution ran into unexpected regressions and had to be reverted in the most recent versions of RKE2. A new solution is needed that does not impact the upgradability of the cluster.

SURE-6791

HarrisonWAffel avatar Aug 17 '23 16:08 HarrisonWAffel

moving this issue to the 2.8.0 backlog and will reevaluate the priority based on Harrison's capacity.

Sahota1225 avatar Sep 27 '23 18:09 Sahota1225

High Level Test Plan:

  • deploy windows cluster via vsphere node driver, then update the cluster
  • deploy rancher 2.8.0 with windows downstream cluster -> upgrade to latest rancher version -> upgrade windows cluster
  • restart guest OS for windows node -> wait for node to come back to active state
    • observe node is gracefully rebooted through vsphere UI rather than forced
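
The "wait for node to come back to active state" step could be automated with a simple poll-until-timeout helper. This is only a rough sketch, not the shepherd implementation: `wait_for_node_active`, the `get_state` callback, the `"active"` state string, and the default timeout and interval are all hypothetical names and values chosen for illustration. A real test would plug in whatever call returns the node state from Rancher.

```python
import time
from typing import Callable


def wait_for_node_active(
    get_state: Callable[[], str],
    timeout_s: float = 600.0,
    interval_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll get_state() until it returns "active" or the timeout elapses.

    get_state is a hypothetical callback; the "active" string is an
    assumption about how the node state is reported.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        # Check the state first so an already-active node returns immediately.
        if get_state() == "active":
            return True
        # Give up once the deadline has passed.
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)
```

Returning a boolean rather than raising lets the caller decide whether a node that never recovers is a test failure or just something to log, which matters here because the issue only reproduces occasionally.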

potential other test areas:

  • deploy a windows cluster with gracefulShutdownTimeout set on windows pools, delete node through UI and observe graceful shutdown through vsphere

supporting automation:

  • https://github.com/rancher/shepherd/pull/6 -> enable gracefulShutdownTimeout via automation, enhancements to vsphere automation in general
  • tests for ^ https://github.com/rancher/rancher/pull/43775
  • TODO: automate reboot node option : https://github.com/rancher/qa-tasks/issues/1098

slickwarren avatar Jan 09 '24 21:01 slickwarren

This issue may be resolved now that https://github.com/rancher/rke2/issues/2204 has been validated in RKE2. This should be retested once the RKE2 February patches are available.

HarrisonWAffel avatar Feb 22 '24 15:02 HarrisonWAffel

Tested on v2.8.3-rc5, and I was still able to reproduce this issue. It appears that when the node is rebooted, the DNS name in vSphere is reset. This likely has something to do with the node not being able to reconnect to Rancher, but I'm not 100% sure.

[Screenshots: "Screenshot from 2024-03-21 10-24-20", "Screenshot from 2024-03-21 10-24-10"]

slickwarren avatar Mar 21 '24 17:03 slickwarren

Talked with @HarrisonWAffel. To properly address this, we will essentially need to implement a new feature that allows for a configurable delay.

snasovich avatar May 02 '24 19:05 snasovich