[BUG->RFE] Issues Encountered During Graceful Reboot of vSphere Windows Nodes
Rancher Server Setup
- Rancher version: n/a
- Installation option (Docker install/Helm Chart): n/a
- If Helm Chart, Kubernetes Cluster and version (RKE1, RKE2, k3s, EKS, etc): RKE2 with Windows Worker
- Proxy/Cert Details: n/a
Information about the Cluster
- Kubernetes version: reported v1.24.4, partially fixed on newer versions
- Cluster Type (Local/Downstream): Downstream, custom or vSphere node provisioned
User Information
- What is the role of the user logged in? (Admin/Cluster Owner/Cluster Member/Project Owner/Project Member/Custom)
- If custom, define the set of permissions: n/a
Describe the bug
When restarting Windows nodes in vSphere using the graceful 'Restart guest OS' button, the Rancher UI may report a failed plan and the node may appear unavailable. In some cases the node will recover, but the failed plan keeps the node stuck in a failed state in the Rancher UI. This issue does not occur when running Restart-Computer from inside the guest, or when restarting the node without using VMware Tools (vmtools).
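For reference, both reboot paths can be driven programmatically; the sketch below uses govmomi (the Go client for the vSphere API) with placeholder credentials and inventory paths, and is not taken from the Rancher or node driver code. The graceful path is the one that intermittently leaves the plan marked as failed.

```go
package vspherereboot

import (
	"context"
	"net/url"

	"github.com/vmware/govmomi"
	"github.com/vmware/govmomi/find"
)

// rebootWorker triggers either the graceful "Restart guest OS" path (VMware Tools
// asks Windows to restart) or a hard reset, which is the case that did not
// reproduce the failed plan. vcURL and vmPath are placeholders.
func rebootWorker(ctx context.Context, vcURL, vmPath string, graceful bool) error {
	u, err := url.Parse(vcURL) // e.g. "https://user:pass@vcenter.example/sdk"
	if err != nil {
		return err
	}

	c, err := govmomi.NewClient(ctx, u, true) // true = skip TLS verification (lab use only)
	if err != nil {
		return err
	}
	defer c.Logout(ctx)

	vm, err := find.NewFinder(c.Client, true).VirtualMachine(ctx, vmPath)
	if err != nil {
		return err
	}

	if graceful {
		// Same operation as the vSphere UI's "Restart guest OS" button.
		return vm.RebootGuest(ctx)
	}

	// Hard reset, equivalent to "Reset" in the vSphere UI.
	task, err := vm.Reset(ctx)
	if err != nil {
		return err
	}
	return task.Wait(ctx)
}
```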
To Reproduce
- Provision a cluster on vSphere with one Linux control plane + etcd node and one Windows worker node
- In the vSphere UI, reboot the Windows worker node using the 'Restart guest OS' option
- Occasionally, the Rancher UI will report a failure to apply a plan, and the node will be unavailable
Result: The node reports a failed plan and is not usable or responsive from the Rancher UI. In certain cases the node may recover and be operable, but Rancher will still mark it as 'failed' due to the initial plan failure.
Expected Result: The worker node comes back online after the graceful reboot and is usable.
Additional context: This issue was partially addressed by the work done for https://github.com/rancher/rancher/issues/39658; however, that solution ran into unexpected regressions and had to be reverted in the most recent versions of RKE2. A new solution needs to be found that does not impact the upgradability of the cluster.
SURE-6791
Moving this issue to the 2.8.0 backlog; the priority will be reevaluated based on Harrison's capacity.
High Level Test Plan:
- deploy a Windows cluster via the vSphere node driver, then update the cluster
- deploy Rancher 2.8.0 with a Windows downstream cluster -> upgrade to the latest Rancher version -> upgrade the Windows cluster
- restart the guest OS for the Windows node -> wait for the node to come back to an Active state -> observe that the node is gracefully rebooted through the vSphere UI rather than forced (see the readiness-check sketch below)
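A minimal sketch of the "wait for the node to come back" step, using client-go against the downstream cluster; the kubeconfig path, node name, and timeout are placeholders, and this only checks the Kubernetes Ready condition, not the Rancher machine/plan state that this bug affects:

```go
package main

import (
	"context"
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// waitForNodeReady polls the downstream cluster until the rebooted node reports
// Ready, or the timeout expires.
func waitForNodeReady(ctx context.Context, client kubernetes.Interface, nodeName string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		node, err := client.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
		if err == nil {
			for _, cond := range node.Status.Conditions {
				if cond.Type == corev1.NodeReady && cond.Status == corev1.ConditionTrue {
					return nil // kubelet restarted and re-registered after the reboot
				}
			}
		}
		time.Sleep(15 * time.Second) // the Windows kubelet can take a while to come back
	}
	return fmt.Errorf("node %s did not become Ready within %s", nodeName, timeout)
}

func main() {
	// Placeholder kubeconfig path and node name.
	cfg, err := clientcmd.BuildConfigFromFlags("", "/path/to/downstream-kubeconfig")
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	fmt.Println(waitForNodeReady(context.Background(), client, "windows-worker-1", 15*time.Minute))
}
```

Verifying Rancher's own view of the node (the machine plan status) would still need to go through the Rancher API or UI.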
potential other test areas:
- deploy a Windows cluster with gracefulShutdownTimeout set on the Windows pools, delete a node through the UI, and observe a graceful shutdown through vSphere (a sketch of the behavior this setting governs follows below)
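For clarity, this is roughly the behavior gracefulShutdownTimeout is meant to govern: ask VMware Tools for a guest shutdown, wait up to the timeout, then fall back to a hard power-off. The sketch below is written against govmomi as an illustration and is not Rancher's actual driver code:

```go
package vspherereboot

import (
	"context"
	"time"

	"github.com/vmware/govmomi/object"
	"github.com/vmware/govmomi/vim25/types"
)

// gracefulShutdown asks VMware Tools to shut the guest down, waits up to timeout
// for the VM to reach poweredOff, and only then falls back to a hard power-off.
func gracefulShutdown(ctx context.Context, vm *object.VirtualMachine, timeout time.Duration) error {
	if err := vm.ShutdownGuest(ctx); err != nil {
		return err
	}

	waitCtx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()

	// Blocks until the VM reports poweredOff or the timeout context expires.
	if err := vm.WaitForPowerState(waitCtx, types.VirtualMachinePowerStatePoweredOff); err == nil {
		return nil // the guest shut down gracefully within the timeout
	}

	// Timeout elapsed: force the power-off so deletion/reprovisioning can continue.
	task, err := vm.PowerOff(ctx)
	if err != nil {
		return err
	}
	return task.Wait(ctx)
}
```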
supporting automation:
- https://github.com/rancher/shepherd/pull/6 -> enables gracefulShutdownTimeout via automation, plus enhancements to vSphere automation in general
- tests for the above: https://github.com/rancher/rancher/pull/43775
- TODO: automate the reboot node option: https://github.com/rancher/qa-tasks/issues/1098
This issue may be resolved now that https://github.com/rancher/rke2/issues/2204 has been validated in RKE2. This should be retested once the RKE2 February patches are available.
Tested on v2.8.3-rc5; I was still able to reproduce this issue. It appears that when the node is rebooted, its DNS name in vSphere is reset. This likely has something to do with the node not being able to reconnect to Rancher, but I'm not 100% sure.
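One way to confirm the reset is to dump the guest identity VMware Tools publishes before and after the reboot; the helper below is a hypothetical diagnostic using govmomi's property collector, not part of any existing tooling:

```go
package vspherereboot

import (
	"context"
	"fmt"

	"github.com/vmware/govmomi/object"
	"github.com/vmware/govmomi/vim25/mo"
)

// printGuestIdentity dumps the hostname, IP, and tools status that VMware Tools
// reports for the VM; comparing the output before and after a reboot shows whether
// the DNS name really gets reset.
func printGuestIdentity(ctx context.Context, vm *object.VirtualMachine) error {
	var moVM mo.VirtualMachine
	// Fetch only the "guest" property subtree from the vSphere property collector.
	if err := vm.Properties(ctx, vm.Reference(), []string{"guest"}, &moVM); err != nil {
		return err
	}
	if moVM.Guest == nil {
		return fmt.Errorf("no guest info available (is VMware Tools running?)")
	}
	fmt.Printf("hostName=%q ipAddress=%q toolsStatus=%q\n",
		moVM.Guest.HostName, moVM.Guest.IpAddress, moVM.Guest.ToolsStatus)
	return nil
}
```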
Talked with @HarrisonWAffel; to properly address this we will essentially need to implement a new feature that allows for a configurable delay.