[BUG->RFE] Issues Encountered During Graceful Reboot of vSphere Windows Nodes
Rancher Server Setup
- Rancher version: n/a
- Installation option (Docker install/Helm Chart): n/a
- If Helm Chart, Kubernetes Cluster and version (RKE1, RKE2, k3s, EKS, etc): RKE2 with Windows Worker
- Proxy/Cert Details: n/a
Information about the Cluster
- Kubernetes version: reported v1.24.4, partially fixed on newer versions
- Cluster Type (Local/Downstream): Downstream, custom or vSphere node provisioned
User Information
- What is the role of the user logged in? (Admin/Cluster Owner/Cluster Member/Project Owner/Project Member/Custom)
- If custom, define the set of permissions: n/a
Describe the bug
When restarting Windows nodes in vSphere using the graceful 'Restart guest OS' button, the Rancher UI may report a failed plan and the node may appear unavailable. In some cases the node will recover, but the failed plan keeps the node stuck in a failed state in the Rancher UI. This issue does not occur when running Restart-Computer from inside the guest, or when restarting the node without using VMware Tools (vmtools).
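For reference, both reboot paths can be driven programmatically; the sketch below uses govmomi (the Go client for the vSphere API) with placeholder credentials and inventory paths, and is not taken from the Rancher or node driver code. The graceful path is the one that intermittently leaves the plan marked as failed.

```go
package vspherereboot

import (
	"context"
	"net/url"

	"github.com/vmware/govmomi"
	"github.com/vmware/govmomi/find"
)

// rebootWorker triggers either the graceful "Restart guest OS" path (VMware Tools
// asks Windows to restart) or a hard reset, which is the case that did not
// reproduce the failed plan. vcURL and vmPath are placeholders.
func rebootWorker(ctx context.Context, vcURL, vmPath string, graceful bool) error {
	u, err := url.Parse(vcURL) // e.g. "https://user:pass@vcenter.example/sdk"
	if err != nil {
		return err
	}

	c, err := govmomi.NewClient(ctx, u, true) // true = skip TLS verification (lab use only)
	if err != nil {
		return err
	}
	defer c.Logout(ctx)

	vm, err := find.NewFinder(c.Client, true).VirtualMachine(ctx, vmPath)
	if err != nil {
		return err
	}

	if graceful {
		// Same operation as the vSphere UI's "Restart guest OS" button.
		return vm.RebootGuest(ctx)
	}

	// Hard reset, equivalent to "Reset" in the vSphere UI.
	task, err := vm.Reset(ctx)
	if err != nil {
		return err
	}
	return task.Wait(ctx)
}
```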
To Reproduce
- Provision a cluster on vSphere with one Linux control plane + etcd node and one Windows worker node
- In the vSphere UI, reboot the Windows worker node using the 'Restart guest OS' option
- Occasionally, the Rancher UI will report a failure to apply a plan, and the node will be unavailable
Result: The node reports a failed plan and is not usable or responsive from the Rancher UI. In certain cases the node may recover and be operable, but Rancher will still mark it as 'failed' due to the initial plan failure.
Expected Result: The worker node comes back online after the graceful reboot and is usable.
Additional context: This issue was partially addressed by the work done for https://github.com/rancher/rancher/issues/39658; however, that solution ran into unexpected regressions and had to be reverted in the most recent versions of RKE2. A new solution needs to be found that does not impact the upgradability of the cluster.
SURE-6791
Moving this issue to the 2.8.0 backlog; the priority will be reevaluated based on Harrison's capacity.
High Level Test Plan:
- deploy a Windows cluster via the vSphere node driver, then update the cluster
- deploy Rancher 2.8.0 with a Windows downstream cluster -> upgrade to the latest Rancher version -> upgrade the Windows cluster
- restart the guest OS for the Windows node -> wait for the node to come back to an Active state -> observe that the node is gracefully rebooted through the vSphere UI rather than forced (see the readiness-check sketch below)
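A minimal sketch of the "wait for the node to come back" step, using client-go against the downstream cluster; the kubeconfig path, node name, and timeout are placeholders, and this only checks the Kubernetes Ready condition, not the Rancher machine/plan state that this bug affects:

```go
package main

import (
	"context"
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// waitForNodeReady polls the downstream cluster until the rebooted node reports
// Ready, or the timeout expires.
func waitForNodeReady(ctx context.Context, client kubernetes.Interface, nodeName string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		node, err := client.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
		if err == nil {
			for _, cond := range node.Status.Conditions {
				if cond.Type == corev1.NodeReady && cond.Status == corev1.ConditionTrue {
					return nil // kubelet restarted and re-registered after the reboot
				}
			}
		}
		time.Sleep(15 * time.Second) // the Windows kubelet can take a while to come back
	}
	return fmt.Errorf("node %s did not become Ready within %s", nodeName, timeout)
}

func main() {
	// Placeholder kubeconfig path and node name.
	cfg, err := clientcmd.BuildConfigFromFlags("", "/path/to/downstream-kubeconfig")
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	fmt.Println(waitForNodeReady(context.Background(), client, "windows-worker-1", 15*time.Minute))
}
```

Verifying Rancher's own view of the node (the machine plan status) would still need to go through the Rancher API or UI.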
potential other test areas:
- deploy a Windows cluster with gracefulShutdownTimeout set on the Windows pools, delete a node through the UI, and observe a graceful shutdown through vSphere (a sketch of the behavior this setting governs follows below)
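For clarity, this is roughly the behavior gracefulShutdownTimeout is meant to govern: ask VMware Tools for a guest shutdown, wait up to the timeout, then fall back to a hard power-off. The sketch below is written against govmomi as an illustration and is not Rancher's actual driver code:

```go
package vspherereboot

import (
	"context"
	"time"

	"github.com/vmware/govmomi/object"
	"github.com/vmware/govmomi/vim25/types"
)

// gracefulShutdown asks VMware Tools to shut the guest down, waits up to timeout
// for the VM to reach poweredOff, and only then falls back to a hard power-off.
func gracefulShutdown(ctx context.Context, vm *object.VirtualMachine, timeout time.Duration) error {
	if err := vm.ShutdownGuest(ctx); err != nil {
		return err
	}

	waitCtx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()

	// Blocks until the VM reports poweredOff or the timeout context expires.
	if err := vm.WaitForPowerState(waitCtx, types.VirtualMachinePowerStatePoweredOff); err == nil {
		return nil // the guest shut down gracefully within the timeout
	}

	// Timeout elapsed: force the power-off so deletion/reprovisioning can continue.
	task, err := vm.PowerOff(ctx)
	if err != nil {
		return err
	}
	return task.Wait(ctx)
}
```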
supporting automation:
- https://github.com/rancher/shepherd/pull/6 -> enables gracefulShutdownTimeout via automation, plus enhancements to vSphere automation in general
- tests for the above: https://github.com/rancher/rancher/pull/43775
- TODO: automate the reboot node option: https://github.com/rancher/qa-tasks/issues/1098
This issue may be resolved now that https://github.com/rancher/rke2/issues/2204 has been validated in RKE2. This should be retested once the RKE2 February patches are available.
Tested on v2.8.3-rc5; I was still able to reproduce this issue. It appears that when the node is rebooted, its DNS name in vSphere is reset. This likely has something to do with the node not being able to reconnect to Rancher, but I'm not 100% sure.
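One way to confirm the reset is to dump the guest identity VMware Tools publishes before and after the reboot; the helper below is a hypothetical diagnostic using govmomi's property collector, not part of any existing tooling:

```go
package vspherereboot

import (
	"context"
	"fmt"

	"github.com/vmware/govmomi/object"
	"github.com/vmware/govmomi/vim25/mo"
)

// printGuestIdentity dumps the hostname, IP, and tools status that VMware Tools
// reports for the VM; comparing the output before and after a reboot shows whether
// the DNS name really gets reset.
func printGuestIdentity(ctx context.Context, vm *object.VirtualMachine) error {
	var moVM mo.VirtualMachine
	// Fetch only the "guest" property subtree from the vSphere property collector.
	if err := vm.Properties(ctx, vm.Reference(), []string{"guest"}, &moVM); err != nil {
		return err
	}
	if moVM.Guest == nil {
		return fmt.Errorf("no guest info available (is VMware Tools running?)")
	}
	fmt.Printf("hostName=%q ipAddress=%q toolsStatus=%q\n",
		moVM.Guest.HostName, moVM.Guest.IpAddress, moVM.Guest.ToolsStatus)
	return nil
}
```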
Talked with @HarrisonWAffel; to properly address this we will essentially need to implement a new feature that allows for a configurable delay.