[BUG] Multiple server nodes pre-drain during an RKE2 upgrade
Rancher Server Setup
- Rancher version: v2.6.9-rc2
- Installation option (Docker install/Helm Chart):
- If Helm Chart, Kubernetes Cluster and version (RKE1, RKE2, k3s, EKS, etc): v1.22.12+rke2r1 (Upgrade to v1.23.12+rke2r1)
- Proxy/Cert Details:
Information about the Cluster
- Kubernetes version: v1.22.12+rke2r1 (Upgrade to v1.23.12+rke2r1)
- Cluster Type (Local/Downstream): local
- If downstream, what type of cluster? (Custom/Imported or specify provider for Hosted/Infrastructure Provider):
User Information
- What is the role of the user logged in? (Admin/Cluster Owner/Cluster Member/Project Owner/Project Member/Custom)
- If custom, define the set of permissions:
Describe the bug
To Reproduce
- We trigger an RKE2 upgrade in Harvester (with pre-drain/post-drain hooks) on a 4-node cluster (3 servers, 1 worker):
$ kubectl edit clusters.provisioning.cattle.io local -n fleet-local
and edit the local cluster with the following (a quick verification command follows the YAML):
spec:
  kubernetesVersion: v1.23.12+rke2r1
  localClusterAuthEndpoint: {}
  rkeConfig:
    chartValues: null
    machineGlobalConfig: null
    provisionGeneration: 1
    upgradeStrategy:
      controlPlaneConcurrency: "1"
      controlPlaneDrainOptions:
        deleteEmptyDirData: true
        enabled: true
        force: true
        ignoreDaemonSets: true
        postDrainHooks:
        - annotation: harvesterhci.io/post-hook
        preDrainHooks:
        - annotation: harvesterhci.io/pre-hook
        timeout: 0
      workerConcurrency: "1"
      workerDrainOptions:
        deleteEmptyDirData: true
        enabled: true
        force: true
        ignoreDaemonSets: true
        postDrainHooks:
        - annotation: harvesterhci.io/post-hook
        preDrainHooks:
        - annotation: harvesterhci.io/pre-hook
        timeout: 0
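To double-check that the edit was persisted, the strategy fields can be read back; a small convenience sketch using the field paths from the spec above:

$ kubectl -n fleet-local get clusters.provisioning.cattle.io local \
    -o jsonpath='{.spec.rkeConfig.upgradeStrategy.controlPlaneConcurrency}{"\n"}{.spec.rkeConfig.upgradeStrategy.workerConcurrency}{"\n"}'

Both values should come back as 1 before the upgrade proceeds.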
Result
We observe that after the first node is upgraded, there is a high chance that both of the remaining server nodes have scheduling disabled. We also see that Rancher added the pre-drain hook annotation to both plan secrets, which indicates a pre-drain signal.
$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
node1 Ready control-plane,etcd,master 21d v1.23.12+rke2r1 <-- upgraded
node2 Ready,SchedulingDisabled control-plane,etcd,master 21d v1.23.12+rke2r1 <--
node3 Ready <none> 21d v1.22.12+rke2r1
node4 Ready,SchedulingDisabled control-plane,etcd,master 21d v1.22.12+rke2r1 <--
Expected Result
Only a single server node should have scheduling disabled at a time.
Screenshots
Additional context
Some observations:
- Node2's and node4's machine plan secrets both have the rke.cattle.io/pre-drain annotation set:
$ kubectl get machine -A
NAMESPACE NAME CLUSTER NODENAME PROVIDERID PHASE AGE VERSION
fleet-local custom-24d57cc6f506 local node1 rke2://node1 Running 21d
fleet-local custom-3865d0441591 local node2 rke2://node2 Running 21d
fleet-local custom-3994bff0f3f3 local node3 rke2://node3 Running 21d
fleet-local custom-fda201f64657 local node4 rke2://node4 Running 21d
$ kubectl get secret custom-3865d0441591-machine-plan -n fleet-local -o json | jq '.metadata.annotations."rke.cattle.io/pre-drain"'
"{\"IgnoreErrors\":false,\"deleteEmptyDirData\":true,\"disableEviction\":false,\"enabled\":true,\"force\":true,\"gracePeriod\":0,\"ignoreDaemonSets\":true,\"postDrainHooks\":[{\"annotation\":\"harvesterhci.io/post-hook\"}],\"preDrainHooks\":[{\"annotation\":\"harvesterhci.io/pre-hook\"}],\"skipWaitForDeleteTimeoutSeconds\":0,\"timeout\":0}"
$ kubectl get secret custom-fda201f64657-machine-plan -n fleet-local -o json | jq '.metadata.annotations."rke.cattle.io/pre-drain"'
"{\"IgnoreErrors\":false,\"deleteEmptyDirData\":true,\"disableEviction\":false,\"enabled\":true,\"force\":true,\"gracePeriod\":0,\"ignoreDaemonSets\":true,\"postDrainHooks\":[{\"annotation\":\"harvesterhci.io/post-hook\"}],\"preDrainHooks\":[{\"annotation\":\"harvesterhci.io/pre-hook\"}],\"skipWaitForDeleteTimeoutSeconds\":0,\"timeout\":0}"
- A similar issue was spotted and fixed a while ago: https://github.com/rancher/rancher/issues/35999, but in that issue the nodes in question were one server and one worker, not all servers.
- rancher_pod_logs.zip
@bk201 I'm struggling to reproduce this issue.
Do you think you would be able to provide an environment where this happens?
@Oats87 I'll try to create one.
It seems there is a subtle bug here that can occasionally cause this. Unfortunately, it is not easy to reproduce, and I have not been able to reproduce it.
Since this isn't reliably reproducible and has been occurring in previous versions as well, the release-blocker label has been removed.
What Harvester sets when upgrading is:
toUpdate.Spec.RKEConfig.ProvisionGeneration += 1
toUpdate.Spec.RKEConfig.UpgradeStrategy.ControlPlaneConcurrency = "1"
toUpdate.Spec.RKEConfig.UpgradeStrategy.WorkerConcurrency = "1"
toUpdate.Spec.RKEConfig.UpgradeStrategy.ControlPlaneDrainOptions.DeleteEmptyDirData = rke2DrainNodes
toUpdate.Spec.RKEConfig.UpgradeStrategy.ControlPlaneDrainOptions.Enabled = rke2DrainNodes
toUpdate.Spec.RKEConfig.UpgradeStrategy.ControlPlaneDrainOptions.Force = rke2DrainNodes
toUpdate.Spec.RKEConfig.UpgradeStrategy.ControlPlaneDrainOptions.IgnoreDaemonSets = &rke2DrainNodes
toUpdate.Spec.RKEConfig.UpgradeStrategy.WorkerDrainOptions.DeleteEmptyDirData = rke2DrainNodes
toUpdate.Spec.RKEConfig.UpgradeStrategy.WorkerDrainOptions.Enabled = rke2DrainNodes
toUpdate.Spec.RKEConfig.UpgradeStrategy.WorkerDrainOptions.Force = rke2DrainNodes
toUpdate.Spec.RKEConfig.UpgradeStrategy.WorkerDrainOptions.IgnoreDaemonSets = &rke2DrainNodes
According to this upgrade strategy, at most one control-plane node should be upgraded at a time, and the same applies to workers.
But the node status shows two control-plane nodes being upgraded at the same time, which means Rancher's control of the upgrade sequence is broken.
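One way to spot the violation from the node side is to list which control-plane nodes are cordoned; the count should never exceed controlPlaneConcurrency ("1" here). A rough sketch relying only on the standard node-role label and the unschedulable field:

$ kubectl get nodes -l node-role.kubernetes.io/control-plane \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.unschedulable}{"\n"}{end}'

More than one node printing true means two or more control-plane nodes are being drained/upgraded concurrently.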
https://github.com/harvester/harvester/issues/2907
node-0:~ # k get no
NAME STATUS ROLES AGE VERSION
node-0 Ready control-plane,etcd,master 5d11h v1.24.6+rke2r1
node-1 Ready,SchedulingDisabled control-plane,etcd,master 5d10h v1.24.6+rke2r1
node-2 Ready,SchedulingDisabled control-plane,etcd,master 5d10h v1.22.12+rke2r1
node-3 Ready <none> 5d10h v1.22.12+rke2r1
The current Harvester fix works as a workaround, but it may raise another question: Rancher starts the second control-plane node's upgrade earlier than expected while Harvester suspends it, so could this eventually cause Rancher to report a timeout for that node? @Oats87 is that possible? Thanks.
cc @bk201 @starbops (https://github.com/starbops)
From the support bundle attached in https://github.com/harvester/harvester/issues/2907#issue-1404168567, the log file logs/cattle-system/rancher-59cd8bb8f7-hmmbq/rancher.log shows multiple nodes draining at the same time.
The first planner log line is:
2022-10-11T11:23:09.471682431Z 2022/10/11 11:23:09 [INFO] [planner] rkecluster fleet-local/local: waiting: draining etcd node(s) custom-929d403d1670,custom-c05d0d11190c
The two machines, custom-929d403d1670 and custom-c05d0d11190c, are both control-plane nodes, and they are draining at the same time.
2022-10-11T11:23:09.471682431Z 2022/10/11 11:23:09 [INFO] [planner] rkecluster fleet-local/local: waiting: draining etcd node(s) custom-929d403d1670,custom-c05d0d11190c
2022-10-11T11:23:09.474096731Z 2022/10/11 11:23:09 [INFO] [planner] rkecluster fleet-local/local: waiting: draining etcd node(s) custom-929d403d1670,custom-c05d0d11190c
2022-10-11T11:23:10.762326566Z 2022/10/11 11:23:10 [ERROR] Failed to read API for groups map[autoscaling/v2:the server could not find the requested resource flowcontrol.apiserver.k8s.io/v1beta2:the server could not find the requested resource]
2022-10-11T11:23:14.327719194Z 2022/10/11 11:23:14 [INFO] [planner] rkecluster fleet-local/local: waiting: draining etcd node(s) custom-929d403d1670,custom-c05d0d11190c
2022-10-11T11:23:14.387075520Z 2022/10/11 11:23:14 [INFO] Watching metadata for autoscaling/v2beta1, Kind=HorizontalPodAutoscaler
2022-10-11T11:23:14.387115743Z 2022/10/11 11:23:14 [INFO] Stopping metadata watch on autoscaling/v1, Kind=HorizontalPodAutoscaler
2022-10-11T11:23:15.086488451Z 2022/10/11 11:23:15 [INFO] [planner] rkecluster fleet-local/local: waiting: draining etcd node(s) custom-929d403d1670,custom-c05d0d11190c
2022-10-11T11:23:15.086524671Z 2022/10/11 11:23:15 [INFO] [planner] rkecluster fleet-local/local: waiting: draining etcd node(s) custom-929d403d1670,custom-c05d0d11190c
2022-10-11T11:23:16.658870231Z 2022/10/11 11:23:16 [ERROR] Failed to handle tunnel request from remote address 10.52.1.49:53496: response 401: failed authentication
2022-10-11T11:23:19.447217540Z 2022/10/11 11:23:19 [INFO] [planner] rkecluster fleet-local/local: waiting: draining etcd node(s) custom-929d403d1670,custom-c05d0d11190c
2022-10-11T11:23:19.545281217Z 2022/10/11 11:23:19 [INFO] [planner] rkecluster fleet-local/local: waiting: draining etcd node(s) custom-929d403d1670,custom-c05d0d11190c
2022-10-11T11:23:21.681837805Z 2022/10/11 11:23:21 [ERROR] Failed to handle tunnel request from remote address 10.52.1.49:46494: response 401: failed authentication
2022-10-11T11:23:26.686803004Z 2022/10/11 11:23:26 [ERROR] Failed to handle tunnel request from remote address 10.52.1.49:46506: response 401: failed authentication
2022-10-11T11:23:29.436589166Z 2022/10/11 11:23:29 [INFO] [planner] rkecluster fleet-local/local: waiting: draining etcd node(s) custom-929d403d1670,custom-c05d0d11190c
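For anyone digging through the same support bundle, the concurrent drain is easy to surface by filtering the planner lines in that log file; for example:

$ grep 'draining etcd node(s)' logs/cattle-system/rancher-59cd8bb8f7-hmmbq/rancher.log | tail -n 5

Two machine names after "node(s)" on a single line is exactly the symptom described here.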
node-0:~ # k -n fleet-local get machines
NAME CLUSTER NODENAME PROVIDERID PHASE AGE VERSION
custom-2d94d5d682dc local node-3 rke2://node-3 Running 5d13h // worker
custom-7c1afab6e79d local node-0 rke2://node-0 Running 5d14h
custom-929d403d1670 local node-1 rke2://node-1 Running 5d14h // control-plane
custom-c05d0d11190c local node-2 rke2://node-2 Running 5d13h // control-plane
Today, I did another round of Harvester upgrade on a 4-node cluster and tried my best to collect all the rancher pods' logs with a simple script while upgrading:
Before rancher upgrade:
- rancher-7fd549bcc4-5twfg.txt
- rancher-7fd549bcc4-j8m7r.txt
- rancher-7fd549bcc4-dzdw8.txt
After rancher upgrade:
- rancher-65f8899dfb-ksnkx.txt
- rancher-65f8899dfb-58k7p.txt
- rancher-65f8899dfb-87k5t.txt
- rancher-65f8899dfb-26b6s.txt
- rancher-65f8899dfb-nf7p4.txt
- rancher-65f8899dfb-zrvsk.txt
- rancher-65f8899dfb-qkvd6.txt
- rancher-65f8899dfb-4mxd9.txt
In the middle of the upgrade, there was indeed a multi-node SchedulingDisabled situation (node-1 & node-2) after the first node (node-0) was upgraded and rebooted. But we had a workaround code snippet deployed in the upgrade controller, so the whole upgrade did not get stuck forever; it eventually went through to the end.
Here is some information you can cross-reference with the logs:
$ k -n fleet-local get machines
NAME CLUSTER NODENAME PROVIDERID PHASE AGE VERSION
custom-1b287700d314 local node-2 rke2://node-2 Running 25h // control-plane
custom-57aefc97a78e local node-3 rke2://node-3 Running 25h // worker
custom-ad79796f3d2a local node-1 rke2://node-1 Running 25h // control-plane
custom-cd36cfbeabf7 local node-0 rke2://node-0 Running 25h // control-plane (bootstrap node)
- Rancher upgrade from v2.6.4 to v2.6.9-rc5
- RKE2 upgrade from v1.22.12+rke2r1 to v1.24.7+rke2r1
cc @w13915984028
According to the source code (https://github.com/rancher/rancher/blob/release/v2.7/pkg/provisioningv2/rke2/planner/planner.go#L352):
err = p.reconcile(controlPlane, clusterSecretTokens, plan, true, etcdTier, isEtcd, isInitNodeOrDeleting, "1", joinServer, controlPlane.Spec.UpgradeStrategy.ControlPlaneDrainOptions)
...
err = p.reconcile(controlPlane, clusterSecretTokens, plan, true, controlPlaneTier, isControlPlane, isInitNodeOrDeleting, controlPlane.Spec.UpgradeStrategy.ControlPlaneConcurrency, joinServer, controlPlane.Spec.UpgradeStrategy.ControlPlaneDrainOptions)
etcdTier and controlPlaneTier are each reconciled with a concurrency of 1, but the two tiers may share the same nodes (e.g. three management nodes carrying both roles), which breaks the overall policy of ControlPlaneConcurrency = "1".
It could be that after the init node is upgraded, the planner upgrades another two nodes in parallel; sometimes this succeeds and sometimes it does not.
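On a Harvester cluster the overlap is easy to confirm, because every management node carries both roles. A quick check (it only looks for the presence of both role labels, not what the planner actually selects):

$ kubectl get nodes -l 'node-role.kubernetes.io/etcd,node-role.kubernetes.io/control-plane'

Every node returned here belongs to both the etcd tier and the control-plane tier iterated by the two reconcile calls above.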
@starbops Your last test log shows that.
@Oats87 Are the above comments helpful, or do you still need a live environment that reproduces this issue? Thanks!
@bk201 I've been working to try and reproduce this but I have not been able to do so. Have you folks found an accurate reproducer for this?
@Oats87 We'll try to create one and get back to you. Thanks!
Hi @Oats87, I successfully reproduced the issue on a 3-node Harvester cluster in our environment, though it's not always reproducible. I left the environment intact in case you are interested in looking into it.

For simplicity, and to avoid the lengthy upgrade process, I didn't trigger the normal Harvester upgrade flow. Instead, I did the following (upgrading only RKE2):
- Prepare a v1.0.3 Harvester cluster (RKE2 version is v1.22.12+rke2r1, Rancher version is v2.6.4-harvester3)
- Upgrade Rancher to v2.6.9 with the following script
#!/usr/bin/env sh
set -ex

trap cleanup EXIT
cleanup() {
  if [ -n "$TEMP_DIR" ]; then
    \rm -vrf "$TEMP_DIR"
  fi
}

RANCHER_VERSION=${1:-v2.6.9}
TEMP_DIR=$(mktemp -d -p /tmp)

# Extract the Rancher chart from the system-agent installer image, then upgrade
# the existing Helm release in place, keeping all current values.
wharfie rancher/system-agent-installer-rancher:"$RANCHER_VERSION" "$TEMP_DIR"
pushd "$TEMP_DIR"
helm upgrade rancher ./rancher-"${RANCHER_VERSION#v}".tgz --reuse-values --set rancherImageTag="$RANCHER_VERSION" --namespace cattle-system --wait
popd
kubectl -n cattle-system rollout status deploy rancher
- Simulate the Harvester upgrade by patching clusters.provisioning.cattle.io with the command:
$ kubectl -n fleet-local patch clusters.provisioning.cattle.io local --type merge --patch-file ./upgrade-patch.yaml
The patch is the following:
spec:
  kubernetesVersion: v1.24.7+rke2r1
  rkeConfig:
    provisionGeneration: 1
    upgradeStrategy:
      controlPlaneConcurrency: "1"
      workerConcurrency: "1"
      controlPlaneDrainOptions:
        deleteEmptyDirData: true
        enabled: true
        force: true
        ignoreDaemonSets: true
        postDrainHooks:
        - annotation: "harvesterhci.io/post-hook"
        preDrainHooks:
        - annotation: "harvesterhci.io/pre-hook"
      workerDrainOptions:
        deleteEmptyDirData: true
        enabled: true
        force: true
        ignoreDaemonSets: true
        postDrainHooks:
        - annotation: "harvesterhci.io/post-hook"
        preDrainHooks:
        - annotation: "harvesterhci.io/pre-hook"
- The upgrade starts on the first node, harvester-node-0. The machine plan secret custom-7bb31dfaa3bb-machine-plan gets the rke.cattle.io/pre-drain annotation.
- Manually annotate custom-7bb31dfaa3bb-machine-plan with harvesterhci.io/pre-hook, just like the normal upgrade flow of Harvester does (see the command sketch after this list).
- The first node, harvester-node-0, starts to drain its pods.
- After the drain is done, custom-7bb31dfaa3bb-machine-plan is annotated with rke.cattle.io/post-drain.
- Manually annotate custom-7bb31dfaa3bb-machine-plan with harvesterhci.io/post-hook, just like the normal upgrade flow of Harvester does.
- The first node's upgrade is done.
- The upgrade starts on the second node, harvester-node-2. The machine plan secret custom-bb2ddb6fb772-machine-plan gets the rke.cattle.io/pre-drain annotation.
- Manually annotate custom-bb2ddb6fb772-machine-plan with harvesterhci.io/pre-hook.
- The second node, harvester-node-2, starts to drain its pods.
- The second node's drain is done; custom-bb2ddb6fb772-machine-plan is annotated with rke.cattle.io/post-drain.
- Somehow, the third node, harvester-node-1, is cordoned off and also has rke.cattle.io/pre-drain annotated.
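For reference, the manual hook acknowledgement in the steps above was done roughly as follows. This is only a sketch: it assumes (please verify against your Rancher/Harvester versions) that a hook is satisfied by copying the value of the corresponding rke.cattle.io/pre-drain (or post-drain) annotation into the hook annotation, which is what Harvester's upgrade controller does automatically:

# Acknowledge the pre-drain hook on one machine plan secret (assumed contract, see above)
SECRET=custom-7bb31dfaa3bb-machine-plan
VAL=$(kubectl -n fleet-local get secret "$SECRET" \
  -o jsonpath='{.metadata.annotations.rke\.cattle\.io/pre-drain}')
kubectl -n fleet-local annotate secret "$SECRET" --overwrite "harvesterhci.io/pre-hook=$VAL"

The same pattern applies to the post-drain / harvesterhci.io/post-hook pair.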
The support bundle is here:
supportbundle_12c2d5c7-956a-4e26-bebb-dd4ec43dc5d8_2022-12-09T06-14-20Z.zip
P.S. I had tried this procedure several times without hitting the issue until now. It is more frequent when executing a regular Harvester upgrade.
With trace logs enabled on Rancher, I reproduced the issue using the same method in the same environment. Here's the support bundle: supportbundle_12c2d5c7-956a-4e26-bebb-dd4ec43dc5d8_2022-12-20T03-21-52Z.zip
Hope that helps!
I believe I have identified why this is occurring. Huge shout out to @starbops for helping me debug this/gathering me the corresponding logs for this.
https://github.com/rancher/rancher/pull/39101/commits/c6b6afd1d9147f8851505354dc0d1c0179faf2aa is a commit that introduces logic that attempts to continue determining drain status/updating a plan when the plan has been applied but its probes are failing. This seems to introduce an edge case where a valid but "old" plan may start having its probes fail (which can easily happen when the init node is restarted, for example), causing the planner to attempt to drain that node.
I'll need to think of how to prevent this edge case while also accommodating the original desired business logic defined in the PR/commit.
https://github.com/rancher/rancher/pull/41459 reverts the addition of the planAppliedButWaitingForProbes short-circuiting.
We can confirm the issue no longer occurs after bumping to the Rancher 2.7.5-rc releases; thanks!