[BUG] Multiple server nodes pre-drain during an RKE2 upgrade
Rancher Server Setup
- Rancher version: v2.6.9-rc2
- Installation option (Docker install/Helm Chart):
- If Helm Chart, Kubernetes Cluster and version (RKE1, RKE2, k3s, EKS, etc): v1.22.12+rke2r1 (Upgrade to v1.23.12+rke2r1)
- Proxy/Cert Details:
Information about the Cluster
- Kubernetes version: v1.22.12+rke2r1 (Upgrade to v1.23.12+rke2r1)
- Cluster Type (Local/Downstream): local
- If downstream, what type of cluster? (Custom/Imported or specify provider for Hosted/Infrastructure Provider):
User Information
- What is the role of the user logged in? (Admin/Cluster Owner/Cluster Member/Project Owner/Project Member/Custom)
- If custom, define the set of permissions:
Describe the bug
To Reproduce
- We trigger an RKE2 upgrade in Harvester (with pre-drain/post-drain hooks) on a 4-node cluster (3 servers, 1 worker):
$ kubectl edit clusters.provisioning.cattle.io local -n fleet-local
and edit the local cluster with the following (a quick verification command follows the YAML):
spec:
  kubernetesVersion: v1.23.12+rke2r1
  localClusterAuthEndpoint: {}
  rkeConfig:
    chartValues: null
    machineGlobalConfig: null
    provisionGeneration: 1
    upgradeStrategy:
      controlPlaneConcurrency: "1"
      controlPlaneDrainOptions:
        deleteEmptyDirData: true
        enabled: true
        force: true
        ignoreDaemonSets: true
        postDrainHooks:
        - annotation: harvesterhci.io/post-hook
        preDrainHooks:
        - annotation: harvesterhci.io/pre-hook
        timeout: 0
      workerConcurrency: "1"
      workerDrainOptions:
        deleteEmptyDirData: true
        enabled: true
        force: true
        ignoreDaemonSets: true
        postDrainHooks:
        - annotation: harvesterhci.io/post-hook
        preDrainHooks:
        - annotation: harvesterhci.io/pre-hook
        timeout: 0
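To double-check that the edit was persisted, the strategy fields can be read back; a small convenience sketch using the field paths from the spec above:

$ kubectl -n fleet-local get clusters.provisioning.cattle.io local \
    -o jsonpath='{.spec.rkeConfig.upgradeStrategy.controlPlaneConcurrency}{"\n"}{.spec.rkeConfig.upgradeStrategy.workerConcurrency}{"\n"}'

Both values should come back as 1 before the upgrade proceeds.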
Result
We observe that after the first node is upgraded, there is a high chance that both of the remaining server nodes have scheduling disabled. We also see that Rancher added the pre-drain hook annotation to both plan secrets, which indicates a pre-drain signal.
$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
node1 Ready control-plane,etcd,master 21d v1.23.12+rke2r1 <-- upgraded
node2 Ready,SchedulingDisabled control-plane,etcd,master 21d v1.23.12+rke2r1 <--
node3 Ready <none> 21d v1.22.12+rke2r1
node4 Ready,SchedulingDisabled control-plane,etcd,master 21d v1.22.12+rke2r1 <--
Expected Result
Only a single server node should have scheduling disabled at a time.
Screenshots
Additional context
Some observations:
- Node2's and node4's machine plan secrets both have the rke.cattle.io/pre-drain annotation set:
$ kubectl get machine -A
NAMESPACE NAME CLUSTER NODENAME PROVIDERID PHASE AGE VERSION
fleet-local custom-24d57cc6f506 local node1 rke2://node1 Running 21d
fleet-local custom-3865d0441591 local node2 rke2://node2 Running 21d
fleet-local custom-3994bff0f3f3 local node3 rke2://node3 Running 21d
fleet-local custom-fda201f64657 local node4 rke2://node4 Running 21d
$ kubectl get secret custom-3865d0441591-machine-plan -n fleet-local -o json | jq '.metadata.annotations."rke.cattle.io/pre-drain"'
"{\"IgnoreErrors\":false,\"deleteEmptyDirData\":true,\"disableEviction\":false,\"enabled\":true,\"force\":true,\"gracePeriod\":0,\"ignoreDaemonSets\":true,\"postDrainHooks\":[{\"annotation\":\"harvesterhci.io/post-hook\"}],\"preDrainHooks\":[{\"annotation\":\"harvesterhci.io/pre-hook\"}],\"skipWaitForDeleteTimeoutSeconds\":0,\"timeout\":0}"
$ kubectl get secret custom-fda201f64657-machine-plan -n fleet-local -o json | jq '.metadata.annotations."rke.cattle.io/pre-drain"'
"{\"IgnoreErrors\":false,\"deleteEmptyDirData\":true,\"disableEviction\":false,\"enabled\":true,\"force\":true,\"gracePeriod\":0,\"ignoreDaemonSets\":true,\"postDrainHooks\":[{\"annotation\":\"harvesterhci.io/post-hook\"}],\"preDrainHooks\":[{\"annotation\":\"harvesterhci.io/pre-hook\"}],\"skipWaitForDeleteTimeoutSeconds\":0,\"timeout\":0}"
- A similar issue was spotted and fixed a while ago: https://github.com/rancher/rancher/issues/35999, but in that issue the nodes in question were one server and one worker, not all servers.
- rancher_pod_logs.zip
@bk201 I'm struggling to reproduce this issue.
Do you think you would be able to provide an environment where this happens?
@Oats87 I'll try to create one.
It seems there is a subtle bug here that can occasionally cause this. Unfortunately, it is not easy to reproduce, and I have not been able to reproduce it.
Since this isn't reliably reproducible and has been occurring in previous versions as well, the release-blocker label has been removed.
What Harvester sets when upgrading is:
toUpdate.Spec.RKEConfig.ProvisionGeneration += 1
toUpdate.Spec.RKEConfig.UpgradeStrategy.ControlPlaneConcurrency = "1"
toUpdate.Spec.RKEConfig.UpgradeStrategy.WorkerConcurrency = "1"
toUpdate.Spec.RKEConfig.UpgradeStrategy.ControlPlaneDrainOptions.DeleteEmptyDirData = rke2DrainNodes
toUpdate.Spec.RKEConfig.UpgradeStrategy.ControlPlaneDrainOptions.Enabled = rke2DrainNodes
toUpdate.Spec.RKEConfig.UpgradeStrategy.ControlPlaneDrainOptions.Force = rke2DrainNodes
toUpdate.Spec.RKEConfig.UpgradeStrategy.ControlPlaneDrainOptions.IgnoreDaemonSets = &rke2DrainNodes
toUpdate.Spec.RKEConfig.UpgradeStrategy.WorkerDrainOptions.DeleteEmptyDirData = rke2DrainNodes
toUpdate.Spec.RKEConfig.UpgradeStrategy.WorkerDrainOptions.Enabled = rke2DrainNodes
toUpdate.Spec.RKEConfig.UpgradeStrategy.WorkerDrainOptions.Force = rke2DrainNodes
toUpdate.Spec.RKEConfig.UpgradeStrategy.WorkerDrainOptions.IgnoreDaemonSets = &rke2DrainNodes
According to this upgrade strategy, at most one control-plane node should be upgraded at a time, and the same applies to workers.
But the node status shows two control-plane nodes being upgraded at the same time, which means Rancher's control of the upgrade sequence is broken.
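One way to spot the violation from the node side is to list which control-plane nodes are cordoned; the count should never exceed controlPlaneConcurrency ("1" here). A rough sketch relying only on the standard node-role label and the unschedulable field:

$ kubectl get nodes -l node-role.kubernetes.io/control-plane \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.unschedulable}{"\n"}{end}'

More than one node printing true means two or more control-plane nodes are being drained/upgraded concurrently.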
https://github.com/harvester/harvester/issues/2907
node-0:~ # k get no
NAME STATUS ROLES AGE VERSION
node-0 Ready control-plane,etcd,master 5d11h v1.24.6+rke2r1
node-1 Ready,SchedulingDisabled control-plane,etcd,master 5d10h v1.24.6+rke2r1
node-2 Ready,SchedulingDisabled control-plane,etcd,master 5d10h v1.22.12+rke2r1
node-3 Ready <none> 5d10h v1.22.12+rke2r1
The current Harvester fix works as a workaround, but it may raise another question: Rancher starts the second control-plane node's upgrade earlier than expected while Harvester suspends it, so could this eventually cause Rancher to report a timeout for that node? @Oats87 is that possible? Thanks.
cc @bk201 @starbops (https://github.com/starbops)
From the support bundle attached in https://github.com/harvester/harvester/issues/2907#issue-1404168567, the log file logs/cattle-system/rancher-59cd8bb8f7-hmmbq/rancher.log shows multiple nodes draining at the same time.
The first planner log line is:
2022-10-11T11:23:09.471682431Z 2022/10/11 11:23:09 [INFO] [planner] rkecluster fleet-local/local: waiting: draining etcd node(s) custom-929d403d1670,custom-c05d0d11190c
The two machines, custom-929d403d1670 and custom-c05d0d11190c, are both control-plane nodes, and they are draining at the same time.
2022-10-11T11:23:09.471682431Z 2022/10/11 11:23:09 [INFO] [planner] rkecluster fleet-local/local: waiting: draining etcd node(s) custom-929d403d1670,custom-c05d0d11190c
2022-10-11T11:23:09.474096731Z 2022/10/11 11:23:09 [INFO] [planner] rkecluster fleet-local/local: waiting: draining etcd node(s) custom-929d403d1670,custom-c05d0d11190c
2022-10-11T11:23:10.762326566Z 2022/10/11 11:23:10 [ERROR] Failed to read API for groups map[autoscaling/v2:the server could not find the requested resource flowcontrol.apiserver.k8s.io/v1beta2:the server could not find the requested resource]
2022-10-11T11:23:14.327719194Z 2022/10/11 11:23:14 [INFO] [planner] rkecluster fleet-local/local: waiting: draining etcd node(s) custom-929d403d1670,custom-c05d0d11190c
2022-10-11T11:23:14.387075520Z 2022/10/11 11:23:14 [INFO] Watching metadata for autoscaling/v2beta1, Kind=HorizontalPodAutoscaler
2022-10-11T11:23:14.387115743Z 2022/10/11 11:23:14 [INFO] Stopping metadata watch on autoscaling/v1, Kind=HorizontalPodAutoscaler
2022-10-11T11:23:15.086488451Z 2022/10/11 11:23:15 [INFO] [planner] rkecluster fleet-local/local: waiting: draining etcd node(s) custom-929d403d1670,custom-c05d0d11190c
2022-10-11T11:23:15.086524671Z 2022/10/11 11:23:15 [INFO] [planner] rkecluster fleet-local/local: waiting: draining etcd node(s) custom-929d403d1670,custom-c05d0d11190c
2022-10-11T11:23:16.658870231Z 2022/10/11 11:23:16 [ERROR] Failed to handle tunnel request from remote address 10.52.1.49:53496: response 401: failed authentication
2022-10-11T11:23:19.447217540Z 2022/10/11 11:23:19 [INFO] [planner] rkecluster fleet-local/local: waiting: draining etcd node(s) custom-929d403d1670,custom-c05d0d11190c
2022-10-11T11:23:19.545281217Z 2022/10/11 11:23:19 [INFO] [planner] rkecluster fleet-local/local: waiting: draining etcd node(s) custom-929d403d1670,custom-c05d0d11190c
2022-10-11T11:23:21.681837805Z 2022/10/11 11:23:21 [ERROR] Failed to handle tunnel request from remote address 10.52.1.49:46494: response 401: failed authentication
2022-10-11T11:23:26.686803004Z 2022/10/11 11:23:26 [ERROR] Failed to handle tunnel request from remote address 10.52.1.49:46506: response 401: failed authentication
2022-10-11T11:23:29.436589166Z 2022/10/11 11:23:29 [INFO] [planner] rkecluster fleet-local/local: waiting: draining etcd node(s) custom-929d403d1670,custom-c05d0d11190c
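For anyone digging through the same support bundle, the concurrent drain is easy to surface by filtering the planner lines in that log file; for example:

$ grep 'draining etcd node(s)' logs/cattle-system/rancher-59cd8bb8f7-hmmbq/rancher.log | tail -n 5

Two machine names after "node(s)" on a single line is exactly the symptom described here.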
node-0:~ # k -n fleet-local get machines
NAME CLUSTER NODENAME PROVIDERID PHASE AGE VERSION
custom-2d94d5d682dc local node-3 rke2://node-3 Running 5d13h // worker
custom-7c1afab6e79d local node-0 rke2://node-0 Running 5d14h
custom-929d403d1670 local node-1 rke2://node-1 Running 5d14h // control-plane
custom-c05d0d11190c local node-2 rke2://node-2 Running 5d13h // control-plane
Today, I did another round of Harvester upgrade on a 4-node cluster and tried my best to collect all the rancher pods' logs with a simple script while upgrading:
Before rancher upgrade:
- rancher-7fd549bcc4-5twfg.txt
- rancher-7fd549bcc4-j8m7r.txt
- rancher-7fd549bcc4-dzdw8.txt
After rancher upgrade:
- rancher-65f8899dfb-ksnkx.txt
- rancher-65f8899dfb-58k7p.txt
- rancher-65f8899dfb-87k5t.txt
- rancher-65f8899dfb-26b6s.txt
- rancher-65f8899dfb-nf7p4.txt
- rancher-65f8899dfb-zrvsk.txt
- rancher-65f8899dfb-qkvd6.txt
- rancher-65f8899dfb-4mxd9.txt
In the middle of the upgrade, there was indeed a multi-node SchedulingDisabled situation (node-1 & node-2) after the first node (node-0) was upgraded and rebooted. But we had a workaround code snippet deployed in the upgrade controller, so the whole upgrade did not get stuck forever; it eventually went through to the end.
Here is some information you can cross-reference with the logs:
$ k -n fleet-local get machines
NAME CLUSTER NODENAME PROVIDERID PHASE AGE VERSION
custom-1b287700d314 local node-2 rke2://node-2 Running 25h // control-plane
custom-57aefc97a78e local node-3 rke2://node-3 Running 25h // worker
custom-ad79796f3d2a local node-1 rke2://node-1 Running 25h // control-plane
custom-cd36cfbeabf7 local node-0 rke2://node-0 Running 25h // control-plane (bootstrap node)
- Rancher upgrade from v2.6.4 to v2.6.9-rc5
- RKE2 upgrade from v1.22.12+rke2r1 to v1.24.7+rke2r1
cc @w13915984028
According to the source code (https://github.com/rancher/rancher/blob/release/v2.7/pkg/provisioningv2/rke2/planner/planner.go#L352):
err = p.reconcile(controlPlane, clusterSecretTokens, plan, true, etcdTier, isEtcd, isInitNodeOrDeleting, "1", joinServer, controlPlane.Spec.UpgradeStrategy.ControlPlaneDrainOptions)
...
err = p.reconcile(controlPlane, clusterSecretTokens, plan, true, controlPlaneTier, isControlPlane, isInitNodeOrDeleting, controlPlane.Spec.UpgradeStrategy.ControlPlaneConcurrency, joinServer, controlPlane.Spec.UpgradeStrategy.ControlPlaneDrainOptions)
etcdTier and controlPlaneTier are each reconciled with a concurrency of 1, but the two tiers may share the same nodes (e.g. three management nodes carrying both roles), which breaks the overall policy of ControlPlaneConcurrency = "1".
It could be that after the init node is upgraded, the planner upgrades another two nodes in parallel; sometimes this succeeds and sometimes it does not.
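On a Harvester cluster the overlap is easy to confirm, because every management node carries both roles. A quick check (it only looks for the presence of both role labels, not what the planner actually selects):

$ kubectl get nodes -l 'node-role.kubernetes.io/etcd,node-role.kubernetes.io/control-plane'

Every node returned here belongs to both the etcd tier and the control-plane tier iterated by the two reconcile calls above.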
@starbops Your last test log shows that.
@Oats87 Are the above comments helpful, or do you still need a live environment that reproduces this issue? Thanks!
@bk201 I've been working to try and reproduce this but I have not been able to do so. Have you folks found an accurate reproducer for this?
@Oats87 We'll try to create one and get back to you. Thanks!
Hi @Oats87, I successfully reproduced the issue on a 3-node Harvester cluster in our environment, though it's not always reproducible. I left the environment intact in case you are interested in looking into it.

For simplicity, and to avoid the lengthy upgrade process, I didn't trigger the normal Harvester upgrade flow. Instead, I did the following (upgrading only RKE2):
- Prepare a v1.0.3 Harvester cluster (RKE2 version is v1.22.12+rke2r1, Rancher version is v2.6.4-harvester3)
- Upgrade Rancher to v2.6.9 with the following script
#!/usr/bin/env sh
set -ex

trap cleanup EXIT
cleanup() {
  if [ -n "$TEMP_DIR" ]; then
    \rm -vrf "$TEMP_DIR"
  fi
}

RANCHER_VERSION=${1:-v2.6.9}
TEMP_DIR=$(mktemp -d -p /tmp)

# Extract the Rancher chart from the system-agent installer image, then upgrade
# the existing Helm release in place, keeping all current values.
wharfie rancher/system-agent-installer-rancher:"$RANCHER_VERSION" "$TEMP_DIR"
pushd "$TEMP_DIR"
helm upgrade rancher ./rancher-"${RANCHER_VERSION#v}".tgz --reuse-values --set rancherImageTag="$RANCHER_VERSION" --namespace cattle-system --wait
popd
kubectl -n cattle-system rollout status deploy rancher
- Simulate the Harvester upgrade by patching clusters.provisioning.cattle.io with the command:
$ kubectl -n fleet-local patch clusters.provisioning.cattle.io local --type merge --patch-file ./upgrade-patch.yaml
The patch is the following:
spec:
  kubernetesVersion: v1.24.7+rke2r1
  rkeConfig:
    provisionGeneration: 1
    upgradeStrategy:
      controlPlaneConcurrency: "1"
      workerConcurrency: "1"
      controlPlaneDrainOptions:
        deleteEmptyDirData: true
        enabled: true
        force: true
        ignoreDaemonSets: true
        postDrainHooks:
        - annotation: "harvesterhci.io/post-hook"
        preDrainHooks:
        - annotation: "harvesterhci.io/pre-hook"
      workerDrainOptions:
        deleteEmptyDirData: true
        enabled: true
        force: true
        ignoreDaemonSets: true
        postDrainHooks:
        - annotation: "harvesterhci.io/post-hook"
        preDrainHooks:
        - annotation: "harvesterhci.io/pre-hook"
- The upgrade starts on the first node, harvester-node-0. The machine plan secret custom-7bb31dfaa3bb-machine-plan gets the rke.cattle.io/pre-drain annotation.
- Manually annotate custom-7bb31dfaa3bb-machine-plan with harvesterhci.io/pre-hook, just like the normal upgrade flow of Harvester does (see the command sketch after this list).
- The first node, harvester-node-0, starts to drain its pods.
- After the drain is done, custom-7bb31dfaa3bb-machine-plan is annotated with rke.cattle.io/post-drain.
- Manually annotate custom-7bb31dfaa3bb-machine-plan with harvesterhci.io/post-hook, just like the normal upgrade flow of Harvester does.
- The first node's upgrade is done.
- The upgrade starts on the second node, harvester-node-2. The machine plan secret custom-bb2ddb6fb772-machine-plan gets the rke.cattle.io/pre-drain annotation.
- Manually annotate custom-bb2ddb6fb772-machine-plan with harvesterhci.io/pre-hook.
- The second node, harvester-node-2, starts to drain its pods.
- The second node's drain is done; custom-bb2ddb6fb772-machine-plan is annotated with rke.cattle.io/post-drain.
- Somehow, the third node, harvester-node-1, is cordoned off and also has rke.cattle.io/pre-drain annotated.
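For reference, the manual hook acknowledgement in the steps above was done roughly as follows. This is only a sketch: it assumes (please verify against your Rancher/Harvester versions) that a hook is satisfied by copying the value of the corresponding rke.cattle.io/pre-drain (or post-drain) annotation into the hook annotation, which is what Harvester's upgrade controller does automatically:

# Acknowledge the pre-drain hook on one machine plan secret (assumed contract, see above)
SECRET=custom-7bb31dfaa3bb-machine-plan
VAL=$(kubectl -n fleet-local get secret "$SECRET" \
  -o jsonpath='{.metadata.annotations.rke\.cattle\.io/pre-drain}')
kubectl -n fleet-local annotate secret "$SECRET" --overwrite "harvesterhci.io/pre-hook=$VAL"

The same pattern applies to the post-drain / harvesterhci.io/post-hook pair.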
The support bundle is here:
supportbundle_12c2d5c7-956a-4e26-bebb-dd4ec43dc5d8_2022-12-09T06-14-20Z.zip
P.S. I had tried this procedure several times without hitting the issue until now. It is more frequent when executing a regular Harvester upgrade.
With trace logs enabled on Rancher, I reproduced the issue using the same method in the same environment. Here's the support bundle: supportbundle_12c2d5c7-956a-4e26-bebb-dd4ec43dc5d8_2022-12-20T03-21-52Z.zip
Hope that helps!
I believe I have identified why this is occurring. Huge shout out to @starbops for helping me debug this/gathering me the corresponding logs for this.
https://github.com/rancher/rancher/pull/39101/commits/c6b6afd1d9147f8851505354dc0d1c0179faf2aa is a commit that introduces logic that attempts to continue determining drain status/updating a plan when the plan has been applied but its probes are failing. This seems to introduce an edge case where a valid but "old" plan may start having its probes fail (which can easily happen when the init node is restarted, for example), causing the planner to attempt to drain that node.
I'll need to think of how to prevent this edge case while also accommodating the original desired business logic defined in the PR/commit.
https://github.com/rancher/rancher/pull/41459 reverts the addition of the planAppliedButWaitingForProbes short-circuiting.
We can confirm the issue no longer occurs after bumping to the Rancher 2.7.5-rc releases; thanks!