
Rancher new cluster node registration failing

Open P-n-I opened this issue 1 year ago • 18 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues

Current Behavior

When trying to register a new node with a new downstream RKE2 cluster in Rancher 2.7.9 (also 2.7.5), we see that the node's plan Secret is never populated, so the rancher-system-agent endlessly polls for a plan.

If we re-deploy the fleet-agent Deployment prior to creating the new downstream cluster definition in Rancher, we can occasionally register nodes.

We have to re-deploy fleet-agent each time we need to create a new cluster, though this does not consistently work around the issue.

  • re-deploy fleet-agent Deployment on the Rancher cluster (k -n cattle-fleet-local-system rollout restart deployment fleet-agent)
  • create new downstream cluster definition
  • register node(s) to cluster

If the registration fails or we need to re-create the cluster, we wipe the nodes, delete the cluster from Rancher, and repeat the steps above.

From the fleet-controller logs when creating the downstream cluster named "test":

2024-01-09T14:14:27.714641430Z time="2024-01-09T14:14:27Z" level=info msg="While calculating status.ResourceKey, error running helm template for bundle mcc-test-managed-system-upgrade-controller with target options from : chart requires kubeVersion: >= 1.23.0-0 which is incompatible with Kubernetes v1.20.0"

The workaround of restarting the fleet-agent is not consistent; sometimes repeated manual loops of create cluster, register, delete cluster do work.

Registration of nodes to k3s clusters appears to work, though I've not tested that as much.

Expected Behavior

We can register nodes to newly created downstream clusters.

Steps To Reproduce

  • create new rke2 cluster
  • run registration command on cluster bootstrap node

Environment

- Architecture: x86_64
- Fleet Version: 0.7.1 and 0.8.1
- Cluster:
  - Provider: rke2
  - Options:
  - Kubernetes Version: v1.26.11+rke2r1

Logs

Logs from fleet-agent after a restart followed by a failed node registration:

I0109 14:34:16.884697       1 leaderelection.go:248] attempting to acquire leader lease cattle-fleet-local-system/fleet-agent-lock...
2024-01-09T14:34:20.761215643Z I0109 14:34:20.760567       1 leaderelection.go:258] successfully acquired lease cattle-fleet-local-system/fleet-agent-lock
2024-01-09T14:34:21.514842587Z time="2024-01-09T14:34:21Z" level=info msg="Starting /v1, Kind=ServiceAccount controller"
2024-01-09T14:34:21.515239711Z time="2024-01-09T14:34:21Z" level=info msg="Starting /v1, Kind=Secret controller"
2024-01-09T14:34:21.515651076Z time="2024-01-09T14:34:21Z" level=info msg="Starting /v1, Kind=Node controller"
2024-01-09T14:34:21.515921289Z time="2024-01-09T14:34:21Z" level=info msg="Starting /v1, Kind=ConfigMap controller"
2024-01-09T14:34:22.245467409Z E0109 14:34:22.245355       1 memcache.go:206] couldn't get resource list for management.cattle.io/v3: 
time="2024-01-09T14:34:22Z" level=info msg="Starting fleet.cattle.io/v1alpha1, Kind=BundleDeployment controller"
time="2024-01-09T14:34:22Z" level=info msg="Deploying bundle cluster-fleet-local-local-1a3d67d0a899/fleet-agent-local"
time="2024-01-09T14:34:22Z" level=info msg="getting history for release fleet-agent-local"
time="2024-01-09T14:34:22Z" level=info msg="Deploying bundle cluster-fleet-local-local-1a3d67d0a899/fleet-agent-local"
time="2024-01-09T14:34:23Z" level=info msg="Deleting orphan bundle ID rke2, release kube-system/rke2-canal"
time="2024-01-09T14:34:24Z" level=info msg="Deploying bundle cluster-fleet-local-local-1a3d67d0a899/fleet-agent-local"
time="2024-01-09T14:34:25Z" level=info msg="Deploying bundle cluster-fleet-local-local-1a3d67d0a899/fleet-agent-local"

Logs from fleet-agent after a restart, create new cluster and successful registration:

I0109 14:37:40.958163       1 leaderelection.go:248] attempting to acquire leader lease cattle-fleet-local-system/fleet-agent-lock...
2024-01-09T14:37:44.767848536Z I0109 14:37:44.767654       1 leaderelection.go:258] successfully acquired lease cattle-fleet-local-system/fleet-agent-lock
2024-01-09T14:37:45.799901278Z time="2024-01-09T14:37:45Z" level=info msg="Starting /v1, Kind=ConfigMap controller"
2024-01-09T14:37:45.799938559Z time="2024-01-09T14:37:45Z" level=info msg="Starting /v1, Kind=Secret controller"
2024-01-09T14:37:45.799944609Z time="2024-01-09T14:37:45Z" level=info msg="Starting /v1, Kind=Node controller"
2024-01-09T14:37:45.799949489Z time="2024-01-09T14:37:45Z" level=info msg="Starting /v1, Kind=ServiceAccount controller"
E0109 14:37:45.966607       1 memcache.go:206] couldn't get resource list for management.cattle.io/v3: 
2024-01-09T14:37:45.991817525Z time="2024-01-09T14:37:45Z" level=info msg="Starting fleet.cattle.io/v1alpha1, Kind=BundleDeployment controller"
2024-01-09T14:37:45.992046547Z time="2024-01-09T14:37:45Z" level=info msg="Deploying bundle cluster-fleet-local-local-1a3d67d0a899/fleet-agent-local"
2024-01-09T14:37:46.002690980Z time="2024-01-09T14:37:46Z" level=info msg="getting history for release fleet-agent-local"
2024-01-09T14:37:46.255440243Z time="2024-01-09T14:37:46Z" level=info msg="Deploying bundle cluster-fleet-local-local-1a3d67d0a899/fleet-agent-local"
2024-01-09T14:37:47.041131051Z time="2024-01-09T14:37:47Z" level=info msg="Deleting orphan bundle ID rke2, release kube-system/rke2-canal"
2024-01-09T14:37:48.276516222Z time="2024-01-09T14:37:48Z" level=info msg="Deploying bundle cluster-fleet-local-local-1a3d67d0a899/fleet-agent-local"
2024-01-09T14:37:48.527326573Z time="2024-01-09T14:37:48Z" level=info msg="Deploying bundle cluster-fleet-local-local-1a3d67d0a899/fleet-agent-local"

Anything else?

Ref https://github.com/rancher/rancher/issues/43901 specifically https://github.com/rancher/rancher/issues/43901#issuecomment-1881021356

P-n-I avatar Jan 09 '24 14:01 P-n-I

From the logs when creating the cluster in Rancher:

fleet-agent:

W0110 08:31:07.744207       1 reflector.go:442] pkg/mod/github.com/rancher/[email protected]/tools/cache/reflector.go:167: watch of *v1alpha1.BundleDeployment ended with: an error on the server ("unable to decode an event from the watch stream: stream error: stream ID 7; INTERNAL_ERROR; received from peer") has prevented the request from succeeding

fleet-controller:

time="2024-01-10T08:31:02Z" level=info msg="While calculating status.ResourceKey, error running helm template for bundle mcc-dev-sandbox-managed-system-upgrade-controller with target options from : chart requires kubeVersion: >= 1.23.0-0 which is incompatible with Kubernetes v1.20.0"

P-n-I avatar Jan 10 '24 08:01 P-n-I

Contents of the cluster's mcc bundle Chart.yaml:

annotations:
  catalog.cattle.io/certified: rancher
  catalog.cattle.io/hidden: "true"
  catalog.cattle.io/kube-version: '>= 1.23.0-0 < 1.27.0-0'
  catalog.cattle.io/namespace: cattle-system
  catalog.cattle.io/os: linux
  catalog.cattle.io/permits-os: linux,windows
  catalog.cattle.io/rancher-version: '>= 2.7.0-0 < 2.8.0-0'
  catalog.cattle.io/release-name: system-upgrade-controller
apiVersion: v1
appVersion: v0.11.0
description: General purpose controller to make system level updates to nodes.
home: https://github.com/rancher/system-charts/blob/dev-v2.7/charts/rancher-k3s-upgrader
kubeVersion: '>= 1.23.0-0'
name: system-upgrade-controller
sources:
- https://github.com/rancher/system-charts/blob/dev-v2.7/charts/rancher-k3s-upgrader
version: 102.1.0+up0.5.0

The downstream cluster we're seeing the issue with is v1.26.11+rke2r1.

P-n-I avatar Jan 10 '24 11:01 P-n-I

Debug log output from the fleet-controller when creating a new downstream cluster:


time="2024-01-11T11:08:09Z" level=debug msg="OnBundleChange for bundle 'test-managed-system-agent', checking targets, calculating changes, building objects"
time="2024-01-11T11:08:09Z" level=debug msg="shorten bundle name test-managed-system-agent to test-managed-system-agent"
time="2024-01-11T11:08:09Z" level=debug msg="OnBundleChange for bundle 'test-managed-system-agent' took 32.236433ms"
time="2024-01-11T11:08:09Z" level=debug msg="OnPurgeOrphaned for bundle 'test-managed-system-agent' change, checking if gitrepo still exists"
time="2024-01-11T11:08:09Z" level=debug msg="OnBundleChange for bundle 'test-managed-system-agent', checking targets, calculating changes, building objects"
time="2024-01-11T11:08:09Z" level=debug msg="OnBundleChange for bundle 'test-managed-system-agent' took 183.411µs"
time="2024-01-11T11:08:09Z" level=debug msg="OnPurgeOrphaned for bundle 'test-managed-system-agent' change, checking if gitrepo still exists"
time="2024-01-11T11:08:10Z" level=debug msg="OnBundleChange for bundle 'mcc-test-managed-system-upgrade-controller', checking targets, calculating changes, building objects"
time="2024-01-11T11:08:10Z" level=debug msg="shorten bundle name mcc-test-managed-system-upgrade-controller to mcc-test-managed-system-upgrade-controller"
time="2024-01-11T11:08:10Z" level=info msg="While calculating status.ResourceKey, error running helm template for bundle mcc-test-managed-system-upgrade-controller with target options from : chart requires kubeVersion: >= 1.23.0-0 which is incompatible with Kubernetes v1.20.0"
time="2024-01-11T11:08:10Z" level=debug msg="OnBundleChange for bundle 'mcc-test-managed-system-upgrade-controller' took 5.27411ms"
time="2024-01-11T11:08:10Z" level=debug msg="OnPurgeOrphaned for bundle 'mcc-test-managed-system-upgrade-controller' change, checking if gitrepo still exists"
time="2024-01-11T11:08:10Z" level=debug msg="OnBundleChange for bundle 'mcc-test-managed-system-upgrade-controller', checking targets, calculating changes, building objects"
time="2024-01-11T11:08:10Z" level=debug msg="OnBundleChange for bundle 'mcc-test-managed-system-upgrade-controller' took 289.752µs"
time="2024-01-11T11:08:10Z" level=debug msg="OnPurgeOrphaned for bundle 'mcc-test-managed-system-upgrade-controller' change, checking if gitrepo still exists"

P-n-I avatar Jan 11 '24 13:01 P-n-I

I don't know Golang at all, but I've been digging around trying to find out if there's something wrong in our clusters.

tag: release/v0.8.1+security1

controller.OnBundleChange → controller.setResourceKey → helmdeployer.Template (sets Helm defaults, including useGlobalCfg: true and globalCfg.Capabilities = chartutil.DefaultCapabilities) → Helm.Deploy → Helm.install → Helm.getCfg (if useGlobalCfg, return globalCfg)

So at this point it's using the globalCfg, which has the default (1.20.0) as the kubeVersion in Capabilities; therefore Helm.install doesn't execute cfg.RESTClientGetter.ToRESTMapper().

I can't find useGlobalCfg being set anywhere other than to true in Template, so I think it's unset when called via agent.manager and is therefore the bool default: false.
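
For what it's worth, the same check can be reproduced outside Fleet with plain Helm: helm template validates a chart's kubeVersion constraint against whatever capabilities it is given, and in client-only mode it falls back to the built-in default capabilities (v1.20.0 for the Helm version vendored here), which matches the error above. A minimal sketch, assuming the system-upgrade-controller chart has been pulled locally to ./system-upgrade-controller:

# Rendering against the client-only default capabilities reproduces the failure
helm template suc ./system-upgrade-controller --kube-version v1.20.0
# => Error: chart requires kubeVersion: >= 1.23.0-0 which is incompatible with Kubernetes v1.20.0

# Passing the real downstream version renders fine
helm template suc ./system-upgrade-controller --kube-version v1.26.11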

P-n-I avatar Jan 11 '24 15:01 P-n-I

I've done some local hacking of the code to add some logging and changed the fleet-controller Deployment to use our in-house hacked version.

This is the output when creating a new cluster called bobbins:

time="2024-01-12T15:22:34Z" level=info msg="ASK4 OnBundleChange bobbins-managed-system-agent"
time="2024-01-12T15:22:34Z" level=info msg="ASK4 OnBundleChange bobbins-managed-system-agent matchedTargets 0"
time="2024-01-12T15:22:34Z" level=info msg="ASK4 OnBundleChange mcc-bobbins-managed-system-upgrade-controller"
time="2024-01-12T15:22:34Z" level=info msg="ASK4 OnBundleChange mcc-bobbins-managed-system-upgrade-controller matchedTargets 0"
time="2024-01-12T15:22:34Z" level=info msg="ASK4 OnBundleChange mcc-bobbins-managed-system-upgrade-controller calling setResourceKey"
time="2024-01-12T15:22:34Z" level=info msg="ASK4 in setResourceKey mcc-bobbins-managed-system-upgrade-controller"
time="2024-01-12T15:22:34Z" level=info msg="ASK4 Template with useGlobalCg : true"
time="2024-01-12T15:22:34Z" level=info msg="ASK4 Template patched with useGlobalCg : true"
time="2024-01-12T15:22:34Z" level=info msg="ASK4 Template calling Deploy"
time="2024-01-12T15:22:34Z" level=info msg="ASK4 Helm.Deploy"
time="2024-01-12T15:22:34Z" level=info msg="ASK4 Helm.install for bundle mcc-bobbins-managed-system-upgrade-controller"
time="2024-01-12T15:22:34Z" level=info msg="ASK4 Helm.install cfg kubeversion v1.20.0"
time="2024-01-12T15:22:34Z" level=info msg="While calculating status.ResourceKey, error running helm template for bundle mcc-bobbins-managed-system-upgrade-controller with target options from : chart requires kubeVersion: >= 1.23.0-0 which is incompatible with Kubernetes v1.20.0"

P-n-I avatar Jan 12 '24 15:01 P-n-I

I hacked the fleet code to work around the "chart requires kubeVersion: >= 1.23.0-0" issue, built a new fleet-controller container, and updated the fleet-controller Deployment on our dev cluster to run it; it has made no difference to the problem of the machine-plan Secret not being populated with data.

That unrelated kubeVersion issue relates to the bundle mcc-<cluster>-managed-system-upgrade-controller.

The issue remains that the node's custom-<id>-machine-plan Secret doesn't get populated, so the rancher-system-agent endlessly polls Rancher.
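
To watch this happening while a node joins, a minimal sketch (assuming, as above, that the plan Secrets live in fleet-default and k is an alias for kubectl):

# Watch all machine-plan Secrets; a stuck node's Secret stays at DATA 0 until the planner delivers a plan
k -n fleet-default get secrets --field-selector type=rke.cattle.io/machine-plan -w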

P-n-I avatar Jan 17 '24 15:01 P-n-I

rancher-system-agent output with CATTLE_AGENT_LOGLEVEL=debug:

Jan 17 15:34:23 packer systemd[1]: Started Rancher System Agent.
Jan 17 15:34:23 packer rancher-system-agent[18569]: time="2024-01-17T15:34:23Z" level=info msg="Rancher System Agent version v0.3.3 (9e827a5) is starting"
Jan 17 15:34:23 packer rancher-system-agent[18569]: time="2024-01-17T15:34:23Z" level=info msg="Using directory /var/lib/rancher/agent/work for work"
Jan 17 15:34:23 packer rancher-system-agent[18569]: time="2024-01-17T15:34:23Z" level=debug msg="Instantiated new image utility with imagesDir: /var/lib/rancher/agent/images, imageCredentialProviderConfig: /var/lib/rancher/credentialprovider/config.yaml, imageCredentialProviderBinDir: /var/lib/rancher/credentialprovider/bin, agentRegistriesFile: /etc/rancher/agent/registries.yaml"
Jan 17 15:34:23 packer rancher-system-agent[18569]: time="2024-01-17T15:34:23Z" level=info msg="Starting remote watch of plans"
Jan 17 15:34:27 packer rancher-system-agent[18569]: E0117 15:34:27.619141   18569 memcache.go:206] couldn't get resource list for management.cattle.io/v3:
Jan 17 15:34:27 packer rancher-system-agent[18569]: time="2024-01-17T15:34:27Z" level=info msg="Starting /v1, Kind=Secret controller"
Jan 17 15:34:27 packer rancher-system-agent[18569]: time="2024-01-17T15:34:27Z" level=debug msg="[K8s] Processing secret custom-aede8c2b641f-machine-plan in namespace fleet-default at generation 0 with resource version 48393246"

and

k -n fleet-default get secret custom-aede8c2b641f-machine-plan
NAME                               TYPE                         DATA   AGE
custom-aede8c2b641f-machine-plan   rke.cattle.io/machine-plan   0      101s

P-n-I avatar Jan 17 '24 15:01 P-n-I

I'm having the exact same issue. Is there any workaround to get past it, or maybe a specific version to use?

rgomez-eng avatar Jan 18 '24 21:01 rgomez-eng

I've not found a workaround; sometimes a registration works, but mostly it's stuck on the empty machine-plan for us.

P-n-I avatar Jan 19 '24 09:01 P-n-I

@rgomez-eng a long shot, but are you registering the node(s) with all three roles, or have the problematic nodes got a subset of the etcd, controlplane and worker roles? Check the logs from the Rancher pods for occurrences of

[INFO] [planner] rkecluster fleet-default/<CLUSTER NAME>: waiting for at least one control plane, etcd, and worker node to be registered

This implies that a node with one of those roles isn't registered. Until each of the three roles is fulfilled by at least one registered node, the cluster is not considered 'sane' and no node plan is delivered, so the rancher-system-agent endlessly polls for the plan Secret.
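
For example, something along these lines against the local cluster (assuming a default Rancher install in cattle-system with the app=rancher label):

# Search the Rancher pods for the planner message above
k -n cattle-system logs -l app=rancher --tail=-1 | grep "waiting for at least one control plane, etcd, and worker node"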

P-n-I avatar Jan 26 '24 13:01 P-n-I

We're not able to reliably re-create the issue and don't have the time to investigate further. Sometimes it just takes a while for the plan Secret to populate, even though we have 6 nodes (3 etcd/control plane, 3 worker) waiting to join.

P-n-I avatar Jan 29 '24 10:01 P-n-I

We're still seeing this issue:

fleet-default                                    custom-2a6a4339a879-machine-plan                           rke.cattle.io/machine-plan            0	 14m
fleet-default                                    custom-638e801c6183-machine-plan                           rke.cattle.io/machine-plan            0	 14m
fleet-default                                    custom-60117bd68fdd-machine-plan                           rke.cattle.io/machine-plan            0	 15m
fleet-default                                    custom-4623c9642380-machine-plan                           rke.cattle.io/machine-plan            0	 16m
fleet-default                                    custom-6aa4c775aee0-machine-plan                           rke.cattle.io/machine-plan            0	 16m
fleet-default                                    custom-a8dd42fd6fcc-machine-plan                           rke.cattle.io/machine-plan            0	 16m


P-n-I avatar Feb 15 '24 08:02 P-n-I

We use Ansible to register the nodes (pull the registration command from the Rancher API and run it on each node).

The first set of nodes to get registered are the ones with control and etcd roles. After they're registered we register worker nodes.

Rancher won't, by design, populate the machine-plans for the nodes until at least one node of each role is registered.

I tried manually registering 3 nodes that had all three roles but still see the machine-plan having 0 bytes.
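
One way to sanity-check which roles Rancher actually considers registered is to look at the CAPI Machine objects it creates for the cluster; a hedged sketch, assuming they sit in fleet-default alongside the plan Secrets:

# List the machines Rancher knows about, with labels; the role labels show which of
# etcd / control plane / worker the planner thinks are present
k -n fleet-default get machines.cluster.x-k8s.io --show-labels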

P-n-I avatar Feb 15 '24 12:02 P-n-I

Does it still happen with Rancher 2.8.1?

kkaempf avatar Feb 19 '24 08:02 kkaempf

I've just upgraded to 2.8.2 and wanted to re-create the node in a single-node k3s cluster. I had deleted the node from the downstream cluster while on 2.7.9, then got notified of your comment, so I upgraded to 2.8.2. Joining the node to the cluster is still stuck on an empty machine-plan Secret.

fleet-default                                    custom-624a3f13e536-machine-plan                           rke.cattle.io/machine-plan            0      8m32s

Provisioning log:

[INFO ] waiting for infrastructure ready
[INFO ] waiting for at least one control plane, etcd, and worker node to be registered
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for bootstrap etcd to be available
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for agent to check in and apply initial plan
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for probes: kube-scheduler
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for cluster agent to connect
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for probes: kube-apiserver, kube-controller-manager, kube-scheduler, kubelet
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for cluster agent to connect
[INFO ] non-ready bootstrap machine(s) custom-624a3f13e536 and join url to be available on bootstrap node
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for cluster agent to connect
[INFO ] marking control plane as initialized and ready
[INFO ] provisioning done
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for plan to be applied
[INFO ] provisioning done
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for cluster agent to connect
[INFO ] provisioning done
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for cluster agent to connect
[INFO ] provisioning done
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for cluster agent to connect
[INFO ] provisioning done
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for probes: kube-apiserver
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for probes: kube-apiserver, kube-controller-manager, kube-scheduler, kubelet
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for cluster agent to connect
[INFO ] custom-624a3f13e536
[INFO ] provisioning done
[INFO ] custom-624a3f13e536
[INFO ] provisioning done
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for cluster agent to connect
[INFO ] provisioning done
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for cluster agent to connect
[INFO ] provisioning done
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for probes: kube-apiserver, kube-controller-manager, kube-scheduler, kubelet
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for probes: kubelet
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for cluster agent to connect
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: Node condition Ready is False., waiting for cluster agent to connect
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for cluster agent to connect
[INFO ] provisioning done
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for probes: kube-apiserver, kube-controller-manager, kube-scheduler, kubelet
[INFO ] provisioning done
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for probes: kube-apiserver
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for probes: kube-apiserver, kube-controller-manager, kube-scheduler, kubelet
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for probes: kube-controller-manager, kube-scheduler, kubelet
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for cluster agent to connect
[INFO ] custom-624a3f13e536
[INFO ] provisioning done
[INFO ] custom-624a3f13e536
[INFO ] provisioning done
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for plan to be applied
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for kubelet to update
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for cluster agent to connect
[INFO ] provisioning done
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for cluster agent to connect
[INFO ] custom-624a3f13e536
[INFO ] provisioning done
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for probes: kube-apiserver
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for probes: kube-apiserver, kube-controller-manager, kube-scheduler, kubelet
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for cluster agent to connect
[INFO ] provisioning done
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for cluster agent to connect
[INFO ] custom-624a3f13e536
[INFO ] provisioning done
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for probes: kube-apiserver, kube-controller-manager, kube-scheduler, kubelet
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for probes: kubelet
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for cluster agent to connect
[INFO ] provisioning done
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for cluster agent to connect
[INFO ] custom-624a3f13e536
[INFO ] provisioning done
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for probes: kube-apiserver
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for probes: kube-apiserver, kube-controller-manager, kube-scheduler, kubelet
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for cluster agent to connect
[INFO ] provisioning done
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for probes: kube-apiserver, kube-controller-manager, kube-scheduler, kubelet
[INFO ] provisioning done
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for probes: kube-apiserver, kube-controller-manager, kube-scheduler, kubelet
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for cluster agent to connect
[INFO ] provisioning done
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for probes: kube-apiserver
[INFO ] provisioning done
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for probes: kube-apiserver, kube-controller-manager, kube-scheduler, kubelet
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for cluster agent to connect
[INFO ] provisioning done
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for cluster agent to connect
[INFO ] custom-624a3f13e536
[INFO ] provisioning done
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for cluster agent to connect
[INFO ] custom-624a3f13e536
[INFO ] provisioning done
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for probes: kube-apiserver
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for probes: kube-apiserver, kube-controller-manager, kube-scheduler, kubelet
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for cluster agent to connect
[INFO ] provisioning done
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for cluster agent to connect
[INFO ] provisioning done
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for probes: kube-apiserver
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for probes: kube-apiserver, kube-controller-manager, kube-scheduler, kubelet
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for probes: kubelet
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for cluster agent to connect
[INFO ] provisioning done
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for cluster agent to connect
[INFO ] provisioning done
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for probes: kube-apiserver, kube-controller-manager, kube-scheduler, kubelet
[INFO ] configuring bootstrap node(s) custom-624a3f13e536: waiting for cluster agent to connect
[INFO ] provisioning done
[INFO ] rkecontrolplane was already initialized but no etcd machines exist that have plans, indicating the etcd plane has been entirely replaced. Restoration from etcd snapshot is required.

P-n-I avatar Feb 19 '24 11:02 P-n-I

Rancher 2.8.2: created a new k3s cluster and registered one node to it. Destroyed the cluster and re-created it with the same name, ran k3s-uninstall on the node and joined it to the new cluster, and we see what looks like the old node from the first cluster attempt in the UI (screenshot omitted). Note the age of the cluster and the working node versus the age of the node in error.

P-n-I avatar Mar 07 '24 14:03 P-n-I

We have a similar issue caused by an empty machine-plan for the new nodes in a new cluster. A workaround that helped was this:

  1. Run the command for joining the 1st master (don't wait; go straight to the second step)
  2. Run the command for joining the 1st worker. You will see the 1st master change its status from WaitingNodeRef
  3. Run the command on the 2nd and 3rd masters. After that, the cattle-cluster-agents will come up and worker 1 will change its status from WaitingNodeRef
  4. Join the other workers

Suddenly the plan for master-1 was populated and the cluster bootstrapping started. We absolutely have no idea why the hell this works... Note: the fleet-agent is version 0.8.1.
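
For anyone following along, the join order above comes down to which role flags go on the registration command Rancher generates for the custom cluster (Cluster Management → the cluster → Registration tab). A rough sketch of the shape of that command, with placeholder server and token (the real one also carries a CA checksum and labels):

# 1st master (etcd + control plane), run first; don't wait for it to finish bootstrapping
curl -fL https://<rancher-url>/system-agent-install.sh | sudo sh -s - \
  --server https://<rancher-url> --token <token> --etcd --controlplane
# 1st worker, run immediately afterwards
curl -fL https://<rancher-url>/system-agent-install.sh | sudo sh -s - \
  --server https://<rancher-url> --token <token> --worker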

kfehrenbach avatar Mar 26 '24 05:03 kfehrenbach

We might have found a cause in our env: we upped the timeout on the load balancer we have in front of the nodes running Rancher, as we thought it was probably killing the rancher-system-agent's watch over the websocket.

P-n-I avatar Jul 10 '24 11:07 P-n-I

@kfehrenbach your workaround sorted this issue out for me, thanks. Did you ever find a more permanent solution?

walker-tom avatar Dec 12 '24 11:12 walker-tom

I'm also having the issue: the custom-xyz-machine-plan secret is empty, and the workaround above didn't help. Maybe because of an older fleet-agent: v0.7.1.

Rancher v2.8.5, RKE2 v1.28.15+rke2r1

vonhutrong avatar Dec 24 '24 10:12 vonhutrong

Still getting this issue; the workaround hasn't helped so far. Rancher v2.8.5, rancher/fleet-agent:v0.9.5, RKE2 v1.28.15+rke2r1.

Upgraded to Rancher v2.9.3 and it seems to be exactly the same: one control node registers OK, triggering the other control nodes to try, but none of the worker nodes (registered 10s after the control ones) do. Worker node machine plans:

k -n fleet-default get secret | grep machine-plan | sort -k 3 | head -n 10
custom-2214fa96e9a5-machine-plan                                      rke.cattle.io/machine-plan                    0      9m35s
custom-5c1309321ed5-machine-plan                                      rke.cattle.io/machine-plan                    0      9m35s
custom-72951cae746b-machine-plan                                      rke.cattle.io/machine-plan                    0      9m35s

I tried the workaround (registering a single control and a worker in quick succession), but they both stay stuck on "Waiting for node ref".

I have managed to hack around this by registering a worker then a control node in quick succession, then joining all the other nodes I need.

P-n-I avatar Jan 08 '25 11:01 P-n-I

What exactly do you mean by "in quick succession"? I have the same issue, also with Rancher v2.8.5 and v1.28.15+rke2r1. The only machine-plan secret that is being filled is the one for the control plane node(s).

maxnitze avatar Jan 08 '25 18:01 maxnitze

We register at least one worker and one control node within a second of each other; we use Ansible to do this, so it's not a manual step for us. We have found that raising the idle timeout on the load balancer we have in front of the Rancher cluster has fixed the issue for us; a higher idle timeout seems to prevent the rancher-system-agent websocket from getting closed.
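
For reference, if Rancher is exposed through an nginx ingress rather than an external load balancer, the equivalent knob would be the proxy timeouts on that ingress. A hedged sketch, assuming the chart-created ingress named rancher in cattle-system:

# Raise the proxy read/send timeouts so the rancher-system-agent's long-lived
# websocket watch isn't dropped by an idle timeout
k -n cattle-system annotate ingress rancher --overwrite \
  nginx.ingress.kubernetes.io/proxy-read-timeout="1800" \
  nginx.ingress.kubernetes.io/proxy-send-timeout="1800"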

P-n-I avatar Jan 09 '25 10:01 P-n-I

I tried that too. I also installed the nodes using Ansible, so they join in the same second, basically.

We don't have any load balancer in front of Rancher, so there's nothing we could do there. Still, the machine-plan secret stays empty for all worker nodes :(

maxnitze avatar Jan 09 '25 10:01 maxnitze

Our Helm installer now provides the Kubernetes version to the charts it installs.

manno avatar Apr 02 '25 14:04 manno

Our Helm installer now provides the Kubernetes version to the charts it installs.

Not sure why this is relevant, or why the issue is now closed?

We have the same problem:

New workers hang joining the cluster. Their machine-plan Secret on the local cluster has no data, and the worker's systemd-managed agent just sits watching it indefinitely.

Also noted in #49806 under rancher/rancher

t80027t avatar May 08 '25 18:05 t80027t

Seeing this problem too. Worker machine-plan secrets on the local cluster have no data. Has anyone found other workarounds?

v1.31.9+rke2r1 
rancher-2.10-head

mondragonfx avatar Jun 18 '25 01:06 mondragonfx

Anyone got a solution for it? Same issue on my cluster: v1.31.5+rke2r1, Rancher 2.10.3.

yuanyuefeng avatar Oct 13 '25 08:10 yuanyuefeng