
Rancher server logs are spammed when one of the nodes of a downstream RKE2 cluster is in `Reconciling` state (UI sends null)

Open dasarinaidu opened this issue 3 years ago • 18 comments

Rancher Server Setup

  • Rancher version: 2.7.0 upgraded to 2.7-head (2.7.2-rc6)
  • Installation option (Docker install/Helm Chart): Helm
    • If Helm Chart, Kubernetes Cluster and version (RKE1, RKE2, k3s, EKS, etc): RKE2 custom cluster
  • Proxy/Cert Details:

Information about the Cluster

  • Kubernetes version: v1.23.14+rke2r1 to v1.25.7+rke2r1
  • Cluster Type (Local/Downstream): Downstream Custom cluster(RKE2)

User Information

  • What is the role of the user logged in? (Admin/Cluster Owner/Cluster Member/Project Owner/Project Member/Custom)
    • If custom, define the set of permissions: Admin

Describe the bug

Sometimes, after performing an HA upgrade of Rancher from v2.7.0 to v2.7.2-rc6 while running an AWS RKE2 custom cluster, the cluster goes into an Updating state, the etcd nodes go into a Reconciling state, and the Rancher logs are spammed with the same INFO message over and over.

To Reproduce

  1. Create an HA Rancher server on v2.7.0
  2. Provision a downstream AWS RKE2 custom cluster (hardened) with Kubernetes version v1.23.14+rke2r1
  3. Perform an HA upgrade of Rancher to v2.7-head (v2.7.2-rc6)
  4. Verify the upgrade succeeds and the downstream cluster comes back to Active
  5. Upgrade the downstream cluster's Kubernetes version to v1.25.7+rke2r1 and SAVE
  6. Check the system behavior, logs, and cluster state

Result

  a. The downstream cluster initially comes back Active, then goes into an Updating state with an error and never returns to Active.
  b. Rancher logs are spammed with INFO messages; the same message repeats every second.
  c. One of the etcd nodes is in a Reconciling state and never recovers, with the following message in the UI: "Node condition MemoryPressure is Unknown. Node condition DiskPressure is Unknown. Node condition PIDPressure is Unknown. Node condition Ready is Unknown., waiting for probes: etcd, kubelet"

Expected Result

  a. Rancher logs should not be spammed with the same message every second.

**LOGS** (the same [INFO] entry repeats every few seconds; excerpt below)

2023/03/17 18:36:58 [INFO] [planner] rkecluster fleet-default/dascrke2: waiting: configuring bootstrap node(s) custom-d953bccd4557: Node condition MemoryPressure is Unknown. Node condition DiskPressure is Unknown. Node condition PIDPressure is Unknown. Node condition Ready is Unknown., waiting for probes: etcd, kubelet
2023/03/17 18:37:03 [INFO] [planner] rkecluster fleet-default/dascrke2: waiting: configuring bootstrap node(s) custom-d953bccd4557: Node condition MemoryPressure is Unknown. Node condition DiskPressure is Unknown. Node condition PIDPressure is Unknown. Node condition Ready is Unknown., waiting for probes: etcd, kubelet
2023/03/17 18:37:08 [INFO] [planner] rkecluster fleet-default/dascrke2: waiting: configuring bootstrap node(s) custom-d953bccd4557: Node condition MemoryPressure is Unknown. Node condition DiskPressure is Unknown. Node condition PIDPressure is Unknown. Node condition Ready is Unknown., waiting for probes: etcd, kubelet
[... identical entries continue, twice every 5 seconds, through 18:37:48 ...]
2023/03/17 18:37:51 [ERROR] error syncing 'fleet-default/custom-5e4c4afa8ddd': handler unmanaged-machine: the server was unable to return a response in the time allotted, but may still be processing the request (get nodes ip-172-31-2-93), requeuing

dasarinaidu avatar Mar 17 '23 18:03 dasarinaidu

This is a UI bug.

The root cause of the problem is the fact that we are delivering an invalid configuration to RKE2, namely specifying profile: null in machineGlobalConfig.

This is very easily reproducible: create a new v2prov cluster in the UI, click "Edit as YAML", and note that profile: null appears under machineGlobalConfig.

(Screenshot attached: 2023-03-20, 5:44:45 PM.)
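For illustration, the invalid payload looks roughly like the following sketch. The cluster name and neighboring keys are hypothetical; only the machineGlobalConfig / profile: null placement is taken from this issue:

```yaml
apiVersion: provisioning.cattle.io/v1
kind: Cluster
metadata:
  name: example-cluster      # hypothetical name
  namespace: fleet-default
spec:
  rkeConfig:
    machineGlobalConfig:
      cni: calico            # illustrative neighboring key
      profile: null          # invalid: RKE2 accepts no null profile value
```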

Oats87 avatar Mar 21 '23 00:03 Oats87

@nwmac @gaktive, this is likely a side effect of the recent changes to profile behavior made to accommodate Kubernetes 1.25, so it is possibly a regression we don't want in the upcoming 2.7.2 release. Marking as regression and release blocker.

snasovich avatar Mar 21 '23 02:03 snasovich

@snasovich Backend should ignore the null profile. If the configuration is invalid, backend should reject the request.

nwmac avatar Mar 21 '23 08:03 nwmac

/forwardport v2.7.next2

nwmac avatar Mar 21 '23 13:03 nwmac

@snasovich Backend should ignore the null profile. If the configuration is invalid, backend should reject the request.

@nwmac , I agree with this in principle, but it's still a change of the behavior on UI's part that we want to be reverted. I see it's already in test, thank you for a quick turn-around.

snasovich avatar Mar 21 '23 14:03 snasovich

https://github.com/rancher/rancher/issues/40942 - backend issue

Sahota1225 avatar Mar 21 '23 19:03 Sahota1225

Setting this ticket to Reopened until a Test Template can be provided.

cc @nwmac

jameson-mcghee avatar Mar 23 '23 20:03 jameson-mcghee

It was determined by management that this ticket does not need a Test Template to complete testing. Therefore I am setting this ticket back to QA Working.

jameson-mcghee avatar Mar 24 '23 14:03 jameson-mcghee

While testing an upgrade from v2.7.0 to v2.7-head (Commit ID: 41b2cf3), I did not observe the log-spam issue when following the Steps to Reproduce. Therefore I am closing this ticket as Done.

Test Steps:

  1. Create an HA Rancher server on v2.7.0
  2. Provision a hardened downstream RKE2 custom cluster with Kubernetes v1.23.14+rke2r1 (deprecated)
  3. Perform an HA upgrade of Rancher to v2.7-head
  4. Verify the upgrade succeeds and the downstream cluster comes back to Active
  5. Upgrade the downstream cluster's Kubernetes version to v1.25.7+rke2r1 and SAVE
  6. Check the system behavior, logs, and cluster state

Screenshots: (attached)

Rancher Logs: (attached)

jameson-mcghee avatar Mar 24 '23 16:03 jameson-mcghee

Reopening for some additional validation.

sowmyav27 avatar Mar 25 '23 15:03 sowmyav27

Ticket #8480 - Test Results - ✅

With Docker on a single-node instance of Rancher v2.7.0 => v2.7.2-rc6:

❗ Unable to reproduce ❗

Scenarios tested (all passed ✅):
  1. Provision a downstream RKE2 cluster with k8s 1.23 on v2.7.0
  2. Upgrade Rancher from v2.7.0 to v2.7.2-rc6, then upgrade the cluster's k8s version to 1.25; observe that the profile field of machineGlobalConfig is left out of the request
  3. Provision a downstream RKE2 cluster with k8s 1.25 on v2.7.2-rc6

Scenario 1:

  1. Fresh install of rancher v2.7.0
  2. Provision a downstream rke2 k8s 1.23 cluster
  3. Verified - cluster successfully provisions; logs not flooded

Scenario 2:

  1. Continuing from scenario 1 above, upgrade Rancher to v2.7.2-rc6
  2. Once upgraded to v2.7.2-rc6, upgrade the RKE2 cluster's k8s version to 1.25; with the browser's network tab open, observe that the profile field of machineGlobalConfig is left out of the request (since it is set to null)
  3. Verified - cluster successfully upgrades; logs not flooded; request accurately matches UI changes

Scenario 3:

  1. Fresh install of rancher v2.7.2-rc6
  2. Provision a downstream rke2 k8s 1.25 cluster
  3. Verified - cluster successfully provisions; logs not flooded

Josh-Diamond avatar Mar 28 '23 20:03 Josh-Diamond

Re-opening due to the following concerns:

  • profile: null is still present when navigating to Cluster Management > View YAML

The fix that went in, which did resolve the flooded-logs symptom this issue reported, removed profile: null from the request body sent by the UI. However, the field still persists and is visible in View YAML in the Rancher UI. This is unexpected. Re-opening to request that this field be hidden behind a # comment.

Josh-Diamond avatar Apr 05 '23 16:04 Josh-Diamond

To clarify here, I would expect that I can copy-paste the YAML that comes out of the UI's Edit as YAML (or View as YAML) interface and directly kubectl apply|create it in my cluster and end up with an identical result.

If the UI is selectively ignoring fields when marshalling the object and sending it over the wire, that is not ideal, although I suppose it does provide a resolution to the originally reported bug.

The behavior I would expect to see is:

Scenario A: If the user does not define a profile for their cluster and edits the cluster as YAML, there should be no profile key at all in machineGlobalConfig; or, if there is, it should be commented out like the rest of the fields that do not have explicitly set values (for example, see bucket in the screenshot here: https://github.com/rancher/dashboard/issues/8480#issuecomment-1477180460).

Scenario B: If the user defines a profile for their cluster, I would expect to see the profile: cis-1.5 entry under machineGlobalConfig
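A sketch of the two expected YAML outputs; the surrounding cni key is hypothetical, and only the profile handling is taken from this thread:

```yaml
# Scenario A: no profile defined -- key appears only as a comment
machineGlobalConfig:
  cni: calico            # illustrative neighboring key
  # profile: ''
---
# Scenario B: user selected a CIS profile -- value is emitted as-is
machineGlobalConfig:
  cni: calico            # illustrative neighboring key
  profile: cis-1.5
```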

There is no case in which null is a valid value for profile, but profile is a valid configuration parameter for RKE2. We do not have a mechanism to perform value validation for RKE2/K3s at this point (that is a feature enhancement on the K3s/RKE2 side), so on the backend we would have to hard-code some type of validation to ensure that null is not being delivered.

Oats87 avatar Apr 05 '23 16:04 Oats87

This isn't a blocker for 2.7.2, which is now in the 11th hour of its release, and the related backend ticket is also pushed out. Moving to Q2.

gaktive avatar Apr 06 '23 16:04 gaktive

The remaining work is to comment out the profile key when editing as YAML; this is currently driven off data that comes from the backend via the schema.

Bumping out of 2.7.next2; this requires further investigation and will require test automation.

nwmac avatar May 22 '23 13:05 nwmac

@Oats87 we should coordinate with you and/or your team about how UI & backend should handle this.

gaktive avatar Jun 08 '23 13:06 gaktive

cc @Sahota1225

gaktive avatar Jun 08 '23 13:06 gaktive

This may be something the UI is doing when trying to populate fields we expect, so we should scrub upon save.

gaktive avatar Sep 06 '23 15:09 gaktive