
Rancher server logs are spammed when one of the nodes of a downstream RKE2 cluster is in `Reconciling` state (UI sends null)

Open dasarinaidu opened this issue 3 years ago • 18 comments

Rancher Server Setup

  • Rancher version: 2.7.0 upgraded to 2.7-head (2.7.2-rc6)
  • Installation option (Docker install/Helm Chart): Helm
    • If Helm Chart, Kubernetes Cluster and version (RKE1, RKE2, k3s, EKS, etc): RKE2 custom cluster
  • Proxy/Cert Details:

Information about the Cluster

  • Kubernetes version: v1.23.14+rke2r1 to v1.25.7+rke2r1
  • Cluster Type (Local/Downstream): Downstream Custom cluster(RKE2)

User Information

  • What is the role of the user logged in? (Admin/Cluster Owner/Cluster Member/Project Owner/Project Member/Custom)
    • If custom, define the set of permissions: Admin

Describe the bug

Sometimes, after performing an HA upgrade of Rancher from v2.7.0 to v2.7.2-rc6 while running an AWS RKE2 custom cluster, the cluster goes into an Updating state, the etcd nodes go into a Reconciling state, and the Rancher logs are spammed with the same INFO message over and over.

To Reproduce

  1. Create an HA Rancher server on v2.7.0
  2. Provision a downstream AWS RKE2 custom cluster (hardened) with Kubernetes version v1.23.14+rke2r1
  3. Perform an HA upgrade of Rancher to v2.7-head (v2.7.2-rc6)
  4. Verify the upgrade succeeds and the downstream cluster comes back to Active
  5. Upgrade the downstream cluster's Kubernetes version to v1.25.7+rke2r1 and SAVE
  6. Check the system behavior, logs, and cluster state

Result

  a. The downstream cluster initially comes back Active, then goes into an Updating state with an error and never returns to Active.
  b. Rancher logs are spammed with INFO messages; the same message repeats every second.
  c. One of the etcd nodes is in a Reconciling state and never recovers, with the following message in the UI: "Node condition MemoryPressure is Unknown. Node condition DiskPressure is Unknown. Node condition PIDPressure is Unknown. Node condition Ready is Unknown., waiting for probes: etcd, kubelet"

Expected Result

  a. Rancher logs should not be spammed with the same message every second.

**LOGS** (the same [INFO] entry repeats every few seconds; excerpt below)

2023/03/17 18:36:58 [INFO] [planner] rkecluster fleet-default/dascrke2: waiting: configuring bootstrap node(s) custom-d953bccd4557: Node condition MemoryPressure is Unknown. Node condition DiskPressure is Unknown. Node condition PIDPressure is Unknown. Node condition Ready is Unknown., waiting for probes: etcd, kubelet
2023/03/17 18:37:03 [INFO] [planner] rkecluster fleet-default/dascrke2: waiting: configuring bootstrap node(s) custom-d953bccd4557: Node condition MemoryPressure is Unknown. Node condition DiskPressure is Unknown. Node condition PIDPressure is Unknown. Node condition Ready is Unknown., waiting for probes: etcd, kubelet
2023/03/17 18:37:08 [INFO] [planner] rkecluster fleet-default/dascrke2: waiting: configuring bootstrap node(s) custom-d953bccd4557: Node condition MemoryPressure is Unknown. Node condition DiskPressure is Unknown. Node condition PIDPressure is Unknown. Node condition Ready is Unknown., waiting for probes: etcd, kubelet
[... identical entries continue, twice every 5 seconds, through 18:37:48 ...]
2023/03/17 18:37:51 [ERROR] error syncing 'fleet-default/custom-5e4c4afa8ddd': handler unmanaged-machine: the server was unable to return a response in the time allotted, but may still be processing the request (get nodes ip-172-31-2-93), requeuing

dasarinaidu avatar Mar 17 '23 18:03 dasarinaidu

This is a UI bug.

The root cause of the problem is the fact that we are delivering an invalid configuration to RKE2, namely specifying profile: null in machineGlobalConfig.

This is very easily reproducible: create a new v2prov cluster in the UI, click "Edit as YAML", and note that profile: null appears under machineGlobalConfig.

(Screenshot attached: 2023-03-20, 5:44:45 PM.)
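For illustration, the invalid payload looks roughly like the following sketch. The cluster name and neighboring keys are hypothetical; only the machineGlobalConfig / profile: null placement is taken from this issue:

```yaml
apiVersion: provisioning.cattle.io/v1
kind: Cluster
metadata:
  name: example-cluster      # hypothetical name
  namespace: fleet-default
spec:
  rkeConfig:
    machineGlobalConfig:
      cni: calico            # illustrative neighboring key
      profile: null          # invalid: RKE2 accepts no null profile value
```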

Oats87 avatar Mar 21 '23 00:03 Oats87

@nwmac @gaktive, this is likely a side effect of the recent changes to profile behavior made to accommodate Kubernetes 1.25, so it is possibly a regression we don't want in the upcoming 2.7.2 release. Marking as regression and release blocker.

snasovich avatar Mar 21 '23 02:03 snasovich

@snasovich Backend should ignore the null profile. If the configuration is invalid, backend should reject the request.

nwmac avatar Mar 21 '23 08:03 nwmac

/forwardport v2.7.next2

nwmac avatar Mar 21 '23 13:03 nwmac

@snasovich Backend should ignore the null profile. If the configuration is invalid, backend should reject the request.

@nwmac , I agree with this in principle, but it's still a change of the behavior on UI's part that we want to be reverted. I see it's already in test, thank you for a quick turn-around.

snasovich avatar Mar 21 '23 14:03 snasovich

https://github.com/rancher/rancher/issues/40942 - backend issue

Sahota1225 avatar Mar 21 '23 19:03 Sahota1225

Setting this ticket to Reopened until a Test Template can be provided.

cc @nwmac

jameson-mcghee avatar Mar 23 '23 20:03 jameson-mcghee

It was determined by management that this ticket does not need a Test Template to complete testing. Therefore I am setting this ticket back to QA Working.

jameson-mcghee avatar Mar 24 '23 14:03 jameson-mcghee

While testing an upgrade from v2.7.0 to v2.7-head (Commit ID: 41b2cf3), I did not observe the log-spam issue when following the Steps to Reproduce. Therefore I am closing this ticket as Done.

Test Steps:

  1. Create an HA Rancher server on v2.7.0
  2. Provision a hardened downstream RKE2 custom cluster with Kubernetes v1.23.14+rke2r1 (deprecated)
  3. Perform an HA upgrade of Rancher to v2.7-head
  4. Verify the upgrade succeeds and the downstream cluster comes back to Active
  5. Upgrade the downstream cluster's Kubernetes version to v1.25.7+rke2r1 and SAVE
  6. Check the system behavior, logs, and cluster state

Screenshots: (attached)

Rancher Logs: (attached)

jameson-mcghee avatar Mar 24 '23 16:03 jameson-mcghee

Reopening for some additional validation.

sowmyav27 avatar Mar 25 '23 15:03 sowmyav27

Ticket #8480 - Test Results - ✅

With Docker on a single-node instance of Rancher v2.7.0 => v2.7.2-rc6:

❗ Unable to reproduce ❗

Scenarios tested (all passed ✅):
  1. Provision a downstream RKE2 cluster with k8s 1.23 on v2.7.0
  2. Upgrade Rancher from v2.7.0 to v2.7.2-rc6, then upgrade the cluster's k8s version to 1.25; observe that the profile field of machineGlobalConfig is left out of the request
  3. Provision a downstream RKE2 cluster with k8s 1.25 on v2.7.2-rc6

Scenario 1:

  1. Fresh install of rancher v2.7.0
  2. Provision a downstream rke2 k8s 1.23 cluster
  3. Verified - cluster successfully provisions; logs not flooded

Scenario 2:

  1. Continuing from scenario 1 above, upgrade Rancher to v2.7.2-rc6
  2. Once upgraded to v2.7.2-rc6, upgrade the RKE2 cluster's k8s version to 1.25; with the browser's network tab open, observe that the profile field of machineGlobalConfig is left out of the request (since it is set to null)
  3. Verified - cluster successfully upgrades; logs not flooded; request accurately matches UI changes

Scenario 3:

  1. Fresh install of rancher v2.7.2-rc6
  2. Provision a downstream rke2 k8s 1.25 cluster
  3. Verified - cluster successfully provisions; logs not flooded

Josh-Diamond avatar Mar 28 '23 20:03 Josh-Diamond

Re-opening due to the following concerns:

  • profile: null is still present when navigating to Cluster Management > View YAML

The fix that went in, which did resolve the flooded-logs symptom this issue reported, removed profile: null from the request body sent by the UI. However, the field still persists and is visible in View YAML in the Rancher UI. This is unexpected. Re-opening to request that this field be hidden behind a # comment.

Josh-Diamond avatar Apr 05 '23 16:04 Josh-Diamond

To clarify here, I would expect that I can copy-paste the YAML that comes out of the UI's Edit as YAML (or View as YAML) interface and directly kubectl apply|create it in my cluster and end up with an identical result.

If the UI is selectively ignoring fields when marshalling the object and sending it over the wire, that is not ideal, although I suppose it does provide a resolution to the originally reported bug.

The behavior I would expect to see is:

Scenario A: If the user does not define a profile for their cluster and edits the cluster as YAML, there should be no profile key at all in machineGlobalConfig; or, if there is, it should be commented out like the rest of the fields that do not have explicitly set values (for example, see bucket in the screenshot here: https://github.com/rancher/dashboard/issues/8480#issuecomment-1477180460).

Scenario B: If the user defines a profile for their cluster, I would expect to see the profile: cis-1.5 entry under machineGlobalConfig
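A sketch of the two expected YAML outputs; the surrounding cni key is hypothetical, and only the profile handling is taken from this thread:

```yaml
# Scenario A: no profile defined -- key appears only as a comment
machineGlobalConfig:
  cni: calico            # illustrative neighboring key
  # profile: ''
---
# Scenario B: user selected a CIS profile -- value is emitted as-is
machineGlobalConfig:
  cni: calico            # illustrative neighboring key
  profile: cis-1.5
```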

There is no case in which null is a valid value for profile, but profile is a valid configuration parameter for RKE2. We do not have a mechanism to perform value validation for RKE2/K3s at this point (that is a feature enhancement on the K3s/RKE2 side), so on the backend we would have to hard-code some type of validation to ensure that null is not being delivered.

Oats87 avatar Apr 05 '23 16:04 Oats87

This isn't a blocker for 2.7.2, which is now in the 11th hour of its release, and the related backend ticket is also pushed out. Moving to Q2.

gaktive avatar Apr 06 '23 16:04 gaktive

The remaining work is to comment out the profile key when editing as YAML; this is currently driven off data that comes from the backend via the schema.

Bumping out of 2.7.next2; this requires further investigation and will require test automation.

nwmac avatar May 22 '23 13:05 nwmac

@Oats87 we should coordinate with you and/or your team about how UI & backend should handle this.

gaktive avatar Jun 08 '23 13:06 gaktive

cc @Sahota1225

gaktive avatar Jun 08 '23 13:06 gaktive

This may be something the UI is doing when trying to populate fields we expect, so we should scrub upon save.

gaktive avatar Sep 06 '23 15:09 gaktive