[BUG] Harvester after upgrade to v1.4.0-rc2 fails to bring back mcc-harvester bundle, harvester-network-controller-manager pods stuck in error and crashloop backoff
Describe the bug This likely will be encountered from v1.3.2 -> v1.4.0-rc2. But has been only currently experienced from:
- v1.4.0-rc1 -> v1.4.0-rc1
- v1.4.0-rc2 -> v1.4.0-rc2
Something expereience with in #3059 -> @starbops also experienced.
To Reproduce Steps to reproduce the behavior (pxe or baremetal):
- have a 2 node or greater cluster at v1.4.0-rc2
- upgrade to v1.4.0-rc2 (same version upgrade)
- allow upgrade to complete
- check bundles -> mcc-harvester isn't happy because harvester-network-controller-manager pods are stuck in error and backoff state
Expected behavior Upgrade to not break:
- mcc-harvester bundle
- replicaset harvester-network-controller-manager-uuid
- deployment harvester-network-controller-manager Not see errors in pods of replicaset for deployment harvester-network-controller-manager :
harvester-network-controller-manager-67dc895674-jpr7w F1009 23:24:47.313805 1 vlan.go:148] disable net.bridge.bridge-nf-call-iptables failed, error: open /proc/sys/net/bridge/bridge-n
f-call-iptables: read-only file system
harvester-network-controller-manager-67dc895674-pbnqj F1009 23:23:29.728257 1 vlan.go:148] disable net.bridge.bridge-nf-call-iptables failed, error: open /proc/sys/net/bridge/bridge-n
f-call-iptables: read-only file system
Support bundle supportbundle_4e913967-2061-4e17-8f4c-a8a1c7ca7ae7_2024-10-09T23-22-41Z.zip
Environment
- Harvester ISO version: v1.4.0-rc2
- Underlying Infrastructure: qemu\kvm or baremetal -> both replicatable
Additional context
This might relate to the change introduced in harvester/network-controller-harvester#83.
What's the output of sysctl net.bridge.bridge-nf-call-iptables?
How many times does the disable net.bridge.bridge-nf-call-iptables failed, error: open /proc/sys/net/bridge/bridge-n f-call-iptables: read-only file system show?
@mingshuoqiu - I unfortunately needed to use the hardware that the cluster was running. But the:
disable net.bridge.bridge-nf-call-iptables failed, error: open /proc/sys/net/bridge/bridge-n f-call-iptables: read-only file system
I believe showed once before the pod crashed and restarted? I can't recall exactly. But the deployment's replica set kept trying to start the pods, so the pods would continously try to come up, each time met with that error.
Normally, host-specific operations should be done by the agent pod, not the manager pod. So the error message appeared in the manager pods is just not right.
Note: Even in a newly provisioned v1.4.0-rc2 cluster, restarting the harvester-network-controller-manager deployment or simply killing off the pod causes the newly spawned pods to run into the CrashLoopBackOff state with the exact same error message.
Workaround: run sudo sysctl -w net.bridge.bridge-nf-call-iptables=0 on each node. The manager pods will be back in a short time.
You won't hit this issue for clusters upgrading from v1.3.2 and having at least one new cluster network created before the upgrade.
Pre Ready-For-Testing Checklist
-
[ ] If labeled: require/HEP Has the Harvester Enhancement Proposal PR submitted? The HEP PR is at:
-
[ ] Where is the reproduce steps/test steps documented? The reproduce steps/test steps are at:
-
[ ] Is there a workaround for the issue? If so, where is it documented? The workaround is at:
-
[ ] Have the backend code been merged (harvester, harvester-installer, etc) (including
backport-needed/*)? The PR is at:-
[ ] Does the PR include the explanation for the fix or the feature?
-
[ ] Does the PR include deployment change (YAML/Chart)? If so, where are the PRs for both YAML file and Chart? The PR for the YAML change is at: The PR for the chart change is at:
-
-
[ ] If labeled: area/ui Has the UI issue filed or ready to be merged? The UI issue/PR is at:
-
[ ] If labeled: require/doc, require/knowledge-base Has the necessary document PR submitted or merged? The documentation/KB PR is at:
-
[ ] If NOT labeled: not-require/test-plan Has the e2e test plan been merged? Have QAs agreed on the automation test case? If only test case skeleton w/o implementation, have you created an implementation issue?
- The automation skeleton PR is at:
- The automation test case PR is at:
-
[ ] If the fix introduces the code for backward compatibility Has a separate issue been filed with the label
release/obsolete-compatibility? The compatibility issue is filed at:
Automation e2e test issue: harvester/tests#1612
@mingshuoqiu thank you for the fix :smile: :+1:
Validated on Version: v1.4-1856e9dc-head w/ 2 nodes -> also upgrade to Version: v1.4-1856e9dc-head that there were no issues. Restarting the network controller manager deployment also doesn't show any issues in the logs. The bundles all look great after upgrade too :smile: .
This looks good :+1:
I'll go ahead and close this out :+1: