harvester icon indicating copy to clipboard operation
harvester copied to clipboard

[BUG] Harvester after upgrade to v1.4.0-rc2 fails to bring back mcc-harvester bundle, harvester-network-controller-manager pods stuck in error and crashloop backoff

Open irishgordo opened this issue 1 year ago • 6 comments

Describe the bug This likely will be encountered from v1.3.2 -> v1.4.0-rc2. But has been only currently experienced from:

  • v1.4.0-rc1 -> v1.4.0-rc1
  • v1.4.0-rc2 -> v1.4.0-rc2

Something expereience with in #3059 -> @starbops also experienced.

To Reproduce Steps to reproduce the behavior (pxe or baremetal):

  1. have a 2 node or greater cluster at v1.4.0-rc2
  2. upgrade to v1.4.0-rc2 (same version upgrade)
  3. allow upgrade to complete
  4. check bundles -> mcc-harvester isn't happy because harvester-network-controller-manager pods are stuck in error and backoff state

Expected behavior Upgrade to not break:

  • mcc-harvester bundle
  • replicaset harvester-network-controller-manager-uuid
  • deployment harvester-network-controller-manager Not see errors in pods of replicaset for deployment harvester-network-controller-manager :
harvester-network-controller-manager-67dc895674-jpr7w F1009 23:24:47.313805       1 vlan.go:148] disable net.bridge.bridge-nf-call-iptables failed, error: open /proc/sys/net/bridge/bridge-n
f-call-iptables: read-only file system                                                                                                                                                       
harvester-network-controller-manager-67dc895674-pbnqj F1009 23:23:29.728257       1 vlan.go:148] disable net.bridge.bridge-nf-call-iptables failed, error: open /proc/sys/net/bridge/bridge-n
f-call-iptables: read-only file system        

Support bundle supportbundle_4e913967-2061-4e17-8f4c-a8a1c7ca7ae7_2024-10-09T23-22-41Z.zip

Environment

  • Harvester ISO version: v1.4.0-rc2
  • Underlying Infrastructure: qemu\kvm or baremetal -> both replicatable

Additional context Screenshot from 2024-10-09 16-21-22 Screenshot from 2024-10-09 16-19-47 Screenshot from 2024-10-09 16-19-10 Screenshot from 2024-10-09 16-18-48 Screenshot from 2024-10-09 16-18-32

irishgordo avatar Oct 09 '24 23:10 irishgordo

This might relate to the change introduced in harvester/network-controller-harvester#83.

starbops avatar Oct 11 '24 02:10 starbops

What's the output of sysctl net.bridge.bridge-nf-call-iptables? How many times does the disable net.bridge.bridge-nf-call-iptables failed, error: open /proc/sys/net/bridge/bridge-n f-call-iptables: read-only file system show?

mingshuoqiu avatar Oct 11 '24 02:10 mingshuoqiu

@mingshuoqiu - I unfortunately needed to use the hardware that the cluster was running. But the:

disable net.bridge.bridge-nf-call-iptables failed, error: open /proc/sys/net/bridge/bridge-n f-call-iptables: read-only file system

I believe showed once before the pod crashed and restarted? I can't recall exactly. But the deployment's replica set kept trying to start the pods, so the pods would continously try to come up, each time met with that error.

irishgordo avatar Oct 11 '24 02:10 irishgordo

Normally, host-specific operations should be done by the agent pod, not the manager pod. So the error message appeared in the manager pods is just not right.

starbops avatar Oct 11 '24 03:10 starbops

Note: Even in a newly provisioned v1.4.0-rc2 cluster, restarting the harvester-network-controller-manager deployment or simply killing off the pod causes the newly spawned pods to run into the CrashLoopBackOff state with the exact same error message.

starbops avatar Oct 11 '24 08:10 starbops

Workaround: run sudo sysctl -w net.bridge.bridge-nf-call-iptables=0 on each node. The manager pods will be back in a short time.


You won't hit this issue for clusters upgrading from v1.3.2 and having at least one new cluster network created before the upgrade.

starbops avatar Oct 15 '24 09:10 starbops

Pre Ready-For-Testing Checklist

  • [ ] If labeled: require/HEP Has the Harvester Enhancement Proposal PR submitted? The HEP PR is at:

  • [ ] Where is the reproduce steps/test steps documented? The reproduce steps/test steps are at:

  • [ ] Is there a workaround for the issue? If so, where is it documented? The workaround is at:

  • [ ] Have the backend code been merged (harvester, harvester-installer, etc) (including backport-needed/*)? The PR is at:

    • [ ] Does the PR include the explanation for the fix or the feature?

    • [ ] Does the PR include deployment change (YAML/Chart)? If so, where are the PRs for both YAML file and Chart? The PR for the YAML change is at: The PR for the chart change is at:

  • [ ] If labeled: area/ui Has the UI issue filed or ready to be merged? The UI issue/PR is at:

  • [ ] If labeled: require/doc, require/knowledge-base Has the necessary document PR submitted or merged? The documentation/KB PR is at:

  • [ ] If NOT labeled: not-require/test-plan Has the e2e test plan been merged? Have QAs agreed on the automation test case? If only test case skeleton w/o implementation, have you created an implementation issue?

    • The automation skeleton PR is at:
    • The automation test case PR is at:
  • [ ] If the fix introduces the code for backward compatibility Has a separate issue been filed with the label release/obsolete-compatibility? The compatibility issue is filed at:

Automation e2e test issue: harvester/tests#1612

@mingshuoqiu thank you for the fix :smile: :+1: Validated on Version: v1.4-1856e9dc-head w/ 2 nodes -> also upgrade to Version: v1.4-1856e9dc-head that there were no issues. Restarting the network controller manager deployment also doesn't show any issues in the logs. The bundles all look great after upgrade too :smile: . This looks good :+1: I'll go ahead and close this out :+1: Screenshot from 2024-10-21 17-56-28 Screenshot from 2024-10-21 17-54-34

irishgordo avatar Oct 22 '24 00:10 irishgordo