calico-node 3.27.2 fails to start on arm: libpcap.so.0.8: cannot open shared object file
We upgraded from Calico 3.27.0 to 3.27.2 due to https://github.com/projectcalico/calico/issues/8383, by upgrading the Tigera operator. Everything went smoothly except for calico-node on our arm servers: those pods go into CrashLoopBackOff with the following log entries:
Calico-node: error while loading shared libraries: libpcap.so.0.8: cannot open shared object file: No such file or directory
Calico node failed to start
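For anyone who wants to confirm the library really is absent from the arm64 image without touching a cluster, something like the following works (a sketch: it assumes Docker with arm64 emulation available and that the image ships a shell; the library paths are illustrative):

```shell
# Run the arm64 variant of the affected image and look for libpcap inside it.
# If the ls finds nothing, the fallback echo fires, matching the loader error
# seen in the calico-node logs.
docker run --rm --platform linux/arm64 --entrypoint sh calico/node:v3.27.2 \
  -c 'ls /usr/lib/libpcap* /usr/lib/*/libpcap* 2>/dev/null || echo "libpcap missing"'
```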
Your Environment
- Calico 3.27.2
- Kubernetes 1.29.1
This should be fixed by PR #8533, but unfortunately the fix came too late for v3.27.2 (the problem was already present in v3.27.1).
@hjiawei is there any workaround, such as using a newer container image for calico-node? If not, do you know if rolling back to 3.27.0 is safe?
Also waiting on this fix; not sure if you have a timeline for the v3.27.3 release?
Try to use it?
Same error on Fedora 39 on arm (the only available packages are libcap-ng, libcap.so.2, and libcap.so.2.48).
While I'm just beating a dead horse - this is also broken on Bottlerocket on ARM hosts.
Rolling back to 3.27.0 should be safe. 3.27.3 with the fix is expected in late March.
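For those asking how to roll back: if you installed via the Tigera operator manifest, a rollback sketch looks like the following (hedged: it assumes a manifest-based operator install; Helm users should downgrade the tigera-operator chart instead):

```shell
# Re-apply the v3.27.0 operator manifest; the operator will reconcile
# calico-node back down to the v3.27.0 images.
kubectl replace -f https://raw.githubusercontent.com/projectcalico/calico/v3.27.0/manifests/tigera-operator.yaml

# Watch the calico-node daemonset roll back. The calico-system namespace
# assumes an operator-managed install.
kubectl rollout status daemonset/calico-node -n calico-system
```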
adding myself to get notifications ^^
Waiting for a fix too, rolling back to 3.27.0 worked fine on Apple Silicon powered VM.
Spent hours trying to get Calico to work on ARM hosts, only to eventually find this issue and roll back to 3.27.0. This is a nasty bug!
Nice that this is in progress; adding myself for future notifications too. Working on Ubuntu arm64 after downgrading to 3.27.0.
This is high impact. Hope to see a release this week as promised. Thank you!
As an end-user: pulling 3.27.0 into my cluster worked while waiting for the regular dev-test-release cycle to complete, so aside from the troubleshooting this has zero impact for me. It would have been nice if the 3.27.2 release description had noted that there is a defect for arm64 and that the release shouldn't be used with arm64 clusters; I (along with others) could have saved some time. The devs could consider updating release notes for defects like this as a matter of practice (i.e. a "Known Issues" section with a link to anything that is breaking in the release).
I'm unlikely to adopt future builds aggressively and will be waiting for downstream projects to dogfood a release before adopting it. It's unclear whether the devs released this knowing there was a defect, or whether the testing in place is insufficient to catch such an obvious problem. The latter is at least correctable; the former creates trust issues.
As a developer: rushing is how mistakes are made, and I have no inclination to rush other developers. We get it when it's ready. :+1: Thanks to the devs for all their hard work on this project and related projects! :clap:
I disagree. This is a huge impact on new users. When a new user inevitably grabs the latest version for their clusters, they will tear their hair out for hours "why isn't my cluster working" on arm machines. Especially since this isn't documented in an appropriate place.
This isn't an issue with Calico alone, a lot of k8s ecosystem projects are like this. Currently, Rook's helm charts aren't working and it cost me 2 weeks to troubleshoot that (so far).
Generally, a lot of these projects are half-baked and not mature. Just an observation, not playing the blame game here.
Coming from the Java ecosystem where "everything works, even when always upgrading to latest" it just isn't the same with k8s ecosystem.
I strongly agree; it was pretty crazy on my side to only detect the issue in the cluster after a week.
Since this issue is very well known (there are several duplicate issues reporting the same problem), the documentation page should at least be updated to recommend installing version 3.27.0 and skipping the problematic version.
Our internal testing of ARM64 builds in particular has been minimal and we have largely been relying on community support for verification and maintenance of these builds until now. I can say that we are working on adding test environments running on ARM as we speak in order to prevent these types of issues in the future. It takes time and resources to build out full e2e runs for the various combinations of installer, cloud, distro, architecture, etc. So this will be an ongoing project to fill out the matrix. Thanks all for your help and patience to-date!
Our internal testing of ARM64 builds in particular has been minimal and we have largely been relying on community support for verification and maintenance of these builds until now
testing efficacy aside, the thing that would really help everyone: when a defect discovered in a release leaves that release unusable (i.e. no workaround other than to elect a different version) and a fix cannot be released within some reasonable time frame (you decide, but 24-48 hours seems fair IMO), the release page could be manually updated with a "NOTE: Breaking change" warning users and, if available, linking to the Issue(s) that discuss the break.
it's not an admission of fault or a promise to fix; it's a respectable warning to the community that they may want to elect a different version if the Issue would affect them (or their downstream users). it's better than taking a release down. it's better than combing through posts from frustrated users.
currently the release page for 3.27.2 makes no mention that it will not work with arm64 nodes, but this defect/Issue has been open for roughly a month. if the release notes for 3.27.2 had linked to this issue, myself and others could have saved time by electing a prior release and subscribing to this issue to wait for closure like normal people, and wouldn't have given it a second thought.
when i push new versions of critical components (and it doesn't get more critical than a CNI upgrade) i actually pull a couple nodes from my cluster (varying by OS version and platform architecture), reset them, and rejoin them, then verify logs and functionality. so i caught this defect literally within minutes of updating to 3.27.2 -- this is what good operators should be doing.

even so, it still took me several hours before finding this github issue to know what i could safely roll back to. customers that don't properly test, or have unreasonable expectations that everything in the world is going to be problem free, are likely going to be more frustrated than i was. as a matter of fact i wasn't frustrated by this at all, more "worried" that an irreversible change may have been made (not the case) and that I would be doing a full cluster rebuild on my weekend (the "big yeet" that never happened, thankfully.)
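The canary workflow described above can be sketched roughly as follows (the node name `arm-node-1` and the `calico-system` namespace are illustrative; the namespace assumes an operator-managed install):

```shell
# Take one representative node out of service before rolling the upgrade to it.
kubectl cordon arm-node-1
kubectl drain arm-node-1 --ignore-daemonsets --delete-emptydir-data

# ...apply the CNI upgrade to that node, then verify the calico-node pod...
POD=$(kubectl get pods -n calico-system -l k8s-app=calico-node \
  --field-selector spec.nodeName=arm-node-1 -o name)
kubectl logs -n calico-system "$POD" --tail=50

# Only return the node to service once logs and functionality check out.
kubectl uncordon arm-node-1
```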
anyway, thanks for reading my wall of text. i'll not respond again, i just wanted to impress the value/importance of warning users on the release page to avoid some grief.
EDIT: ++ @danudey since he looks uniquely invested in release-related doco.
IMHO, since the new release didn't go out this week as promised, the release / home page must carry a warning.
Calico v3.27.3 went out today after a minor delay. Thank you for your patience, and please let us know if this issue is resolved when you get a chance to test it out.
Thanks @caseydavenport I just upgraded my cluster and it works.
Fresh cluster installations with the new calico version work fine, too! Thanks a lot :)
It works now. I just found it a bit weird that the fix for this was apparently available 1.5 months ago, yet even though it was a critical issue it went out in a regular release instead of a hotfix.
I understand the frustration with the time to release here, and I 100% want it to be quicker. But as an engineering team, especially one working on a free open-source project with a large number of different users and use-cases, we have a variety of pressures on our time and priorities. We don't delay releases because we want to.