
calico-node 3.27.2 fails to start on arm: libpcap.so.0.8: cannot open shared object file

Open mzhaase opened this issue 2 years ago • 16 comments

We upgraded from calico 3.27.0 to 3.27.2 due to https://github.com/projectcalico/calico/issues/8383. We upgraded by upgrading the tigera operator. Everything went smoothly except for calico-node on our arm servers. They go into CrashLoopBackOff with the following log entries:

Calico-node: error while loading shared libraries: libpcap.so.0.8: cannot open shared object file: No such file or directory
Calico node failed to start
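For anyone triaging a similar loader error: `ldd` lists every shared object an ELF binary pulls in and whether each one resolves. A minimal sketch, using `/bin/sh` as a stand-in binary (inside the failing image you would point it at the `calico-node` binary instead):

```shell
# Sketch: diagnose "error while loading shared libraries" for an ELF binary.
# /bin/sh is just a stand-in; in the crashing image the target would be the
# calico-node binary. A broken dependency prints as "...  => not found".
BIN=/bin/sh
if ldd "$BIN" | grep -q 'not found'; then
  echo "missing shared libraries for $BIN"
else
  echo "all shared libraries for $BIN resolved"
fi
```

In the broken 3.27.2 arm image, the line for `libpcap.so.0.8` is the one that fails to resolve.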

Your Environment

  • Calico 3.27.2
  • Kubernetes 1.29.1

mzhaase avatar Feb 21 '24 17:02 mzhaase

This should be fixed by PR #8533, but unfortunately too late for v3.27.2 (was v3.27.1).

hjiawei avatar Feb 21 '24 20:02 hjiawei

@hjiawei is there any workaround using a newer container version for the node? If not, do you know if rolling back to 3.27.0 is safe?

mzhaase avatar Feb 22 '24 15:02 mzhaase

Also waiting on this fix; do you have a timeline for the v3.27.3 release?

pmcgrath-mck avatar Feb 22 '24 15:02 pmcgrath-mck

> @hjiawei is there any workaround using a newer container version for the node? If not, do you know if rolling back to 3.27.0 is safe?

Try to use it?

RyrieNorth avatar Feb 23 '24 07:02 RyrieNorth

Same error on Fedora 39 on arm (only available packages: libcap-ng, libcap.so.2, libcap.so.2.48).

Tcharl avatar Feb 23 '24 12:02 Tcharl

While I'm just beating a dead horse - this is also broken on Bottlerocket on ARM hosts.

diranged avatar Feb 23 '24 14:02 diranged

Rolling back to 3.27.0 should be safe. 3.27.3 with the fix is expected in late March.
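For operator-managed installs, a rollback sketch (the URL pattern is assumed from the calico repo's manifest layout, and the `kubectl` step is shown but not run here; verify against your own install method first):

```shell
# Sketch (not tested against a live cluster): operator-managed installs pin
# the Calico version to the tigera-operator manifest of a given release tag,
# so rolling back means re-applying the older manifest. URL pattern assumed
# from the projectcalico/calico repo layout:
CALICO_VERSION=v3.27.0
MANIFEST="https://raw.githubusercontent.com/projectcalico/calico/${CALICO_VERSION}/manifests/tigera-operator.yaml"
echo "$MANIFEST"
# then, against the cluster (not executed here):
#   kubectl replace -f "$MANIFEST"
```

Helm-based installs would instead pin the chart version of the release back to v3.27.0.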

matthewdupre avatar Mar 05 '24 00:03 matthewdupre

adding myself to get notifications ^^

tibeer avatar Mar 05 '24 07:03 tibeer

Waiting for a fix too; rolling back to 3.27.0 worked fine on an Apple Silicon powered VM.

> Rolling back to 3.27.0 should be safe. 3.27.3 with the fix is expected in late March.

tmtiwari avatar Mar 10 '24 05:03 tmtiwari

Spent hours trying to get Calico to work on ARM hosts, only to eventually find this and roll back to 3.27.0. This is a nasty bug!

fraserds avatar Mar 12 '24 17:03 fraserds

Nice that this is in progress; adding myself for future notifications too. Working on Ubuntu arm64 after downgrading to 3.27.0.

bensoille avatar Mar 15 '24 12:03 bensoille

This is high impact. Hope to see a release this week as promised. Thank you!

lprimak avatar Mar 25 '24 21:03 lprimak

As an end-user: pulling 3.27.0 into my cluster worked while waiting for the regular dev-test-release cycle to complete, so aside from troubleshooting this has zero impact for me. It would have been nice if the 3.27.2 release description had noted that there is a defect for arm64 and that the release shouldn't be used with arm64 clusters; I could have saved some time (along with others). The devs could consider updating release notes for defects like this as a matter of practice (i.e. a "Known Issues" section with a link to anything that is "breaking" in the release).

I'm unlikely to adopt future builds aggressively and will wait for downstream projects to dogfood a release before adopting it. It's unclear whether the devs released this knowing there was a defect, or whether the testing in place is insufficient to catch such an obvious problem. The latter is at least correctable; the former creates trust issues.

As a developer: rushing is how mistakes are made, and I have no inclination to rush other developers. We get it when it's ready. :+1: Thanks to the devs for all their hard work on this project and related projects! :clap:

wilson0x4d avatar Mar 27 '24 01:03 wilson0x4d

I disagree. This has a huge impact on new users. When a new user inevitably grabs the latest version for their cluster, they will tear their hair out for hours wondering "why isn't my cluster working" on arm machines, especially since this isn't documented in an appropriate place.

This isn't an issue with Calico alone; a lot of k8s ecosystem projects are like this. Currently, Rook's Helm charts aren't working, and that has cost me 2 weeks of troubleshooting (so far).

Generally, a lot of these projects are half-baked and not mature. Just an observation, not playing the blame game here.

Coming from the Java ecosystem, where "everything works, even when always upgrading to latest", it just isn't the same in the k8s ecosystem.

lprimak avatar Mar 27 '24 02:03 lprimak

> I disagree. This has a huge impact on new users. When a new user inevitably grabs the latest version for their cluster, they will tear their hair out for hours wondering "why isn't my cluster working" on arm machines, especially since this isn't documented in an appropriate place.
>
> This isn't an issue with Calico alone; a lot of k8s ecosystem projects are like this. Currently, Rook's Helm charts aren't working, and that has cost me 2 weeks of troubleshooting (so far).
>
> Generally, a lot of these projects are half-baked and not mature. Just an observation, not playing the blame game here.
>
> Coming from the Java ecosystem, where "everything works, even when always upgrading to latest", it just isn't the same in the k8s ecosystem.

I strongly agree; it was pretty crazy on my side to only detect the issue in the cluster after a week.

Since this is a very well-known issue, with several duplicate issues reporting the same problem, at the very least the documentation page should be updated to install version 3.27.0 and skip the problematic version.

josephrodriguez avatar Mar 27 '24 17:03 josephrodriguez

Our internal testing of ARM64 builds in particular has been minimal, and until now we have largely relied on community support for verification and maintenance of these builds. I can say that we are working on adding test environments running on ARM as we speak in order to prevent these types of issues in the future. It takes time and resources to build out full e2e runs for the various combinations of installer, cloud, distro, architecture, etc., so this will be an ongoing project to fill out the matrix. Thanks all for your help and patience to date!

caseydavenport avatar Mar 27 '24 18:03 caseydavenport

> Our internal testing of ARM64 builds in particular has been minimal and we have largely been relying on community support for verification and maintenance of these builds until now.

Testing efficacy aside, the thing that would really help everyone is this: when a defect discovered in a release leaves that release unusable (i.e. there is no workaround other than to elect a different version), and a fix cannot be released within some reasonable time frame (you decide, but 24-48 hours seems fair IMO), the release page could be manually updated with a "NOTE: breaking change" warning for users, linking, if available, to the issue(s) that discuss the break.

It's not an admission of fault or a promise to fix; it's a respectful warning to the community that they may want to elect a different version if the issue would affect them (or their downstream users). It's better than taking a release down, and better than combing through posts from frustrated users.

Currently the release page for 3.27.2 makes no mention that it will not work with arm64 nodes, yet this defect/issue has been open for roughly a month. If the release notes for 3.27.2 had linked to this issue, I and others could have saved time by electing a prior release and subscribing to this issue to wait for closure like normal people, and wouldn't have given it a second thought.

When I push new versions of critical components (and it doesn't get more critical than a CNI upgrade), I actually pull a couple of nodes from my cluster (varying by OS version and platform architecture), reset them, rejoin them, and then verify logs and functionality. So I caught this defect literally within minutes of updating to 3.27.2; this is what good operators should be doing. Even so, it still took me several hours before finding this GitHub issue to know what I could safely roll back to. Customers that don't properly test, or that have unreasonable expectations that everything in the world is going to be problem-free, are likely to be more frustrated than I was. As a matter of fact, I wasn't frustrated by this at all, more "worried" that an irreversible change may have been made (not the case) and that I would be doing a full cluster rebuild on my weekend (the "big yeet" that never happened, thankfully).

Anyway, thanks for reading my wall of text. I'll not respond again; I just wanted to impress the value/importance of warning users on the release page to avoid some grief.

EDIT: ++ @danudey since he looks uniquely invested in release-related doco.

wilson0x4d avatar Mar 31 '24 02:03 wilson0x4d

IMHO, since the new release wasn't done this week as promised, the release/home page must have a warning.

lprimak avatar Mar 31 '24 02:03 lprimak

Calico v3.27.3 went out today after a minor delay. Thank you for your patience, and please let us know if this issue is resolved when you get a chance to test it out.

caseydavenport avatar Apr 02 '24 21:04 caseydavenport

Thanks @caseydavenport, I just upgraded my cluster and it works.

lprimak avatar Apr 03 '24 01:04 lprimak

Fresh cluster installations with the new calico version work fine, too! Thanks a lot :)

tibeer avatar Apr 03 '24 05:04 tibeer

It works now. I just found it a bit weird that the fix was apparently available 1.5 months ago, yet even though this was a critical issue, it went out in a regular release instead of a hotfix.

mzhaase avatar Apr 03 '24 14:04 mzhaase

I understand the frustration with the time to release here, and I 100% want it to be quicker. But as an engineering team, especially one working on a free open-source project with a large number of different users and use-cases, we have a variety of pressures on our time and priorities. We don't delay releases because we want to.

caseydavenport avatar Apr 03 '24 16:04 caseydavenport