calico
Calico node in unprivileged mode tries to write /proc/sys for accept_ra
When running calico-node unprivileged, we see tons of the following messages in the logs (tens of millions per day):
calico-node-dxmj5 calico-node 2022-01-02 12:08:38.007 [INFO][34] felix/endpoint_mgr.go 1179: Applying /proc/sys configuration to interface. ifaceName="enid6a600c4d5b"
calico-node-dxmj5 calico-node 2022-01-02 12:08:38.007 [WARNING][34] felix/endpoint_mgr.go 716: Failed to configure interface, will retry error=
Expected Behavior
- Calico node does not try to change the value of "/proc/sys/net/ipv6/conf/%s/accept_ra" when it is already zero.
- If changing the setting fails a warning is logged once.
Current Behavior
Millions of warnings are logged per day even when the setting is already correct.
Possible Solution
Felix endpoint_mgr should:
- only try to set accept_ra to 0 if it is not already 0 (which is true in our case), see https://github.com/projectcalico/calico/blob/master/felix/dataplane/linux/endpoint_mgr.go#L1169
- only log a warning or error once if it fails. There is no use in retrying, as it will always fail if the container doesn't have the privileges
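A minimal sketch of those two points, to make the proposal concrete. This is not Felix's actual code: `sysctlFS`, `ensurer`, `newEnsurer` and `ensure` are hypothetical names invented for illustration, and file access is injected so the logic can be tried without touching /proc. The interface name in `main` is the one from the log lines above.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// sysctlFS abstracts file access (hypothetical helper, not part of Felix).
type sysctlFS struct {
	read  func(path string) (string, error)
	write func(path, value string) error
}

// osFS is the real implementation backed by the os package.
var osFS = sysctlFS{
	read: func(path string) (string, error) {
		b, err := os.ReadFile(path)
		return string(b), err
	},
	write: func(path, value string) error {
		return os.WriteFile(path, []byte(value), 0o644)
	},
}

// ensurer remembers which paths could not be written, so that each path
// is warned about once and not retried.
type ensurer struct {
	fs     sysctlFS
	failed map[string]error
	warn   func(msg string)
}

func newEnsurer(fs sysctlFS, warn func(string)) *ensurer {
	return &ensurer{fs: fs, failed: map[string]error{}, warn: warn}
}

// ensure writes want to path only if the current value differs.
// It reports whether a write was attempted on this call.
func (e *ensurer) ensure(path, want string) (bool, error) {
	if prev, ok := e.failed[path]; ok {
		return false, prev // already failed once: no retry, no new warning
	}
	cur, err := e.fs.read(path)
	if err == nil && strings.TrimSpace(cur) == want {
		return false, nil // already correct, nothing to do
	}
	if err := e.fs.write(path, want); err != nil {
		e.failed[path] = err
		e.warn(fmt.Sprintf("cannot set %s to %s: %v", path, want, err))
		return true, err
	}
	return true, nil
}

func main() {
	e := newEnsurer(osFS, func(m string) { fmt.Fprintln(os.Stderr, "WARNING:", m) })
	path := fmt.Sprintf("/proc/sys/net/ipv6/conf/%s/accept_ra", "enid6a600c4d5b")
	// Calling twice shows that a failure is only warned about once.
	_, _ = e.ensure(path, "0")
	_, _ = e.ensure(path, "0")
}
```

Whether a failed path should be skipped forever or only for some time is a design choice; the sketch takes the simplest option.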
Steps to Reproduce (for bugs)
- Run calico-node unprivileged
Context
As a best practice we want to run containers with minimal privileges.
Your Environment
- Calico version: docker.io/calico/node:v3.21.2
- Orchestrator version (e.g. kubernetes, mesos, rkt): kubernetes
- Operating System and version: linux
- Link to your project (optional):
I see a related error regarding sysctl settings for IPv4 neighbor parameters in #5341, which would be fixed by some simple changes to error handling (also running in an unprivileged container). Looks like the logic around detecting unchangeable sysctls could use some changes. Recent Kubernetes has already done this (see feature gate KubeletInUserNamespace=true).
Now that I tested setting that parameter in one of my worker nodes, it works:
echo 0 | sudo tee /proc/sys/net/ipv6/conf/cali*/accept_ra
0
I am running the container as a Proxmox unprivileged container with "nested=1", which apparently makes some changes around /proc and /sys handling. Might be worth looking at how they are doing the nesting (maybe for a quick workaround).
Thanks for your reply. Unfortunately we don't have that setting available to us in a docker/kubernetes setup.
If the settings in configureInterface() are not critical, we could change the logic somewhat to only attempt to write them if the process has write access. There also seem to be some oddities from line 1171 onwards, where the code checks for IPv4 interfaces while writing IPv6 settings.
Looking at the logic around wlIfaceNamesToReconfigure on line 708 (https://github.com/projectcalico/calico/blob/a89955005f08bc4415674bd692f787649e6d9136/felix/dataplane/linux/endpoint_mgr.go#L1160), it could potentially be refactored somewhat into:
- check if interface exists
- check if a setting of the interface is writable (here my assumption is that they are either all writable or none)
- if not, log debug message with "cannot configure interface %s, /proc/sys is read only"
- if yes, call configureInterface
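Roughly, as a standalone sketch (not the Felix code itself): `probes`, `reconfigure`, `ifaceExists` and `sysctlWritable` are hypothetical names, and probing writability by opening the file write-only is my assumption about how such a check could be done. It also assumes IPv6 is enabled, since otherwise the /proc/sys/net/ipv6 directory is absent and the interface would be reported as missing.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// acceptRAPath builds the sysctl path mentioned in this issue.
func acceptRAPath(iface string) string {
	return filepath.Join("/proc/sys/net/ipv6/conf", iface, "accept_ra")
}

// ifaceExists checks for the interface's sysctl directory.
func ifaceExists(iface string) bool {
	_, err := os.Stat(filepath.Dir(acceptRAPath(iface)))
	return err == nil
}

// sysctlWritable opens the file write-only without writing anything.
func sysctlWritable(path string) bool {
	f, err := os.OpenFile(path, os.O_WRONLY, 0)
	if err != nil {
		return false
	}
	f.Close()
	return true
}

// probes lets the checks be swapped out (hypothetical helper type).
type probes struct {
	exists   func(iface string) bool
	writable func(path string) bool
}

const (
	resultMissing    = "interface missing"
	resultReadOnly   = "skipped, read only"
	resultConfigured = "configured"
	resultFailed     = "configure failed"
)

// reconfigure follows the four steps listed above.
func reconfigure(p probes, iface string, configure func(string) error, debug func(string)) string {
	if !p.exists(iface) {
		return resultMissing
	}
	if !p.writable(acceptRAPath(iface)) {
		debug(fmt.Sprintf("cannot configure interface %s, /proc/sys is read only", iface))
		return resultReadOnly
	}
	if err := configure(iface); err != nil {
		return resultFailed
	}
	return resultConfigured
}

func main() {
	real := probes{exists: ifaceExists, writable: sysctlWritable}
	// Stand-in for configureInterface: only writes accept_ra.
	configure := func(iface string) error {
		return os.WriteFile(acceptRAPath(iface), []byte("0"), 0o644)
	}
	debug := func(msg string) { fmt.Println("DEBUG:", msg) }
	fmt.Println(reconfigure(real, "enid6a600c4d5b", configure, debug))
}
```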
If this makes sense, I'm happy to try to provide a patch.
Might need some shaving, testing and love but this would be my general gist: https://github.com/projectcalico/calico/pull/5350
@schans Please can you explain your use case in more detail? calico-node needs certain privileges to do what it does, so running completely unprivileged isn't a useful option.
CC @lmm ; i think he was working on something similar.
@neiljerram Happy to elaborate. We would like to run calico-node as non-privileged if possible. We are using the tigera operator to deploy calico on EKS in AWS. We followed the documentation at https://projectcalico.docs.tigera.io/security/non-privileged and added the "nonPrivileged: enabled" setting to the installation crd.
Everything seems to be working fine, but we noticed a stream of errors/warnings in the logs as calico-node keeps trying to change some settings in /proc/sys for the network interfaces but doesn't have the permissions to do so. We managed to change the log level to error by setting "logSeverityScreen: Error" on the default felixconfiguration (maybe there should be a nicer way through the operator).
So maybe the first thing to analyze/discuss is whether with the restrictions specified in the documentation it is supported and makes sense to run non-privileged at all. The caveat mentioned in the documentation is a bit vague to be honest:
"The tradeoff for more security is the overhead of Calico networking management. For example, you no longer receive Calico corrections to misconfigurations caused by other components within your cluster, along with limited support for new features"
If running calico with these restrictions is a valid option, then it makes sense to address the issue I raised, as it looks like calico keeps trying to write to /proc/sys in a "loop" because it errors out.
For me it is difficult to judge how important the changes that calico tries to make to the interface configs in /proc/sys are for normal operations.
HTH!
Worth mentioning that we're only looking for the NetworkPolicies feature; IPAM is still handled by aws-vpc-cni.
Many thanks @schans @michalschott . @lmm Does this align with your work?
@neiljerram I wrote the original lines of code to disable accept_ra a while ago, but had a quick chat with @mgleung who worked on adding non-privileged support. Given #5341 we'll probably want something to handle setting sysctls more gracefully when running Calico non-privileged.
Thanks for the PR @schans , I'll take a look at that.
Hi all, and many thanks to @schans for providing a potential solution. I am facing the same issue (AWS EKS with AWS VPC CNI and non-privileged calico-node). @fasaxc and @caseydavenport Are there any updates regarding the PR? Could you elaborate on what is still missing?
From the latest comments, looks like that PR:
- Has some failing tests that need to be fixed up
- Needs a rebase to fix some conflicts
- Has some feedback from @fasaxc that needs to be addressed.
Ran into this issue as well using tigera-operator helm chart v3.24.1. It looks like that PR has been closed as 'working as expected', while we still see tons of these messages in the logs.
In my case I can confirm that all ifaces are already set to 0 under /proc/sys/net/ipv6/conf/eni-*/accept_ra.
My setup is EKS v1.23 based, networkpolicy only (IPAMD on aws-vpc-cni). Other than for the superfluous logging, everything seems to be working OK.
This issue is also one of the blockers for us (https://gardener.cloud/) in running calico-node in unprivileged mode by default.
Sorry, "working as expected" is probably an oversimplification of the issue. I believe we needed to address the issue of concurrent interface deletion for a fix to be accepted. Any takers?
Hello, this issue is a blocker for us as we're applying Pod Security Policies. Bump =)