UDP port in-use detection inaccurate
Is there an existing issue for this?
- [X] I have searched the existing issues
What happened?
Randomly, we see a cilium-agent (primarily 1.10, but I've found 1.8 agents hitting this as well) get stuck in CrashLoopBackoff logging:
level=info msg="Envoy: Starting xDS gRPC server listening on /var/run/cilium/xds.sock" subsys=envoy-manager
level=warning msg="Attempt to bind DNS Proxy failed, retrying in 4s" error="listen udp 0.0.0.0:33051: bind: address already in use" subsys=fqdn/dnsproxy
level=warning msg="Attempt to bind DNS Proxy failed, retrying in 4s" error="listen udp 0.0.0.0:33051: bind: address already in use" subsys=fqdn/dnsproxy
level=warning msg="Attempt to bind DNS Proxy failed, retrying in 4s" error="listen udp 0.0.0.0:33051: bind: address already in use" subsys=fqdn/dnsproxy
level=warning msg="Attempt to bind DNS Proxy failed, retrying in 4s" error="listen udp 0.0.0.0:33051: bind: address already in use" subsys=fqdn/dnsproxy
level=warning msg="Attempt to bind DNS Proxy failed, retrying in 4s" error="listen udp 0.0.0.0:33051: bind: address already in use" subsys=fqdn/dnsproxy
level=fatal msg="Error while creating daemon" error="listen udp 0.0.0.0:33051: bind: address already in use" subsys=daemon
From my investigation, I can see that on this host we have another process currently sending traces to the datadog-agent:
root@ip-1-2-3-4:~# lsof -i :33051
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
otherprocess 123456 nobody 8u IPv4 120034567 0t0 UDP localhost:33051->localhost:8125
In my review of the code, we do check /proc/net/udp to see if the port is in use, but with UDP I guess that's not enough, as 33051 isn't there:
root@ip-1-2-3-4:~# grep 33051 /proc/net/udp
root@ip-1-2-3-4:~#
I believe that, unlike TCP, any use of a UDP port makes it unavailable, and sadly /proc/net/udp appears to only contain listeners, not clients that were handed an ephemeral port.
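For illustration only (this is not Cilium's actual check), a minimal Go sketch of a bind-based test instead of parsing /proc/net/udp; attempting a real bind lets the kernel report the conflict, including connected client sockets like the one lsof shows above:

```go
// portcheck.go: illustrative sketch, not Cilium's implementation.
package main

import (
	"fmt"
	"net"
)

// udpPortFree reports whether a UDP port can currently be bound on all
// interfaces. The socket is closed immediately, so this is only a
// point-in-time check and can still race with other processes.
func udpPortFree(port int) bool {
	conn, err := net.ListenUDP("udp", &net.UDPAddr{IP: net.IPv4zero, Port: port})
	if err != nil {
		return false
	}
	conn.Close()
	return true
}

func main() {
	fmt.Println("33051 free:", udpPortFree(33051))
}
```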
I'm not following the logic entirely, but I'm surprised that on each restart the cilium-agent always tries to grab the same port in this condition. I can see it's a result of the way proxyPorts is used in https://github.com/cilium/cilium/blob/v1.10/pkg/proxy/proxy.go#L308-L325, but I'm not 100% sure where that state gets saved between restarts, or whether we just generate the same random port to ask for.
Cilium Version
1.10.12 and 1.8.7
Kernel Version
Linux ip-10-52-1-34 5.13.0-1023-aws #25~20.04.1-Ubuntu SMP Mon Apr 25 19:28:27 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
Kubernetes Version
v1.21.9
Sysdump
No response
Relevant log output
No response
Anything else?
I don't know what the best fix is here, but I have two thoughts:
- Update the used-port logic for UDP to also find clients (maybe there's another /proc file?)
- If we fail to listen on the port, instead of retrying that specific one, increment it randomly to get lucky on another port, or set it to 0 and let the OS pick for us (a rough sketch of this idea follows the list).
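As a rough sketch of the second idea (illustrative only, not Cilium code): try the preferred port, fall back to port 0 on failure so the kernel assigns a free ephemeral port, then read the chosen port back:

```go
// ephemeral.go: sketch of falling back to a kernel-assigned port.
package main

import (
	"fmt"
	"net"
)

// bindUDP tries the preferred port first and falls back to a
// kernel-assigned port on failure. It returns the open socket and the
// port that was actually bound.
func bindUDP(preferred int) (*net.UDPConn, int, error) {
	conn, err := net.ListenUDP("udp", &net.UDPAddr{IP: net.IPv4zero, Port: preferred})
	if err != nil {
		// Port 0 asks the kernel to pick any free port.
		conn, err = net.ListenUDP("udp", &net.UDPAddr{IP: net.IPv4zero, Port: 0})
		if err != nil {
			return nil, 0, err
		}
	}
	return conn, conn.LocalAddr().(*net.UDPAddr).Port, nil
}

func main() {
	conn, port, err := bindUDP(33051)
	if err != nil {
		panic(err)
	}
	defer conn.Close()
	fmt.Println("bound to port", port)
}
```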
Code of Conduct
- [X] I agree to follow this project's Code of Conduct
Thanks for the report! For (1) I'm not sure there's much we can do there. I believe that Cilium just attempts to listen on the same port, and if it's in use, the kernel will respond with the address being in use. However, I agree that suggestion (2) seems like a reasonable way to back off / resolve this conflict.
@joestringer - Yeah, no problem. I only noticed this since the agents were in CrashLoopBackoff throwing those fatal errors forever until I manually intervened. In case other folks hit this, I'm just going to use --tofqdns-proxy-port to force it to use a non-ephemeral, unused port in our infra.
@joestringer So we need to implement the approach from the second suggestion above: instead of retrying that specific port, increment it randomly to get lucky on another port, or set it to 0 and let the OS do it for us. Could you please provide some ideas or code pointers about where the changes need to be made?
Hey @jaredledvina, can I work on this issue?
Sure, I'm not working on a fix for this issue at this time.
@joestringer Could you please take a look at this comment?
I would start by grepping the codebase for dns or socket. Most of the code in Cilium lives under pkg/. From there you should be able to locate the code that is responsible for opening the DNS proxy ports. Note that there may be two sets of sockets - one to receive the DNS queries from the pods, and another one to send DNS queries to the upstream server. It sounds like this issue is about the first set of sockets.
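To illustrate the distinction between the two socket roles (a standalone sketch, not Cilium's actual proxy code; the upstream address is a placeholder):

```go
// twosockets.go: illustrative sketch of the two socket roles above.
package main

import "net"

func main() {
	// Socket set 1: the DNS proxy listens here for queries coming from
	// pods; this is the bind that fails with "address already in use"
	// in the logs above.
	listener, err := net.ListenUDP("udp", &net.UDPAddr{IP: net.IPv4zero, Port: 33051})
	if err != nil {
		panic(err)
	}
	defer listener.Close()

	// Socket set 2: per upstream query, the proxy opens a client socket
	// with an ephemeral source port to forward the query to the real
	// DNS server (the address here is purely a placeholder).
	upstream, err := net.DialUDP("udp", nil, &net.UDPAddr{IP: net.ParseIP("10.0.0.10"), Port: 53})
	if err != nil {
		panic(err)
	}
	defer upstream.Close()
}
```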
@joestringer Do we need to increment pp.proxyPort in every iteration of this for loop, to prevent retrying the same proxy port again and again?
https://github.com/cilium/cilium/blob/v1.10/pkg/proxy/proxy.go#L464
Hmm. Well, it will presumably depend on whether the user specified a proxy port via the configuration options or not. If the user configured a specific port, then they probably did that deliberately and it would be weird for this loop to choose a random other port. However, if the user did not specify a port then this seems reasonable.
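For illustration, a hypothetical sketch of what such a retry loop could look like (placeholder names and port range, not the actual change in pkg/proxy): keep retrying a user-pinned port as-is, but pick a fresh random port on each attempt when no port was configured.

```go
// retryport.go: hypothetical sketch of the discussed retry behaviour.
package main

import (
	"fmt"
	"math/rand"
	"net"
)

// allocateProxyPort retries the bind a few times. If the user pinned a
// port via configuration, keep retrying that exact port; otherwise pick
// a fresh random port from the range on every attempt.
func allocateProxyPort(userConfiguredPort, minPort, maxPort, attempts int) (*net.UDPConn, error) {
	var lastErr error
	for i := 0; i < attempts; i++ {
		port := userConfiguredPort
		if port == 0 {
			port = minPort + rand.Intn(maxPort-minPort+1)
		}
		conn, err := net.ListenUDP("udp", &net.UDPAddr{IP: net.IPv4zero, Port: port})
		if err == nil {
			return conn, nil
		}
		lastErr = err
	}
	return nil, fmt.Errorf("no free proxy port after %d attempts: %w", attempts, lastErr)
}

func main() {
	conn, err := allocateProxyPort(0, 10000, 20000, 10)
	if err != nil {
		panic(err)
	}
	defer conn.Close()
	fmt.Println("proxy port:", conn.LocalAddr().(*net.UDPAddr).Port)
}
```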
@joestringer Is this change reasonable? I increment pp.proxyPort only when we retry and option.Config.ToFQDNsProxyPort == 0:
https://github.com/NikhilSharmaWe/cilium/commit/f6d664b1c140290294c5986c8f10f75890fc0dad