Problems with coredns timeouts and pods DNS resolution with bpf.masquerade enabled
Is there an existing issue for this?
- [X] I have searched the existing issues
What happened?
After enabling bpf.masquerade=true, CoreDNS starts timing out and other pods can't resolve anything.
Cilium Version
Client: 1.15.4 9b3f9a8c 2024-04-11T17:25:42-04:00 go version go1.21.9 linux/arm64
Daemon: 1.15.4 9b3f9a8c 2024-04-11T17:25:42-04:00 go version go1.21.9 linux/arm64
Kernel Version
Linux dev-control-plane 6.6.26-linuxkit #1 SMP Sat Apr 27 04:13:19 UTC 2024 aarch64 aarch64 aarch64 GNU/Linux
Kubernetes Version
Client Version: v1.30.0
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.29.2
Regression
No response
Sysdump
cilium-sysdump-20240512-205923.zip
Relevant log output
[ERROR] plugin/errors: 2 7792796240999121637.6686654646417607282. HINFO: read udp 10.42.0.9:57283->10.100.0.254:53: i/o timeout
[ERROR] plugin/errors: 2 7792796240999121637.6686654646417607282. HINFO: read udp 10.42.0.9:38103->10.100.0.254:53: i/o timeout
[ERROR] plugin/errors: 2 7792796240999121637.6686654646417607282. HINFO: read udp 10.42.0.9:53718->10.100.0.254:53: i/o timeout
[ERROR] plugin/errors: 2 7792796240999121637.6686654646417607282. HINFO: read udp 10.42.0.9:33906->10.100.0.254:53: i/o timeout
[ERROR] plugin/errors: 2 7792796240999121637.6686654646417607282. HINFO: read udp 10.42.0.9:34466->10.100.0.254:53: i/o timeout
[ERROR] plugin/errors: 2 7792796240999121637.6686654646417607282. HINFO: read udp 10.42.0.9:60107->10.100.0.254:53: i/o timeout
[ERROR] plugin/errors: 2 7792796240999121637.6686654646417607282. HINFO: read udp 10.42.0.9:34493->10.100.0.254:53: i/o timeout
[ERROR] plugin/errors: 2 7792796240999121637.6686654646417607282. HINFO: read udp 10.42.0.9:41721->10.100.0.254:53: i/o timeout
[ERROR] plugin/errors: 2 7792796240999121637.6686654646417607282. HINFO: read udp 10.42.0.9:38282->10.100.0.254:53: i/o timeout
[ERROR] plugin/errors: 2 7792796240999121637.6686654646417607282. HINFO: read udp 10.42.0.9:35967->10.100.0.254:53: i/o timeout
[ERROR] plugin/errors: 2 google.com. A: read udp 10.42.0.9:43732->10.100.0.254:53: i/o timeout
[ERROR] plugin/errors: 2 google.com. AAAA: read udp 10.42.0.9:45840->10.100.0.254:53: i/o timeout
[ERROR] plugin/errors: 2 google.com. A: read udp 10.42.0.9:33932->10.100.0.254:53: i/o timeout
[ERROR] plugin/errors: 2 google.com. AAAA: read udp 10.42.0.9:38568->10.100.0.254:53: i/o timeout
[ERROR] plugin/errors: 2 google.com. AAAA: read udp 10.42.0.9:38284->10.100.0.254:53: i/o timeout
[ERROR] plugin/errors: 2 google.com. A: read udp 10.42.0.9:45192->10.100.0.254:53: i/o timeout
[ERROR] plugin/errors: 2 google.com. AAAA: read udp 10.42.0.9:34840->10.100.0.254:53: i/o timeout
[ERROR] plugin/errors: 2 google.com. A: read udp 10.42.0.9:32915->10.100.0.254:53: i/o timeout
Output from a random pod:
nginx@test-5dd9d7b595-786r7:/$ curl google.com
curl: (6) Could not resolve host: google.com
I install Cilium with this:
helm upgrade --install cilium cilium/cilium \
--namespace kube-system \
--set cluster.name=$CLUSTER_NAME \
--set kubeProxyReplacement=true \
--set ipv4.enabled=true \
--set ipv6.enabled=false \
--set k8sServiceHost=$CLUSTER_NAME-control-plane \
--set k8sServicePort=6443 \
--set ipam.mode=cluster-pool \
--set ipam.operator.clusterPoolIPv4PodCIDRList="10.42.0.0/16" \
--set ipam.operator.clusterPoolIPv4MaskSize=24 \
--set k8s.requireIPv4PodCIDR=true \
--set autoDirectNodeRoutes=true \
--set routingMode=native \
--set endpointRoutes.enabled=true \
--set ipv4NativeRoutingCIDR="10.0.0.0/8" \
--set bpf.tproxy=true \
--set bpf.preallocateMaps=true \
--set bpf.hostLegacyRouting=false \
--set bpf.masquerade=true \
--set enableIPv4Masquerade=true \
--set encryption.enabled=true \
--set encryption.type=wireguard \
--set encryption.nodeEncryption=true \
--set encryption.strictMode.enabled=true \
--set encryption.strictMode.cidr="10.0.0.0/8" \
--set encryption.strictMode.allowRemoteNodeIdentities=true \
--set rollOutCiliumPods=true \
--set operator.rollOutPods=true
cilium status output:
root@dev-worker2:/home/cilium# cilium status
KVStore: Ok Disabled
Kubernetes: Ok 1.29 (v1.29.2) [linux/arm64]
Kubernetes APIs: ["EndpointSliceOrEndpoint", "cilium/v2::CiliumClusterwideNetworkPolicy", "cilium/v2::CiliumEndpoint", "cilium/v2::CiliumNetworkPolicy", "cilium/v2::CiliumNode", "cilium/v2alpha1::CiliumCIDRGroup", "core/v1::Namespace", "core/v1::Pods", "core/v1::Service", "networking.k8s.io/v1::NetworkPolicy"]
KubeProxyReplacement: True [eth0 172.18.0.2 fc00:f853:ccd:e793::2 fe80::42:acff:fe12:2 (Direct Routing)]
Host firewall: Disabled
SRv6: Disabled
CNI Chaining: none
Cilium: Ok 1.15.4 (v1.15.4-9b3f9a8c)
NodeMonitor: Listening for events on 8 CPUs with 64x4096 of shared memory
Cilium health daemon: Ok
IPAM: IPv4: 2/254 allocated from 10.42.2.0/24,
IPv4 BIG TCP: Disabled
IPv6 BIG TCP: Disabled
BandwidthManager: Disabled
Host Routing: BPF
Masquerading: BPF [eth0] 10.0.0.0/8 [IPv4: Enabled, IPv6: Disabled]
Controller Status: 18/18 healthy
Proxy Status: OK, ip 10.42.2.223, 0 redirects active on ports 10000-20000, Envoy: embedded
Global Identity Range: min 256, max 65535
Hubble: Ok Current/Max Flows: 137/4095 (3.35%), Flows/s: 1.83 Metrics: Disabled
Encryption: Wireguard [NodeEncryption: Enabled, cilium_wg0 (Pubkey: vQfrUsFvKKYFvplB8kScoY0EAl5F6YLRYkYB/DbILnw=, Port: 51871, Peers: 2)]
Cluster health: 3/3 reachable (2024-05-12T19:12:44Z)
Modules Health: Stopped(0) Degraded(0) OK(11) Unknown(3)
Anything else?
Everything works fine until bpf.masquerade is enabled. That setting alone triggers the issue; I tried a number of different configurations to confirm. My environment is the latest kind cluster running on Docker Desktop for Mac.
Cilium Users Document
- [ ] Are you a user of Cilium? Please add yourself to the Users doc
Code of Conduct
- [X] I agree to follow this project's Code of Conduct
I'm not able to reproduce this. I installed a kind cluster with bpf.masquerade and it works as expected.
Did you try changing this setting on a running cluster, or was it from scratch?
I create a cluster from scratch each time.
In your tests, are you able to resolve anything from a test pod, like the Alpine packages repo?
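For example, something quick like this (the image and hostname here are just what I'd use for a check):

```shell
# Throwaway pod that tries to resolve the Alpine package mirror.
kubectl run dns-test --rm -it --restart=Never --image=busybox:1.36 -- \
  nslookup dl-cdn.alpinelinux.org
```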
I tried with your exact setup (except on Linux) and it worked perfectly. There must be some kind of strange discrepancy; maybe macOS is the issue?
One strange thing I see is this line in cilium-dbg status:
Encryption: Wireguard [NodeEncryption: OptedOut, cilium_wg0 (Pubkey: XXX, Port: 51871, Peers: 2)]
whereas on my cluster, I see
Encryption: Wireguard [NodeEncryption: Enabled, cilium_wg0 (Pubkey: XXXX, Port: 51871, Peers: 1)]
Not sure if that's potentially an issue. What happens if you disable encryption?
I noticed the encryption status comes and goes as I make changes to the values file and apply them with helm upgrade. By default it's enabled and works fine.
I suspect the issue is a Mac thing as well, I'm just not sure how to debug it. I guess the setup is much more complex on Macs than on Linux because of Docker Desktop's underlying VM. It would be great to have some documentation covering that test case.
Yeah, at the end of the day, Docker on Mac is not really a supported platform; it's useful for development (and many Cilium developers use it!), but I'm not sure that we have the expertise to dig into these sorts of issues.
So last night I made some progress. Apparently BPF masquerading does work, but only if routing is switched from native to tunnel mode.
Let's say my setup has:
- Docker subnet CIDR: 10.100.0.0/24
- nodes CIDR: 172.18.0.0/24
- pods CIDR: 10.42.0.0/16
- services CIDR: 10.43.0.0/16
What would be the correct value for ipv4NativeRoutingCIDR?
Perhaps that's causing the issue on my end.
Most importantly, are bpf.masquerade and routingMode: native supposed to be used together?
So I ran into an article where, apparently, the CoreDNS ConfigMap needs a fixed nameserver instead of relying on /etc/resolv.conf (not sure why, though).
After I tried this, there were no more CoreDNS timeout errors and traffic flowed as expected.
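Roughly, the CoreDNS ConfigMap change looks like this; the 1.1.1.1 / 8.8.8.8 upstreams are just the ones I picked, and the rest is roughly the stock Corefile, the only relevant edit being the forward line that used to point at /etc/resolv.conf:

```yaml
# Sketch of the edited CoreDNS ConfigMap (kubectl -n kube-system edit configmap coredns);
# the only relevant change is the forward line, which no longer uses /etc/resolv.conf.
apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns
  namespace: kube-system
data:
  Corefile: |
    .:53 {
        errors
        health
        ready
        kubernetes cluster.local in-addr.arpa ip6.arpa {
            pods insecure
            fallthrough in-addr.arpa ip6.arpa
        }
        prometheus :9153
        forward . 1.1.1.1 8.8.8.8
        cache 30
        loop
        reload
        loadbalance
    }
```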
I used this config:
cluster:
  name: dev
kubeProxyReplacement: true
ipv4:
  enabled: true
ipv6:
  enabled: false
k8sServiceHost: dev-control-plane
k8sServicePort: 6443
ipam:
  mode: cluster-pool
  operator:
    clusterPoolIPv4PodCIDRList: "10.42.0.0/16" # Pods CIDR
    clusterPoolIPv4MaskSize: 24
k8s:
  requireIPv4PodCIDR: true
autoDirectNodeRoutes: true
routingMode: native
endpointRoutes:
  enabled: true
ipv4NativeRoutingCIDR: "10.42.0.0/16" # Pods CIDR
bpf:
  tproxy: true
  preallocateMaps: true
  hostLegacyRouting: false
  masquerade: true
ipMasqAgent:
  enabled: true
  config:
    nonMasqueradeCIDRs:
      - 10.42.0.0/16 # Pods CIDR
enableIPv4Masquerade: true
encryption:
  enabled: true
  type: wireguard
  nodeEncryption: true
  strictMode:
    enabled: true
    cidr: "10.42.0.0/16" # Pods CIDR
    allowRemoteNodeIdentities: true
externalIPs:
  enabled: true
nodePort:
  enabled: true
hostPort:
  enabled: true
hubble:
  enabled: true
  relay:
    enabled: true
    rollOutPods: true
  ui:
    enabled: true
    rollOutPods: true
rollOutCiliumPods: true
operator:
  rollOutPods: true
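I apply it with the usual Helm invocation, roughly:

```shell
# values.yaml holds the values shown above
helm upgrade --install cilium cilium/cilium \
  --namespace kube-system \
  -f values.yaml
```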
Other than this, I'd really appreciate it if someone could point out any conflicts or misconfiguration in the CIDRs I used in the chart values that I'm not aware of.
I have seen this problem and confirmed it. The problem is that Docker adds the following rules to iptables/nftables:
$ sudo iptables -t nat -S DOCKER_OUTPUT
-N DOCKER_OUTPUT
-A DOCKER_OUTPUT -d 172.18.0.1/32 -p tcp -m tcp --dport 53 -j DNAT --to-destination 127.0.0.11:40721
-A DOCKER_OUTPUT -d 172.18.0.1/32 -p udp -m udp --dport 53 -j DNAT --to-destination 127.0.0.11:38796
But when BPF masquerading is in effect, these rules are never hit.
$ sudo nft list table ip nat
...
chain DOCKER_OUTPUT {
    ip daddr 172.18.0.1 tcp dport 53 counter packets 0 bytes 0 dnat to 127.0.0.11:40721
    ip daddr 172.18.0.1 udp dport 53 counter packets 128 bytes 9338 dnat to 127.0.0.11:38796
}
Those 128 packets were generated by me, testing from the kind node. This problem is specific to Docker and its use of netfilter, which BPF host routing bypasses.
I've been able to get by with using a public resolver like 1.1.1.1 in the CoreDNS ConfigMap instead of forwarding everything to /etc/resolv.conf.
> But when BPF masquerading is in effect, these rules are never hit.
Could you check with bpf.hostLegacyRouting=true?
> I've been able to get by with using a public resolver like 1.1.1.1 in the CoreDNS ConfigMap instead of forwarding everything to /etc/resolv.conf.
You can do that, or you can change resolv.conf on the node and restart the CoreDNS pods. Either one works around the problem, but this should be handled better, probably by fixing the Docker DNS configuration (if possible) through the kind config.
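For the resolv.conf route on a kind cluster, that would look roughly like this (the node name and upstream resolver are only examples; repeat for every kind node):

```shell
# Replace the Docker-provided nameserver in the kind node's resolv.conf
# with a real upstream, then restart CoreDNS so it re-reads the file.
docker exec dev-control-plane sh -c 'echo "nameserver 8.8.8.8" > /etc/resolv.conf'
kubectl -n kube-system rollout restart deployment/coredns
```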
> Could you check with bpf.hostLegacyRouting=true?
It was observed when it was true. It's true right now, and:
root@dnsutils:/# dig @172.18.0.1 ipquail.com
;; communications error to 172.18.0.1#53: timed out
;; communications error to 172.18.0.1#53: timed out
@julianwiedmann look here: the CI tests for Cilium note this problem and hack around it: https://github.com/cilium/cilium/blob/main/contrib/scripts/kind.sh#L250-L255
This was originally documented in #23283 and was marked resolved by #30321, but that only fixes it for people using the Cilium CI scripts, as noted in #31118, so really it's not solved at all.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
I have the same problem, but with Talos and ARM servers.
> But when BPF masquerading is in effect, these rules are never hit.
> Could you check with bpf.hostLegacyRouting=true?
Good catch! That fixes the issue in my context: Talos on ARM hcloud VMs with these Cilium values:
values.yaml
prometheus: &prome
  enabled: false
  serviceMonitor:
    trustCRDsExist: true
    enabled: true
k8sServiceHost: 127.0.0.1
k8sServicePort: 7445
ipam:
  mode: kubernetes
routingMode: native
ipv4NativeRoutingCIDR: 10.0.0.0/16
loadBalancer:
  mode: dsr
bpf:
  hostLegacyRouting: true
  masquerade: true
envoy:
  enabled: true
  prometheus: *prome
encryption:
  enabled: true
  type: wireguard
  nodeEncryption: true
kubeProxyReplacement: true
localRedirectPolicy: true
operator:
  prometheus: *prome
  replicas: 2
hubble:
  relay:
    enabled: true
    prometheus: *prome
  ui:
    enabled: true
    rollOutPods: true
    podLabels:
      traefik.home.arpa/ingress: allow
  metrics:
    enableOpenMetrics: true
    enabled:
      - dns:query
      - drop
      - tcp
      - flow
      - port-distribution
      - icmp
      - http
resources: # for agent
  limits:
    memory: 1Gi
### required for Talos ###
securityContext:
  capabilities:
    ciliumAgent:
      - CHOWN
      - KILL
      - NET_ADMIN
      - NET_RAW
      - IPC_LOCK
      - SYS_ADMIN
      - SYS_RESOURCE
      - DAC_OVERRIDE
      - FOWNER
      - SETGID
      - SETUID
    cleanCiliumState: [NET_ADMIN, SYS_ADMIN, SYS_RESOURCE]
cgroup:
  autoMount:
    enabled: false
  hostRoot: /sys/fs/cgroup
logOptions:
  format: json
This is with a local-cache-dns LRP setup.
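Roughly, the LRP follows the standard node-local DNS cache pattern, something like this (the node-local-dns selector labels and port names below are the usual example values, not necessarily the exact manifest):

```yaml
# Sketch of a local redirect policy that sends kube-dns traffic to the
# node-local DNS cache pod on the same node (selector labels are assumptions).
apiVersion: cilium.io/v2
kind: CiliumLocalRedirectPolicy
metadata:
  name: nodelocaldns
  namespace: kube-system
spec:
  redirectFrontend:
    serviceMatcher:
      serviceName: kube-dns
      namespace: kube-system
  redirectBackend:
    localEndpointSelector:
      matchLabels:
        k8s-app: node-local-dns
    toPorts:
      - port: "53"
        name: dns
        protocol: UDP
      - port: "53"
        name: dns-tcp
        protocol: TCP
```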
Guys, I am experiencing this issue with a clean Talos Linux installation, 3 control plane nodes, hosted on Proxmox. I have public and private networks (eth0 / eth1 respectively).
Here's the configuration I am applying:
helm template cilium cilium/cilium \
--version 1.16.3 \
--namespace kube-system \
--set securityContext.capabilities.ciliumAgent="{CHOWN,KILL,NET_ADMIN,NET_RAW,IPC_LOCK,SYS_ADMIN,SYS_RESOURCE,DAC_OVERRIDE,FOWNER,SETGID,SETUID}" \
--set securityContext.capabilities.cleanCiliumState="{NET_ADMIN,SYS_ADMIN,SYS_RESOURCE}" \
--set cgroup.autoMount.enabled=false \
--set cgroup.hostRoot=/sys/fs/cgroup \
--set kubeProxyReplacement=true \
--set k8sServiceHost=127.0.0.1 \
--set k8sServicePort=7445 \
--set ipv4.enabled=true \
--set ipv4NativeRoutingCIDR="10.244.0.0/16" \
--set ipam.operator.clusterPoolIPv4PodCIDRList="10.244.0.0/16" \
--set devices="eth0 eth1" \
--set routingMode=native \
--set autoDirectNodeRoutes=true \
--set bpf.masquerade=true \
--set bpf.hostLegacyRouting=true \
--set bpf.datapathMode=veth \
--set enableIPv4Masquerade=true \
> cilium.yaml
I see DNS resolution failures from cilium connectivity test:
.[=] [cilium-test-1] Test [no-policies] [2/102]
...................
ℹ️ curl stdout:
:0 -> :0 = 000
ℹ️ curl stderr:
curl: (28) Resolving timed out after 2001 milliseconds
curl: (28) Resolving timed out after 2001 milliseconds
curl: (28) Resolving timed out after 2001 milliseconds
curl: (28) Resolving timed out after 2001 milliseconds
kubectl logs --namespace=kube-system -l k8s-app=kube-dns
[INFO] 10.244.0.139:55368 - 14666 "A IN one.one.one.one. udp 44 false 1232" - - 0 2.00174321s
[ERROR] plugin/errors: 2 one.one.one.one. A: read udp 10.244.0.220:60449->169.254.116.108:53: i/o timeout
[INFO] 10.244.0.139:49914 - 8235 "AAAA IN one.one.one.one. udp 44 false 1232" - - 0 2.000752971s
[ERROR] plugin/errors: 2 one.one.one.one. AAAA: read udp 10.244.0.220:39737->169.254.116.108:53: i/o timeout
[INFO] 10.244.0.139:49914 - 49060 "A IN one.one.one.one. udp 44 false 1232" - - 0 2.000909955s
[ERROR] plugin/errors: 2 one.one.one.one. A: read udp 10.244.0.220:37276->169.254.116.108:53: i/o timeout
[INFO] 10.244.0.139:49914 - 49060 "A IN one.one.one.one. udp 44 false 1232" - - 0 2.000824036s
[ERROR] plugin/errors: 2 one.one.one.one. A: read udp 10.244.0.220:40001->169.254.116.108:53: i/o timeout
[INFO] 10.244.0.139:49914 - 8235 "AAAA IN one.one.one.one. udp 44 false 1232" - - 0 2.001309189s
[ERROR] plugin/errors: 2 one.one.one.one. AAAA: read udp 10.244.0.220:37683->169.254.116.108:53: i/o timeout
[INFO] 10.244.0.139:35172 - 36517 "AAAA IN one.one.one.one. udp 44 false 1232" - - 0 2.000911343s
[ERROR] plugin/errors: 2 one.one.one.one. AAAA: read udp 10.244.0.22:35798->169.254.116.108:53: i/o timeout
[INFO] 10.244.0.139:53829 - 54548 "AAAA IN one.one.one.one. udp 44 false 1232" - - 0 2.000969259s
[ERROR] plugin/errors: 2 one.one.one.one. AAAA: read udp 10.244.0.22:56461->169.254.116.108:53: i/o timeout
[INFO] 10.244.0.139:53829 - 25924 "A IN one.one.one.one. udp 44 false 1232" - - 0 2.000747851s
[ERROR] plugin/errors: 2 one.one.one.one. A: read udp 10.244.0.22:44417->169.254.116.108:53: i/o timeout
[INFO] 10.244.0.139:53829 - 54548 "AAAA IN one.one.one.one. udp 44 false 1232" - - 0 2.001070964s
[ERROR] plugin/errors: 2 one.one.one.one. AAAA: read udp 10.244.0.22:49799->169.254.116.108:53: i/o timeout
[INFO] 10.244.0.139:53829 - 25924 "A IN one.one.one.one. udp 44 false 1232" - - 0 2.001301272s
[ERROR] plugin/errors: 2 one.one.one.one. A: read udp 10.244.0.22:55786->169.254.116.108:53: i/o timeout
Cilium tests are now working in my environment; however, it didn't work when I regenerated the YAML with bpf.hostLegacyRouting=true and applied it to a cluster that already had Cilium installed without this setting.
I have found that the only reliable way to test different Cilium settings is to restore the cluster from a backup taken before Cilium was installed, and then install Cilium fresh with the new settings. Whatever logic Cilium has for updating existing settings appears to behave differently from a fresh installation. Some settings appear to take effect after deleting the Cilium pods and letting Kubernetes recreate them; however, I have not found this approach to be reliable for all configuration changes.
Is this a known issue, or is it documented anywhere which configuration settings have been tested and confirmed to be applicable to an existing installation (happy to raise a new issue for this if you prefer)?
I think this is the underlying cause of my challenges: https://github.com/cilium/cilium/issues/29413
@rkerno did you restart all the cilium-agent pods after applying the modification?
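For example, after a helm upgrade, something like this should force the new configuration to be picked up (assuming the chart's default DaemonSet/Deployment names):

```shell
# Restart the agents and the operator so they reload the updated config;
# "cilium" and "cilium-operator" are the chart's default resource names.
kubectl -n kube-system rollout restart daemonset/cilium
kubectl -n kube-system rollout restart deployment/cilium-operator
kubectl -n kube-system rollout status daemonset/cilium --timeout=5m
```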
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.