Problems with coredns timeouts and pods DNS resolution with bpf.masquerade enabled
Is there an existing issue for this?
- [X] I have searched the existing issues
What happened?
After enabling bpf.masquerade=true, CoreDNS starts timing out and other pods can't resolve anything.
Cilium Version
Client: 1.15.4 9b3f9a8c 2024-04-11T17:25:42-04:00 go version go1.21.9 linux/arm64
Daemon: 1.15.4 9b3f9a8c 2024-04-11T17:25:42-04:00 go version go1.21.9 linux/arm64
Kernel Version
Linux dev-control-plane 6.6.26-linuxkit #1 SMP Sat Apr 27 04:13:19 UTC 2024 aarch64 aarch64 aarch64 GNU/Linux
Kubernetes Version
Client Version: v1.30.0
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.29.2
Regression
No response
Sysdump
cilium-sysdump-20240512-205923.zip
Relevant log output
[ERROR] plugin/errors: 2 7792796240999121637.6686654646417607282. HINFO: read udp 10.42.0.9:57283->10.100.0.254:53: i/o timeout
[ERROR] plugin/errors: 2 7792796240999121637.6686654646417607282. HINFO: read udp 10.42.0.9:38103->10.100.0.254:53: i/o timeout
[ERROR] plugin/errors: 2 7792796240999121637.6686654646417607282. HINFO: read udp 10.42.0.9:53718->10.100.0.254:53: i/o timeout
[ERROR] plugin/errors: 2 7792796240999121637.6686654646417607282. HINFO: read udp 10.42.0.9:33906->10.100.0.254:53: i/o timeout
[ERROR] plugin/errors: 2 7792796240999121637.6686654646417607282. HINFO: read udp 10.42.0.9:34466->10.100.0.254:53: i/o timeout
[ERROR] plugin/errors: 2 7792796240999121637.6686654646417607282. HINFO: read udp 10.42.0.9:60107->10.100.0.254:53: i/o timeout
[ERROR] plugin/errors: 2 7792796240999121637.6686654646417607282. HINFO: read udp 10.42.0.9:34493->10.100.0.254:53: i/o timeout
[ERROR] plugin/errors: 2 7792796240999121637.6686654646417607282. HINFO: read udp 10.42.0.9:41721->10.100.0.254:53: i/o timeout
[ERROR] plugin/errors: 2 7792796240999121637.6686654646417607282. HINFO: read udp 10.42.0.9:38282->10.100.0.254:53: i/o timeout
[ERROR] plugin/errors: 2 7792796240999121637.6686654646417607282. HINFO: read udp 10.42.0.9:35967->10.100.0.254:53: i/o timeout
[ERROR] plugin/errors: 2 google.com. A: read udp 10.42.0.9:43732->10.100.0.254:53: i/o timeout
[ERROR] plugin/errors: 2 google.com. AAAA: read udp 10.42.0.9:45840->10.100.0.254:53: i/o timeout
[ERROR] plugin/errors: 2 google.com. A: read udp 10.42.0.9:33932->10.100.0.254:53: i/o timeout
[ERROR] plugin/errors: 2 google.com. AAAA: read udp 10.42.0.9:38568->10.100.0.254:53: i/o timeout
[ERROR] plugin/errors: 2 google.com. AAAA: read udp 10.42.0.9:38284->10.100.0.254:53: i/o timeout
[ERROR] plugin/errors: 2 google.com. A: read udp 10.42.0.9:45192->10.100.0.254:53: i/o timeout
[ERROR] plugin/errors: 2 google.com. AAAA: read udp 10.42.0.9:34840->10.100.0.254:53: i/o timeout
[ERROR] plugin/errors: 2 google.com. A: read udp 10.42.0.9:32915->10.100.0.254:53: i/o timeout
Output from a random pod:
nginx@test-5dd9d7b595-786r7:/$ curl google.com
curl: (6) Could not resolve host: google.com
I install Cilium with this:
helm upgrade --install cilium cilium/cilium \
--namespace kube-system \
--set cluster.name=$CLUSTER_NAME \
--set kubeProxyReplacement=true \
--set ipv4.enabled=true \
--set ipv6.enabled=false \
--set k8sServiceHost=$CLUSTER_NAME-control-plane \
--set k8sServicePort=6443 \
--set ipam.mode=cluster-pool \
--set ipam.operator.clusterPoolIPv4PodCIDRList="10.42.0.0/16" \
--set ipam.operator.clusterPoolIPv4MaskSize=24 \
--set k8s.requireIPv4PodCIDR=true \
--set autoDirectNodeRoutes=true \
--set routingMode=native \
--set endpointRoutes.enabled=true \
--set ipv4NativeRoutingCIDR="10.0.0.0/8" \
--set bpf.tproxy=true \
--set bpf.preallocateMaps=true \
--set bpf.hostLegacyRouting=false \
--set bpf.masquerade=true \
--set enableIPv4Masquerade=true \
--set encryption.enabled=true \
--set encryption.type=wireguard \
--set encryption.nodeEncryption=true \
--set encryption.strictMode.enabled=true \
--set encryption.strictMode.cidr="10.0.0.0/8" \
--set encryption.strictMode.allowRemoteNodeIdentities=true \
--set rollOutCiliumPods=true \
--set operator.rollOutPods=true
cilium status output:
root@dev-worker2:/home/cilium# cilium status
KVStore: Ok Disabled
Kubernetes: Ok 1.29 (v1.29.2) [linux/arm64]
Kubernetes APIs: ["EndpointSliceOrEndpoint", "cilium/v2::CiliumClusterwideNetworkPolicy", "cilium/v2::CiliumEndpoint", "cilium/v2::CiliumNetworkPolicy", "cilium/v2::CiliumNode", "cilium/v2alpha1::CiliumCIDRGroup", "core/v1::Namespace", "core/v1::Pods", "core/v1::Service", "networking.k8s.io/v1::NetworkPolicy"]
KubeProxyReplacement: True [eth0 172.18.0.2 fc00:f853:ccd:e793::2 fe80::42:acff:fe12:2 (Direct Routing)]
Host firewall: Disabled
SRv6: Disabled
CNI Chaining: none
Cilium: Ok 1.15.4 (v1.15.4-9b3f9a8c)
NodeMonitor: Listening for events on 8 CPUs with 64x4096 of shared memory
Cilium health daemon: Ok
IPAM: IPv4: 2/254 allocated from 10.42.2.0/24,
IPv4 BIG TCP: Disabled
IPv6 BIG TCP: Disabled
BandwidthManager: Disabled
Host Routing: BPF
Masquerading: BPF [eth0] 10.0.0.0/8 [IPv4: Enabled, IPv6: Disabled]
Controller Status: 18/18 healthy
Proxy Status: OK, ip 10.42.2.223, 0 redirects active on ports 10000-20000, Envoy: embedded
Global Identity Range: min 256, max 65535
Hubble: Ok Current/Max Flows: 137/4095 (3.35%), Flows/s: 1.83 Metrics: Disabled
Encryption: Wireguard [NodeEncryption: Enabled, cilium_wg0 (Pubkey: vQfrUsFvKKYFvplB8kScoY0EAl5F6YLRYkYB/DbILnw=, Port: 51871, Peers: 2)]
Cluster health: 3/3 reachable (2024-05-12T19:12:44Z)
Modules Health: Stopped(0) Degraded(0) OK(11) Unknown(3)
Anything else?
Everything works fine until bpf.masquerade is enabled. That setting alone triggers the issue; I tried a number of different configurations to confirm. My environment is the latest kind cluster running on Docker Desktop for Mac.
Cilium Users Document
- [ ] Are you a user of Cilium? Please add yourself to the Users doc
Code of Conduct
- [X] I agree to follow this project's Code of Conduct
I'm not able to reproduce this. I installed a kind cluster with bpf.masquerade and it works as expected.
Did you try changing this setting on a running cluster, or was it from scratch?
I create a cluster from scratch each time.
In your tests, are you able to resolve anything from a test pod, like the Alpine packages repo?
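For example, something quick like this (the image and hostname here are just what I'd use for a check):

```shell
# Throwaway pod that tries to resolve the Alpine package mirror.
kubectl run dns-test --rm -it --restart=Never --image=busybox:1.36 -- \
  nslookup dl-cdn.alpinelinux.org
```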
I tried with your exact setup (except on Linux) and it worked perfectly. There must be some kind of strange discrepancy; maybe macOS is the issue?
One strange thing I see is this line in cilium-dbg status:
Encryption: Wireguard [NodeEncryption: OptedOut, cilium_wg0 (Pubkey: XXX, Port: 51871, Peers: 2)]
whereas on my cluster, I see
Encryption: Wireguard [NodeEncryption: Enabled, cilium_wg0 (Pubkey: XXXX, Port: 51871, Peers: 1)]
Not sure if that's potentially an issue. What happens if you disable encryption?
I noticed the encryption status comes and goes as I make changes to the values file and apply them with helm upgrade. By default it's enabled and works fine.
I suspect the issue is a Mac thing as well, I'm just not sure how to debug it. I guess the setup is much more complex on Macs than on Linux because of Docker Desktop's underlying VM. It would be great to have some documentation covering that test case.
Yeah, at the end of the day, Docker on Mac is not really a supported platform; it's useful for development (and many Cilium developers use it!), but I'm not sure that we have the expertise to dig into these sorts of issues.
So last night I made some progress. Apparently BPF masquerading does work, but only if routing is switched from native to tunnel mode.
Let's say my setup has:
- Docker subnet CIDR: 10.100.0.0/24
- nodes CIDR: 172.18.0.0/24
- pods CIDR: 10.42.0.0/16
- services CIDR: 10.43.0.0/16
What would be the correct value for ipv4NativeRoutingCIDR?
Perhaps that's causing the issue on my end.
Most importantly, are bpf.masquerade and routingMode: native supposed to be used together?
So I ran into an article where, apparently, the CoreDNS ConfigMap needs a fixed nameserver instead of relying on /etc/resolv.conf (not sure why, though).
After I tried this, there were no more CoreDNS timeout errors and traffic flowed as expected.
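Roughly, the CoreDNS ConfigMap change looks like this; the 1.1.1.1 / 8.8.8.8 upstreams are just the ones I picked, and the rest is roughly the stock Corefile, the only relevant edit being the forward line that used to point at /etc/resolv.conf:

```yaml
# Sketch of the edited CoreDNS ConfigMap (kubectl -n kube-system edit configmap coredns);
# the only relevant change is the forward line, which no longer uses /etc/resolv.conf.
apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns
  namespace: kube-system
data:
  Corefile: |
    .:53 {
        errors
        health
        ready
        kubernetes cluster.local in-addr.arpa ip6.arpa {
            pods insecure
            fallthrough in-addr.arpa ip6.arpa
        }
        prometheus :9153
        forward . 1.1.1.1 8.8.8.8
        cache 30
        loop
        reload
        loadbalance
    }
```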
I used this config:
cluster:
  name: dev
kubeProxyReplacement: true
ipv4:
  enabled: true
ipv6:
  enabled: false
k8sServiceHost: dev-control-plane
k8sServicePort: 6443
ipam:
  mode: cluster-pool
  operator:
    clusterPoolIPv4PodCIDRList: "10.42.0.0/16" # Pods CIDR
    clusterPoolIPv4MaskSize: 24
k8s:
  requireIPv4PodCIDR: true
autoDirectNodeRoutes: true
routingMode: native
endpointRoutes:
  enabled: true
ipv4NativeRoutingCIDR: "10.42.0.0/16" # Pods CIDR
bpf:
  tproxy: true
  preallocateMaps: true
  hostLegacyRouting: false
  masquerade: true
ipMasqAgent:
  enabled: true
  config:
    nonMasqueradeCIDRs:
      - 10.42.0.0/16 # Pods CIDR
enableIPv4Masquerade: true
encryption:
  enabled: true
  type: wireguard
  nodeEncryption: true
  strictMode:
    enabled: true
    cidr: "10.42.0.0/16" # Pods CIDR
    allowRemoteNodeIdentities: true
externalIPs:
  enabled: true
nodePort:
  enabled: true
hostPort:
  enabled: true
hubble:
  enabled: true
  relay:
    enabled: true
    rollOutPods: true
  ui:
    enabled: true
    rollOutPods: true
rollOutCiliumPods: true
operator:
  rollOutPods: true
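I apply it with the usual Helm invocation, roughly:

```shell
# values.yaml holds the values shown above
helm upgrade --install cilium cilium/cilium \
  --namespace kube-system \
  -f values.yaml
```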
Other than this, I'd really appreciate it if someone could point out any conflicts or misconfiguration in the CIDRs I used in the chart values that I'm not aware of.
I have seen this problem and confirmed it. The problem is that Docker adds the following rules to iptables/nftables:
$ sudo iptables -t nat -S DOCKER_OUTPUT
-N DOCKER_OUTPUT
-A DOCKER_OUTPUT -d 172.18.0.1/32 -p tcp -m tcp --dport 53 -j DNAT --to-destination 127.0.0.11:40721
-A DOCKER_OUTPUT -d 172.18.0.1/32 -p udp -m udp --dport 53 -j DNAT --to-destination 127.0.0.11:38796
But when BPF masquerading is in effect, these rules are never hit.
$ sudo nft list table ip nat
...
chain DOCKER_OUTPUT {
    ip daddr 172.18.0.1 tcp dport 53 counter packets 0 bytes 0 dnat to 127.0.0.11:40721
    ip daddr 172.18.0.1 udp dport 53 counter packets 128 bytes 9338 dnat to 127.0.0.11:38796
}
Those 128 packets were generated by me, testing from the kind node. This problem is specific to Docker and its use of netfilter, which BPF host routing bypasses.
I've been able to get by with using a public resolver like 1.1.1.1 in the CoreDNS ConfigMap instead of forwarding everything to /etc/resolv.conf.
> But when BPF masquerading is in effect, these rules are never hit.
Could you check with bpf.hostLegacyRouting=true?
> I've been able to get by with using a public resolver like 1.1.1.1 in the CoreDNS ConfigMap instead of forwarding everything to /etc/resolv.conf.
You can do that, or you can change resolv.conf on the node and restart the CoreDNS pods. Either one works around the problem, but this should be handled better, probably by fixing the Docker DNS configuration (if possible) through the kind config.
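For the resolv.conf route on a kind cluster, that would look roughly like this (the node name and upstream resolver are only examples; repeat for every kind node):

```shell
# Replace the Docker-provided nameserver in the kind node's resolv.conf
# with a real upstream, then restart CoreDNS so it re-reads the file.
docker exec dev-control-plane sh -c 'echo "nameserver 8.8.8.8" > /etc/resolv.conf'
kubectl -n kube-system rollout restart deployment/coredns
```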
> Could you check with bpf.hostLegacyRouting=true?
It was observed when it was true. It's true right now, and:
root@dnsutils:/# dig @172.18.0.1 ipquail.com
;; communications error to 172.18.0.1#53: timed out
;; communications error to 172.18.0.1#53: timed out
@julianwiedmann look here: the CI tests for Cilium note this problem and hack around it: https://github.com/cilium/cilium/blob/main/contrib/scripts/kind.sh#L250-L255
This was originally documented in #23283 and was marked resolved by #30321, but that only fixes it for people using the Cilium CI scripts, as noted in #31118, so really it's not solved at all.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
I have the same problem, but with Talos and ARM servers.
> But when BPF masquerading is in effect, these rules are never hit.
> Could you check with bpf.hostLegacyRouting=true?
Good catch! That fixes the issue in my context: Talos on ARM hcloud VMs with these Cilium values:
values.yaml
prometheus: &prome
  enabled: false
  serviceMonitor:
    trustCRDsExist: true
    enabled: true
k8sServiceHost: 127.0.0.1
k8sServicePort: 7445
ipam:
  mode: kubernetes
routingMode: native
ipv4NativeRoutingCIDR: 10.0.0.0/16
loadBalancer:
  mode: dsr
bpf:
  hostLegacyRouting: true
  masquerade: true
envoy:
  enabled: true
  prometheus: *prome
encryption:
  enabled: true
  type: wireguard
  nodeEncryption: true
kubeProxyReplacement: true
localRedirectPolicy: true
operator:
  prometheus: *prome
  replicas: 2
hubble:
  relay:
    enabled: true
    prometheus: *prome
  ui:
    enabled: true
    rollOutPods: true
    podLabels:
      traefik.home.arpa/ingress: allow
  metrics:
    enableOpenMetrics: true
    enabled:
      - dns:query
      - drop
      - tcp
      - flow
      - port-distribution
      - icmp
      - http
resources: # for agent
  limits:
    memory: 1Gi
### required for Talos ###
securityContext:
  capabilities:
    ciliumAgent:
      - CHOWN
      - KILL
      - NET_ADMIN
      - NET_RAW
      - IPC_LOCK
      - SYS_ADMIN
      - SYS_RESOURCE
      - DAC_OVERRIDE
      - FOWNER
      - SETGID
      - SETUID
    cleanCiliumState: [NET_ADMIN, SYS_ADMIN, SYS_RESOURCE]
cgroup:
  autoMount:
    enabled: false
  hostRoot: /sys/fs/cgroup
logOptions:
  format: json
This is with a local-cache-dns LRP setup.
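Roughly, the LRP follows the standard node-local DNS cache pattern, something like this (the node-local-dns selector labels and port names below are the usual example values, not necessarily the exact manifest):

```yaml
# Sketch of a local redirect policy that sends kube-dns traffic to the
# node-local DNS cache pod on the same node (selector labels are assumptions).
apiVersion: cilium.io/v2
kind: CiliumLocalRedirectPolicy
metadata:
  name: nodelocaldns
  namespace: kube-system
spec:
  redirectFrontend:
    serviceMatcher:
      serviceName: kube-dns
      namespace: kube-system
  redirectBackend:
    localEndpointSelector:
      matchLabels:
        k8s-app: node-local-dns
    toPorts:
      - port: "53"
        name: dns
        protocol: UDP
      - port: "53"
        name: dns-tcp
        protocol: TCP
```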
Guys, I am experiencing this issue with a clean Talos Linux installation, 3 control plane nodes, hosted on Proxmox. I have public and private networks (eth0 / eth1 respectively).
Here's the configuration I am applying:
helm template cilium cilium/cilium \
--version 1.16.3 \
--namespace kube-system \
--set securityContext.capabilities.ciliumAgent="{CHOWN,KILL,NET_ADMIN,NET_RAW,IPC_LOCK,SYS_ADMIN,SYS_RESOURCE,DAC_OVERRIDE,FOWNER,SETGID,SETUID}" \
--set securityContext.capabilities.cleanCiliumState="{NET_ADMIN,SYS_ADMIN,SYS_RESOURCE}" \
--set cgroup.autoMount.enabled=false \
--set cgroup.hostRoot=/sys/fs/cgroup \
--set kubeProxyReplacement=true \
--set k8sServiceHost=127.0.0.1 \
--set k8sServicePort=7445 \
--set ipv4.enabled=true \
--set ipv4NativeRoutingCIDR="10.244.0.0/16" \
--set ipam.operator.clusterPoolIPv4PodCIDRList="10.244.0.0/16" \
--set devices="eth0 eth1" \
--set routingMode=native \
--set autoDirectNodeRoutes=true \
--set bpf.masquerade=true \
--set bpf.hostLegacyRouting=true \
--set bpf.datapathMode=veth \
--set enableIPv4Masquerade=true \
> cilium.yaml
I see DNS resolution failures from cilium connectivity test:
.[=] [cilium-test-1] Test [no-policies] [2/102]
...................
ℹ️ curl stdout:
:0 -> :0 = 000
ℹ️ curl stderr:
curl: (28) Resolving timed out after 2001 milliseconds
curl: (28) Resolving timed out after 2001 milliseconds
curl: (28) Resolving timed out after 2001 milliseconds
curl: (28) Resolving timed out after 2001 milliseconds
kubectl logs --namespace=kube-system -l k8s-app=kube-dns
[INFO] 10.244.0.139:55368 - 14666 "A IN one.one.one.one. udp 44 false 1232" - - 0 2.00174321s
[ERROR] plugin/errors: 2 one.one.one.one. A: read udp 10.244.0.220:60449->169.254.116.108:53: i/o timeout
[INFO] 10.244.0.139:49914 - 8235 "AAAA IN one.one.one.one. udp 44 false 1232" - - 0 2.000752971s
[ERROR] plugin/errors: 2 one.one.one.one. AAAA: read udp 10.244.0.220:39737->169.254.116.108:53: i/o timeout
[INFO] 10.244.0.139:49914 - 49060 "A IN one.one.one.one. udp 44 false 1232" - - 0 2.000909955s
[ERROR] plugin/errors: 2 one.one.one.one. A: read udp 10.244.0.220:37276->169.254.116.108:53: i/o timeout
[INFO] 10.244.0.139:49914 - 49060 "A IN one.one.one.one. udp 44 false 1232" - - 0 2.000824036s
[ERROR] plugin/errors: 2 one.one.one.one. A: read udp 10.244.0.220:40001->169.254.116.108:53: i/o timeout
[INFO] 10.244.0.139:49914 - 8235 "AAAA IN one.one.one.one. udp 44 false 1232" - - 0 2.001309189s
[ERROR] plugin/errors: 2 one.one.one.one. AAAA: read udp 10.244.0.220:37683->169.254.116.108:53: i/o timeout
[INFO] 10.244.0.139:35172 - 36517 "AAAA IN one.one.one.one. udp 44 false 1232" - - 0 2.000911343s
[ERROR] plugin/errors: 2 one.one.one.one. AAAA: read udp 10.244.0.22:35798->169.254.116.108:53: i/o timeout
[INFO] 10.244.0.139:53829 - 54548 "AAAA IN one.one.one.one. udp 44 false 1232" - - 0 2.000969259s
[ERROR] plugin/errors: 2 one.one.one.one. AAAA: read udp 10.244.0.22:56461->169.254.116.108:53: i/o timeout
[INFO] 10.244.0.139:53829 - 25924 "A IN one.one.one.one. udp 44 false 1232" - - 0 2.000747851s
[ERROR] plugin/errors: 2 one.one.one.one. A: read udp 10.244.0.22:44417->169.254.116.108:53: i/o timeout
[INFO] 10.244.0.139:53829 - 54548 "AAAA IN one.one.one.one. udp 44 false 1232" - - 0 2.001070964s
[ERROR] plugin/errors: 2 one.one.one.one. AAAA: read udp 10.244.0.22:49799->169.254.116.108:53: i/o timeout
[INFO] 10.244.0.139:53829 - 25924 "A IN one.one.one.one. udp 44 false 1232" - - 0 2.001301272s
[ERROR] plugin/errors: 2 one.one.one.one. A: read udp 10.244.0.22:55786->169.254.116.108:53: i/o timeout
Cilium tests are now working in my environment; however, it didn't work when I regenerated the YAML with bpf.hostLegacyRouting=true and applied it to a cluster that already had Cilium installed without this setting.
I have found that the only reliable way to test different Cilium settings is to restore the cluster from a backup taken before Cilium was installed, and then install Cilium fresh with the new settings. Whatever logic Cilium has for updating existing settings appears to behave differently from a fresh installation. Some settings appear to take effect after deleting the Cilium pods and letting Kubernetes recreate them; however, I have not found this approach to be reliable for all configuration changes.
Is this a known issue, or is it documented anywhere which configuration settings have been tested and confirmed to be applicable to an existing installation (happy to raise a new issue for this if you prefer)?
I think this is the underlying cause of my challenges: https://github.com/cilium/cilium/issues/29413
@rkerno did you restart all the cilium-agent pods after applying the modification?
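For example, after a helm upgrade, something like this should force the new configuration to be picked up (assuming the chart's default DaemonSet/Deployment names):

```shell
# Restart the agents and the operator so they reload the updated config;
# "cilium" and "cilium-operator" are the chart's default resource names.
kubectl -n kube-system rollout restart daemonset/cilium
kubectl -n kube-system rollout restart deployment/cilium-operator
kubectl -n kube-system rollout status daemonset/cilium --timeout=5m
```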
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.