DNS resolution lag when rule is on CiliumClusterwideNetworkPolicy
Is there an existing issue for this?
- [X] I have searched the existing issues
Version
equal to or higher than v1.16.0 and lower than v1.17.0
What happened?
The cluster is on EKS, with around 100 namespaces, over 200 nodes, and over 700 deployments. Karpenter is used for node provisioning and the nodes run Bottlerocket OS. All nodes in the cluster were flushed, and the behavior was reproduced on brand-new nodes.
Cilium version 1.16.0, Kubernetes version 1.26.
- Cilium deployed via Helm
- eBPF kube-proxy replacement instead of kube-proxy
- CiliumLocalRedirectPolicy for the node-local DNS use case, together with layer 7 DNS rules
The same setup works flawlessly in three other, smaller clusters.
We set up the following egress rules to enable layer 7 DNS policy:
```yaml
- toEndpoints:
    - matchLabels:
        io.kubernetes.pod.namespace: kube-system
        k8s-app: kube-dns
  toPorts:
    - ports:
        - port: "53"
          protocol: ANY
      rules:
        dns:
          - matchPattern: "*"
- toEndpoints:
    - matchLabels:
        io.kubernetes.pod.namespace: kube-system
        k8s-app: node-local-dns
  toPorts:
    - ports:
        - port: "53"
          protocol: ANY
      rules:
        dns:
          - matchPattern: "*"
```
When these rules are part of our CiliumClusterwideNetworkPolicy, DNS resolution takes a very long time. When we move them out of the cluster-wide policy and into a namespaced CiliumNetworkPolicy, resolution works fine again.
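For illustration, the namespaced variant that works looks roughly like the sketch below (policy name, target namespace, and the empty endpointSelector are placeholders, not our exact manifest):

```yaml
# Rough sketch only: the same DNS egress rules wrapped in a namespaced
# CiliumNetworkPolicy. Name, namespace and endpointSelector are placeholders.
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: dns-l7-egress
  namespace: example-namespace
spec:
  endpointSelector: {}
  egress:
    - toEndpoints:
        - matchLabels:
            io.kubernetes.pod.namespace: kube-system
            k8s-app: kube-dns
      toPorts:
        - ports:
            - port: "53"
              protocol: ANY
          rules:
            dns:
              - matchPattern: "*"
```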
For reference, this is our CiliumLocalRedirectPolicy:
apiVersion: "cilium.io/v2"
kind: CiliumLocalRedirectPolicy
metadata:
name: "nodelocaldns"
namespace: kube-system
spec:
redirectFrontend:
serviceMatcher:
serviceName: kube-dns
namespace: kube-system
redirectBackend:
localEndpointSelector:
matchLabels:
k8s-app: node-local-dns
toPorts:
- port: "53"
name: dns
protocol: UDP
- port: "53"
name: dns-tcp
protocol: TCP
Attaching a screen capture of the symptoms:
https://github.com/user-attachments/assets/01436e4a-39d7-402c-aa77-da5a8afbaa7f
How can we reproduce the issue?
As mentioned, exactly the same setup (deployed with Terraform) works fine on three other, smaller clusters, so it is not clear why this happens or how to reproduce it. This is a shot in the dark seeking help with troubleshooting and identifying the cause.
Cilium Version
1.16.0
Kernel Version
5.15.162 #1 SMP Fri Jul 26 21:00:52 UTC 2024 x86_64 GNU/Linux
Kubernetes Version
1.26
Regression
No response
Sysdump
No response
Relevant log output
No response
Anything else?
No response
Cilium Users Document
- [ ] Are you a user of Cilium? Please add yourself to the Users doc
Code of Conduct
- [X] I agree to follow this project's Code of Conduct
Thanks for logging this @leandro-loos, it does seem pretty odd, especially given that it only becomes a problem in a larger cluster.
Asking around, the first thing folks thought of is checking how the policy is being applied, and making sure that the cluster-wide policy is not also applying to the node-local-dns pods. Could you please post the whole CiliumClusterwideNetworkPolicy (preferably also as a sysdump, using https://docs.cilium.io/en/stable/operations/troubleshooting/#automatic-log-state-collection)?
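For example, if the cluster-wide policy currently selects all endpoints, an endpointSelector roughly like the sketch below would exclude the node-local-dns pods (label key/value assumed from your CiliumLocalRedirectPolicy; adjust to your actual labels):

```yaml
# Sketch: exclude node-local-dns pods from the cluster-wide policy so the
# L7 DNS rules are not applied to the DNS cache itself.
endpointSelector:
  matchExpressions:
    - key: k8s-app
      operator: NotIn
      values:
        - node-local-dns
```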
We have a very similar (if not identical) problem; some more details from our end as well. We are also running on EKS, Kubernetes version 1.30, Cilium version 1.16.0.
We have (had) a CCNP like this:
```yaml
apiVersion: cilium.io/v2
kind: CiliumClusterwideNetworkPolicy
metadata:
  name: allow-all-egress
spec:
  endpointSelector: {}
  egress:
    - toEntities:
        - cluster
        - kube-apiserver
    - toEndpoints:
        - matchLabels:
            io.kubernetes.pod.namespace: kube-system
            k8s-app: kube-dns
      toPorts:
        - ports:
            - port: "53"
              protocol: UDP
            - port: "443"
          rules:
            dns:
              - matchPattern: "*"
  ingress:
    - fromEntities:
        - kube-apiserver
```
With this in place, now that our cluster hosts more pods and nodes, it behaves as if DNS were unresponsive. The behavior: several applications, autoscalers, etc. time out trying to reach DNS; when we observe the traffic through Hubble we don't see any packets dropped, but the requests also never reach kube-dns/coredns. So, as described here, they are probably spending too long being processed inside Cilium.
Once we remove this CCNP and create a CNP in every namespace with the exact same settings:
```yaml
egress:
  - toEntities:
      - cluster
      - kube-apiserver
  - toEndpoints:
      - matchLabels:
          io.kubernetes.pod.namespace: kube-system
          k8s-app: kube-dns
    toPorts:
      - ports:
          - port: "53"
            protocol: ANY
          - port: "443"
            protocol: ANY
        rules:
          dns:
            - matchPattern: '*'
endpointSelector:
  matchLabels:
    io.kubernetes.pod.namespace: default
ingress:
  - fromEntities:
      - cluster
      - kube-apiserver
```
Then it works fine.
Hey, I have a similar issue: I'm upgrading Cilium from 1.14.12 to 1.16.1 in my EKS 1.29 clusters. After the Cilium upgrade, I see applications in different namespaces unable to connect to DNS (coredns), which worked fine on 1.14.12.
Did anything change between 1.14.12 and 1.16.1? I saw a similar issue with 1.15.8, so I thought of going with the stable version 1.16.1.
Although moving the rule from the CCNP into a CNP helps applications connect to DNS, certain applications connecting to ClusterIP services then start failing.
This looks interesting, thanks for the report. Any chance that you could upload a sysdump of the cluster with the CCNP applied and DNS traffic suffering from high latency? Ideally even two sysdumps, one with the working config and one with the broken one.
I'm primarily interested in:
- the full Cilium config, amongst others to understand whether you use `dnsproxy-enable-transparent-mode`
- cilium-agent logs, to see whether there are warnings/errors around timeouts in the in-agent DNS proxy
- load statistics, to see whether this is a case of CPU starvation
- policy state, if none of the other ideas lead anywhere, to see how the policy engine state differs between the CCNP and CNP setup
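For reference, the DNS proxy mode shows up as a key in the cilium-config ConfigMap; a minimal, illustrative excerpt is below (whether the key is present and which value it has in your cluster is exactly what the sysdump would tell us):

```yaml
# Illustrative excerpt only; the real cilium-config ConfigMap contains many more keys.
apiVersion: v1
kind: ConfigMap
metadata:
  name: cilium-config
  namespace: kube-system
data:
  dnsproxy-enable-transparent-mode: "true"
```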
Hello @bimmlerd, sorry for the late reply. I was checking with our company's internal security folks, and we are not allowed to share the logs/sysdumps you require :(
But we have another case we can mention: even with those DNS rules as a CNP in all namespaces, this behavior still happened. However, it happened far less frequently, e.g. once in two weeks (compared to at least once a day with the CCNP).
What we figured out this time is that flushing the BPF connection-tracking table made the connections work again, i.e.:

```
cilium-dbg bpf ct flush global
```
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
Probably not stale. Please let us know whether you can still reproduce this with 1.16.5 once it is out in a few days; some fixes to the toFQDN policy implementation have been merged.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
This issue has not seen any activity since it was marked stale. Closing.