DNS resolution lag when rule is on CiliumClusterwideNetworkPolicy
Is there an existing issue for this?
- [X] I have searched the existing issues
Version
equal to or higher than v1.16.0 and lower than v1.17.0
What happened?
The cluster is on EKS, with around 100 namespaces, over 200 nodes, and over 700 deployments. Karpenter is used for node provisioning and the nodes run Bottlerocket OS. All nodes in the cluster were flushed, and the behavior was reproduced on brand-new nodes.
Cilium version 1.16.0, Kubernetes version 1.26.
- Cilium deployed via Helm
- eBPF kube-proxy replacement instead of kube-proxy
- CiliumLocalRedirectPolicy for the node-local DNS use case, together with layer 7 DNS rules
The same setup works flawlessly in three other, smaller clusters.
We set up the following egress rules to enable layer 7 DNS policy:
```yaml
- toEndpoints:
    - matchLabels:
        io.kubernetes.pod.namespace: kube-system
        k8s-app: kube-dns
  toPorts:
    - ports:
        - port: "53"
          protocol: ANY
      rules:
        dns:
          - matchPattern: "*"
- toEndpoints:
    - matchLabels:
        io.kubernetes.pod.namespace: kube-system
        k8s-app: node-local-dns
  toPorts:
    - ports:
        - port: "53"
          protocol: ANY
      rules:
        dns:
          - matchPattern: "*"
```
When these rules are part of our CiliumClusterwideNetworkPolicy, DNS resolution takes a very long time. When we move them out of the cluster-wide policy and into a namespaced CiliumNetworkPolicy, resolution works fine again.
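For illustration, the namespaced variant that works looks roughly like the sketch below (policy name, target namespace, and the empty endpointSelector are placeholders, not our exact manifest):

```yaml
# Rough sketch only: the same DNS egress rules wrapped in a namespaced
# CiliumNetworkPolicy. Name, namespace and endpointSelector are placeholders.
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: dns-l7-egress
  namespace: example-namespace
spec:
  endpointSelector: {}
  egress:
    - toEndpoints:
        - matchLabels:
            io.kubernetes.pod.namespace: kube-system
            k8s-app: kube-dns
      toPorts:
        - ports:
            - port: "53"
              protocol: ANY
          rules:
            dns:
              - matchPattern: "*"
```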
For reference, this is our CiliumLocalRedirectPolicy:
apiVersion: "cilium.io/v2"
kind: CiliumLocalRedirectPolicy
metadata:
name: "nodelocaldns"
namespace: kube-system
spec:
redirectFrontend:
serviceMatcher:
serviceName: kube-dns
namespace: kube-system
redirectBackend:
localEndpointSelector:
matchLabels:
k8s-app: node-local-dns
toPorts:
- port: "53"
name: dns
protocol: UDP
- port: "53"
name: dns-tcp
protocol: TCP
Attaching a screen capture of the symptoms:
https://github.com/user-attachments/assets/01436e4a-39d7-402c-aa77-da5a8afbaa7f
How can we reproduce the issue?
As mentioned, exactly the same setup (deployed with Terraform) works fine on three other, smaller clusters, so it is not clear why this happens or how to reproduce it. This is a shot in the dark seeking help with troubleshooting and identifying the cause.
Cilium Version
1.16.0
Kernel Version
5.15.162 #1 SMP Fri Jul 26 21:00:52 UTC 2024 x86_64 GNU/Linux
Kubernetes Version
1.26
Regression
No response
Sysdump
No response
Relevant log output
No response
Anything else?
No response
Cilium Users Document
- [ ] Are you a user of Cilium? Please add yourself to the Users doc
Code of Conduct
- [X] I agree to follow this project's Code of Conduct
Thanks for logging this @leandro-loos, it does seem pretty odd, especially given that it only becomes a problem in a larger cluster.
Asking around, the first thing folks thought of is checking how the policy is being applied, and making sure that the cluster-wide policy is not also applying to the node-local-dns pods. Could you please post the whole CiliumClusterwideNetworkPolicy (preferably also as a sysdump, using https://docs.cilium.io/en/stable/operations/troubleshooting/#automatic-log-state-collection)?
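For example, if the cluster-wide policy currently selects all endpoints, an endpointSelector roughly like the sketch below would exclude the node-local-dns pods (label key/value assumed from your CiliumLocalRedirectPolicy; adjust to your actual labels):

```yaml
# Sketch: exclude node-local-dns pods from the cluster-wide policy so the
# L7 DNS rules are not applied to the DNS cache itself.
endpointSelector:
  matchExpressions:
    - key: k8s-app
      operator: NotIn
      values:
        - node-local-dns
```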
We have a very similar (if not identical) problem; some more details from our end as well. We are also running on EKS, Kubernetes version 1.30, Cilium version 1.16.0.
We have (had) a CCNP like this:
```yaml
apiVersion: cilium.io/v2
kind: CiliumClusterwideNetworkPolicy
metadata:
  name: allow-all-egress
spec:
  endpointSelector: {}
  egress:
    - toEntities:
        - cluster
        - kube-apiserver
    - toEndpoints:
        - matchLabels:
            io.kubernetes.pod.namespace: kube-system
            k8s-app: kube-dns
      toPorts:
        - ports:
            - port: "53"
              protocol: UDP
            - port: "443"
          rules:
            dns:
              - matchPattern: "*"
  ingress:
    - fromEntities:
        - kube-apiserver
```
With this in place, now that our cluster hosts more pods and nodes, it behaves as if DNS were unresponsive. The behavior: several applications, autoscalers, etc. time out trying to reach DNS; when we observe the traffic through Hubble we don't see any packets dropped, but the requests also never reach kube-dns/coredns. So, as described here, they are probably spending too long being processed inside Cilium.
Once we remove this CCNP and create a CNP in every namespace with the exact same settings:
```yaml
egress:
  - toEntities:
      - cluster
      - kube-apiserver
  - toEndpoints:
      - matchLabels:
          io.kubernetes.pod.namespace: kube-system
          k8s-app: kube-dns
    toPorts:
      - ports:
          - port: "53"
            protocol: ANY
          - port: "443"
            protocol: ANY
        rules:
          dns:
            - matchPattern: '*'
endpointSelector:
  matchLabels:
    io.kubernetes.pod.namespace: default
ingress:
  - fromEntities:
      - cluster
      - kube-apiserver
```
Then it works fine.
Hey, I have a similar issue: I'm upgrading Cilium from 1.14.12 to 1.16.1 in my EKS 1.29 clusters. After the Cilium upgrade, I see applications in different namespaces unable to connect to DNS (coredns), which worked fine on 1.14.12.
Did anything change between 1.14.12 and 1.16.1? I saw a similar issue with 1.15.8, so I thought of going with the stable version 1.16.1.
Although moving the rule from the CCNP into a CNP helps applications connect to DNS, certain applications connecting to ClusterIP services then start failing.
This looks interesting, thanks for the report. Any chance that you could upload a sysdump of the cluster with the CCNP applied and DNS traffic suffering from high latency? Ideally even two sysdumps, one with the working config and one with the broken one.
I'm primarily interested in:
- the full Cilium config, amongst others to understand whether you use `dnsproxy-enable-transparent-mode`
- cilium-agent logs, to see whether there are warnings/errors around timeouts in the in-agent DNS proxy
- load statistics, to see whether this is a case of CPU starvation
- policy state, if none of the other ideas lead anywhere, to see how the policy engine state differs between the CCNP and CNP setup
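For reference, the DNS proxy mode shows up as a key in the cilium-config ConfigMap; a minimal, illustrative excerpt is below (whether the key is present and which value it has in your cluster is exactly what the sysdump would tell us):

```yaml
# Illustrative excerpt only; the real cilium-config ConfigMap contains many more keys.
apiVersion: v1
kind: ConfigMap
metadata:
  name: cilium-config
  namespace: kube-system
data:
  dnsproxy-enable-transparent-mode: "true"
```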
Hello @bimmlerd, sorry for the late reply. I was checking with our company's internal security folks, and we are not allowed to share the logs/sysdumps you require :(
But we have another case we can mention: even with those DNS rules as a CNP in all namespaces, this behavior still happened. However, it happened far less frequently, e.g. once in two weeks (compared to at least once a day with the CCNP).
What we figured out this time is that flushing the BPF connection-tracking table made the connections work again, i.e.:

```
cilium-dbg bpf ct flush global
```
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
Probably not stale. Please let us know whether you can still reproduce this with 1.16.5 once it is out in a few days; some fixes to the toFQDN policy implementation have been merged.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
This issue has not seen any activity since it was marked stale. Closing.