
1.28 os image has a conflicting DNS server running on port 53 preventing node-local-dns from running

Open ami-descope opened this issue 1 year ago • 16 comments

Platform I'm building on: x86_64, AMD

What I expected to happen: The node should allow running host-port services like node-local-dns without a port conflict.

What actually happened: node-local-dns crashes with:

Listen: listen tcp 0.0.0.0:53: bind: address already in use

How to reproduce the problem: Run the bottlerocket-aws-k8s-1.28-x86_64-v1.17.0-53f322c2 image and deploy node-local-dns from the Delivery Hero chart.

The previous 1.27 Bottlerocket worked fine, and switching from Bottlerocket to AL2 works with 1.28.

ami-descope avatar Jan 11 '24 15:01 ami-descope

Hello @ami-descope, thanks for opening this issue. The relevant difference between k8s 1.27 and k8s 1.28 on Bottlerocket is that we moved from wicked to systemd-networkd, which includes using systemd-resolved (#3366). I believe resolved is creating this conflict. I'll need to take a deeper look, but you might see if you can work around this by using a different port than 53 or by reusing resolved. Can you post more of your node-local-dns config so I can try to replicate your error and see if I can find a workaround?

FWIW, most search results for this error related to systemd-resolved are going to say "disable systemd-resolved", but that isn't really a great solution. We are using resolved for DNS resolution, so stopping it may cause more problems.

yeazelm avatar Jan 12 '24 15:01 yeazelm

I've done some more investigation and, following some basic guidance on node-local-dns, I don't see the same issue you are seeing on 1.28. This might be due to your specific configuration around what you are binding on. You might need to be more restrictive than binding on 0.0.0.0 for port 53. If you can confirm what binding you are using, @ami-descope, that might help us figure out what the issue is.

yeazelm avatar Jan 17 '24 00:01 yeazelm

Hi @yeazelm, sorry for missing this.

My config for node-local-dns is (values.yaml):

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: eks.amazonaws.com/compute-type
          operator: NotIn
          values:
          - fargate
config:
  localDns: 169.254.20.25
prometheusScraping:
  enabled: false

I don't do any binds.

Here is my entire rendered chart:

---
# Source: node-local-dns/templates/serviceaccount.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: sandbox-euw1-node-local-dns-72376ac0
  namespace: kube-system  
  labels:
    helm.sh/chart: node-local-dns-2.0.3
    app.kubernetes.io/name: node-local-dns
    app.kubernetes.io/instance: sandbox-euw1-node-local-dns-72376ac0
    k8s-app: sandbox-euw1-node-local-dns-72376ac0
    app.kubernetes.io/version: "1.22.23"
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/component: "node-local-dns"
    app.kubernetes.io/created-by: "node-local-dns"
    app.kubernetes.io/part-of: "node-local-dns"
---
# Source: node-local-dns/templates/configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: sandbox-euw1-node-local-dns-72376ac0
  namespace: kube-system
  labels:
    addonmanager.kubernetes.io/mode: Reconcile
    helm.sh/chart: node-local-dns-2.0.3
    app.kubernetes.io/name: node-local-dns
    app.kubernetes.io/instance: sandbox-euw1-node-local-dns-72376ac0
    k8s-app: sandbox-euw1-node-local-dns-72376ac0
    app.kubernetes.io/version: "1.22.23"
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/component: "node-local-dns"
    app.kubernetes.io/created-by: "node-local-dns"
    app.kubernetes.io/part-of: "node-local-dns"
data:
  Corefile: |
    cluster.local:53 {
        errors
        cache {
                success 9984 30
                denial 9984 5
        }
        reload
        loop
        bind 0.0.0.0
        forward . __PILLAR__CLUSTER__DNS__ {
                force_tcp
        }
        prometheus :9253
        health
        }
    in-addr.arpa:53 {
        errors
        cache 30
        reload
        loop
        bind 0.0.0.0
        forward . __PILLAR__CLUSTER__DNS__ {
                force_tcp
        }
        prometheus :9253
        }
    ip6.arpa:53 {
        errors
        cache 30
        reload
        loop
        bind 0.0.0.0
        forward . __PILLAR__CLUSTER__DNS__ {
                force_tcp
        }
        prometheus :9253
        }
    .:53 {
        errors
        cache 30
        reload
        loop
        bind 0.0.0.0
        forward . __PILLAR__UPSTREAM__SERVERS__
        prometheus :9253
        }
---
# Source: node-local-dns/templates/service-upstream.yaml
apiVersion: v1
kind: Service
metadata:
  name: sandbox-euw1-node-local-dns-72376ac0-upstream
  namespace: kube-system
  labels:
    helm.sh/chart: node-local-dns-2.0.3
    app.kubernetes.io/name: node-local-dns
    app.kubernetes.io/instance: sandbox-euw1-node-local-dns-72376ac0
    k8s-app: sandbox-euw1-node-local-dns-72376ac0
    app.kubernetes.io/version: "1.22.23"
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/component: "node-local-dns"
    app.kubernetes.io/created-by: "node-local-dns"
    app.kubernetes.io/part-of: "node-local-dns"
spec:
  ports:
  - name: dns
    port: 53
    protocol: UDP
    targetPort: 53
  - name: dns-tcp
    port: 53
    protocol: TCP
    targetPort: 53
  selector:
    k8s-app: kube-dns
---
# Source: node-local-dns/templates/service.yaml
apiVersion: v1
kind: Service
metadata:
  name: sandbox-euw1-node-local-dns-72376ac0
  namespace: kube-system
  labels:
    helm.sh/chart: node-local-dns-2.0.3
    app.kubernetes.io/name: node-local-dns
    app.kubernetes.io/instance: sandbox-euw1-node-local-dns-72376ac0
    k8s-app: sandbox-euw1-node-local-dns-72376ac0
    app.kubernetes.io/version: "1.22.23"
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/component: "node-local-dns"
    app.kubernetes.io/created-by: "node-local-dns"
    app.kubernetes.io/part-of: "node-local-dns"
spec:
  clusterIP: None
  ports:
    - name: metrics
      port: 9253
      targetPort: 9253
  selector:
    k8s-app: sandbox-euw1-node-local-dns-72376ac0
---
# Source: node-local-dns/templates/daemonset.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: sandbox-euw1-node-local-dns-72376ac0
  namespace: kube-system
  labels:
    helm.sh/chart: node-local-dns-2.0.3
    app.kubernetes.io/name: node-local-dns
    app.kubernetes.io/instance: sandbox-euw1-node-local-dns-72376ac0
    k8s-app: sandbox-euw1-node-local-dns-72376ac0
    app.kubernetes.io/version: "1.22.23"
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/component: "node-local-dns"
    app.kubernetes.io/created-by: "node-local-dns"
    app.kubernetes.io/part-of: "node-local-dns"
spec:
  updateStrategy:
    rollingUpdate:
      maxUnavailable: 10%
  selector:
    matchLabels:
      k8s-app: node-local-dns
  template:
    metadata:
      labels:
        k8s-app: node-local-dns
      annotations:
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: eks.amazonaws.com/compute-type
                operator: NotIn
                values:
                - fargate
      priorityClassName: system-node-critical
      serviceAccountName: sandbox-euw1-node-local-dns-72376ac0
      hostNetwork: true
      dnsPolicy: Default  # Don't use cluster DNS.
      tolerations:
      - key: "CriticalAddonsOnly"
        operator: "Exists"
      - effect: "NoExecute"
        operator: "Exists"
      - effect: "NoSchedule"
        operator: "Exists"
      containers:
      - name: node-cache
        image: "registry.k8s.io/dns/k8s-dns-node-cache:1.22.23"
        resources:
          limits:
            memory: 128Mi
          requests:
            cpu: 25m
            memory: 128Mi
        args:
          - "-localip"
          - "169.254.20.25,172.20.0.10"
          - "-conf"
          - "/etc/Corefile"
          - "-upstreamsvc"
          - "sandbox-euw1-node-local-dns-72376ac0-upstream"
          - "-skipteardown=false"
          - "-setupinterface=true"
          - "-setupiptables=true"
        securityContext:
          capabilities:
            add:
            - NET_ADMIN
        ports:
        - containerPort: 53
          name: dns
          protocol: UDP
        - containerPort: 53
          name: dns-tcp
          protocol: TCP
        - containerPort: 9253
          name: metrics
          protocol: TCP
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 60
          timeoutSeconds: 5
        volumeMounts:
        - mountPath: /run/xtables.lock
          name: xtables-lock
          readOnly: false
        - name: config-volume
          mountPath: /etc/coredns
        - name: kube-dns-config
          mountPath: /etc/kube-dns
      volumes:
      - name: xtables-lock
        hostPath:
          path: /run/xtables.lock
          type: FileOrCreate
      - name: kube-dns-config
        configMap:
          name: sandbox-euw1-node-local-dns-72376ac0
          optional: true
      - name: config-volume
        configMap:
          name: sandbox-euw1-node-local-dns-72376ac0
          items:
            - key: Corefile
              path: Corefile.base


ami-descope avatar Jan 18 '24 14:01 ami-descope

From looking at this configuration, I'm pretty sure it's the bind 0.0.0.0 lines in the Corefile config. The main problem is that systemd-resolved is already listening on the loopback interface, so 0.0.0.0 will conflict. Is there a way you can scope these binds down to not include 0.0.0.0 and still get your node-local-dns to work? (I admit I'm still ramping up on how this is used, so I don't quite have a working setup to confirm this myself.) We might also be able to use an exclude (https://coredns.io/plugins/bind/) to avoid the conflict, but I haven't found the exact format to make that work.
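For illustration, a scoped-down version of one server block might look like the following. This is an untested sketch: 169.254.20.25 is the localDns address from your values.yaml, and 172.20.0.10 is the cluster DNS IP your chart passes via -localip.

```
cluster.local:53 {
    errors
    cache {
            success 9984 30
            denial 9984 5
    }
    reload
    loop
    bind 169.254.20.25 172.20.0.10
    forward . __PILLAR__CLUSTER__DNS__ {
            force_tcp
    }
    prometheus :9253
    health
    }
```

The same bind line would need to replace bind 0.0.0.0 in the other three server blocks as well.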

yeazelm avatar Jan 23 '24 01:01 yeazelm

It's supposed to be a DNS cache server; nodes are configured via kubelet to send queries to that address (no port config).

Queries first go through the local DNS cache: if it's a hit, the response is returned; if it's a miss, node-local-dns goes to CoreDNS to fetch the values and caches them.

Since it's a DNS server, I don't think I can run it on a port other than 53; I think resolv.conf only takes a DNS address, not a port.

ami-descope avatar Jan 23 '24 03:01 ami-descope

Thanks for the clarification @ami-descope. I think I have the right understanding of the local DNS caching. I believe the docs call out that the preference for what to bind to is a local-scope address:

It's recommended to use an address with a local scope, for example, from the 'link-local' range '169.254.0.0/16' for IPv4 or from the 'Unique Local Address' range in IPv6 'fd00::/8'.

I'm not sure if you can set that via the helm chart you are using but when I set this to 169.254.0.1 on my aws-k8s-1.28 instance, the node local DNS did seem to be working (but I didn't confirm every use case so that is why I asked if there is something I might be missing). Can you see if using a specific local scope address works for your configuration?

yeazelm avatar Jan 23 '24 16:01 yeazelm

But the issue is that the port is taken no?

ami-descope avatar Jan 23 '24 17:01 ami-descope

But the issue is that the port is taken no?

The port is only taken on the loopback interface, but that conflicts with 0.0.0.0, which attempts to bind on all interfaces. So if you use ss to look at bound ports (specifically :53) for a default configuration on Bottlerocket running systemd-resolved (currently aws-k8s-1.28+ and aws-ecs-2), you get the following:

ss -tulpn | grep :53
udp   UNCONN 0      0                           127.0.0.54:53         0.0.0.0:*    users:(("systemd-resolve",pid=1516,fd=15))
udp   UNCONN 0      0                        127.0.0.53%lo:53         0.0.0.0:*    users:(("systemd-resolve",pid=1516,fd=13))
tcp   LISTEN 0      4096                     127.0.0.53%lo:53         0.0.0.0:*    users:(("systemd-resolve",pid=1516,fd=14))
tcp   LISTEN 0      4096                        127.0.0.54:53         0.0.0.0:*    users:(("systemd-resolve",pid=1516,fd=16))

If you avoid binding on those addresses (in 127.0.0.0/8), it should work.
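The conflict is easy to reproduce outside of DNS entirely. Here is a small Python sketch (using an ephemeral UDP port instead of 53 so it runs unprivileged, and 127.0.0.2 as a stand-in for a second local address; Linux-specific):

```python
import errno
import socket

# Stand-in for systemd-resolved: hold the port on a single loopback
# address. Port 0 asks the kernel for a free ephemeral port, so this
# runs unprivileged (the real conflict is on port 53).
resolved = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
resolved.bind(("127.0.0.1", 0))
port = resolved.getsockname()[1]

# Stand-in for `bind 0.0.0.0`: the wildcard wants the port on every
# address, including the one held above, so the kernel refuses with
# EADDRINUSE -- the "address already in use" error node-local-dns logs.
wildcard = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
try:
    wildcard.bind(("0.0.0.0", port))
    conflict = False
except OSError as exc:
    conflict = exc.errno == errno.EADDRINUSE

# Stand-in for a scoped bind such as `bind 169.254.20.25`: a different,
# specific address coexists with the first socket just fine. 127.0.0.2
# works here because all of 127.0.0.0/8 sits on the Linux loopback.
scoped = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
scoped.bind(("127.0.0.2", port))

print("wildcard bind conflicts:", conflict)
print("scoped bind succeeds:", scoped.getsockname() == ("127.0.0.2", port))
```

The same reasoning applies on the node: the resolved stub only holds 127.0.0.53:53 and 127.0.0.54:53, so any non-loopback bind address leaves it undisturbed.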

yeazelm avatar Jan 23 '24 17:01 yeazelm

Just to confirm: I have been running with an explicit locally scoped address across 5 clusters for the past year or so and have had no issues. As per the recommendation, I chose 169.254.20.10 in the link-local range.

We're not running 1.28 yet, but I was lurking on this issue as that upgrade is just around the corner.

The config was generated using the sed method from the docs...

kubedns=`kubectl get svc kube-dns -n kube-system -o jsonpath={.spec.clusterIP}`
domain=cluster.local
localdns=169.254.20.10
sed -i "s/__PILLAR__LOCAL__DNS__/$localdns/g; s/__PILLAR__DNS__DOMAIN__/$domain/g; s/__PILLAR__DNS__SERVER__/$kubedns/g" nodelocaldns.yaml

Our full config for clarity...

cluster.local:53 {
    errors
    cache {
            success 9984 30
            denial 9984 5
    }
    reload
    loop
    bind 169.254.20.10 172.16.0.10
    forward . __PILLAR__CLUSTER__DNS__ {
            force_tcp
    }
    prometheus :9253
    health 169.254.20.10:8080
    }
in-addr.arpa:53 {
    errors
    cache 30
    reload
    loop
    bind 169.254.20.10 172.16.0.10
    forward . __PILLAR__CLUSTER__DNS__ {
            force_tcp
    }
    prometheus :9253
    }
ip6.arpa:53 {
    errors
    cache 30
    reload
    loop
    bind 169.254.20.10 172.16.0.10
    forward . __PILLAR__CLUSTER__DNS__ {
            force_tcp
    }
    prometheus :9253
    }
.:53 {
    errors
    cache 30
    reload
    loop
    bind 169.254.20.10 172.16.0.10
    forward . __PILLAR__UPSTREAM__SERVERS__
    prometheus :9253
    }

theplatformer avatar Jan 23 '24 20:01 theplatformer

Thanks for the extra info @theplatformer. I think this confirms that the local address works, and I don't anticipate issues with upgrading to k8s 1.28 and beyond while binding on addresses such as 169.254.20.10, since they won't conflict with the systemd-resolved binding.

yeazelm avatar Jan 23 '24 22:01 yeazelm

But the issue is that the port is taken no?

My local DNS address is 169.254.20.25

Is that conflicting with resolved?

ami-descope avatar Jan 23 '24 22:01 ami-descope

But the issue is that the port is taken no?

My local DNS address is 169.254.20.25

Is that conflicting with resolved?

The problem is that the actual config ends up trying to bind on 0.0.0.0 rather than just 169.254.20.25, so the helm chart you are using appears not to be taking that input and adjusting the bind configuration accordingly.

From your posted configuration:

data:
  Corefile: |
    cluster.local:53 {
        errors
        cache {
                success 9984 30
                denial 9984 5
        }
        reload
        loop
        bind 0.0.0.0              <---- This is the problem config
        forward . __PILLAR__CLUSTER__DNS__ {
                force_tcp
        }
        prometheus :9253
        health
        }
    in-addr.arpa:53 {
        errors
        cache 30
        reload
        loop
        bind 0.0.0.0             <---- This is the problem config
        forward . __PILLAR__CLUSTER__DNS__ {
                force_tcp
        }
        prometheus :9253
        }
    ip6.arpa:53 {
        errors
        cache 30
        reload
        loop
        bind 0.0.0.0           <---- This is the problem config
        forward . __PILLAR__CLUSTER__DNS__ {
                force_tcp
        }
        prometheus :9253
        }
    .:53 {
        errors
        cache 30
        reload
        loop
        bind 0.0.0.0        <---- This is the problem config
        forward . __PILLAR__UPSTREAM__SERVERS__
        prometheus :9253
        }
---

You might try editing this in place to replace 0.0.0.0 with 169.254.20.25 via kubectl edit and if that works, then it might be worth following up with the helm chart owners to see if they can support this change.
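As a sketch of what that edit amounts to, the substitution only needs to touch the bind lines (illustrative only — the trimmed Corefile below is not your full config, and 169.254.20.25 is the localDns address from your values above):

```python
# Illustrative only: the same one-line substitution `kubectl edit` would
# make by hand, applied to a trimmed-down Corefile. 169.254.20.25 stands
# in for the node-local address from the poster's values.yaml.
corefile = """\
cluster.local:53 {
    bind 0.0.0.0
}
.:53 {
    bind 0.0.0.0
}
"""
patched = corefile.replace("bind 0.0.0.0", "bind 169.254.20.25")
print(patched.count("bind 169.254.20.25"))  # 2 -- one per server block
print("0.0.0.0" in patched)                 # False
```

The real ConfigMap has four server blocks, so all four bind lines need the same change.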

yeazelm avatar Jan 23 '24 23:01 yeazelm

Hi @yeazelm, thanks in advance. Does systemd-resolved serve the same purpose as node-local-dns in caching pod DNS queries? Any docs on how systemd-resolved works?

just1900 avatar May 17 '24 15:05 just1900

Hi @yeazelm, thanks in advance. Does systemd-resolved serve the same purpose as node-local-dns in caching pod DNS queries? Any docs on how systemd-resolved works?

systemd-resolved is for host resolution, whereas node-local-dns sets up a DNS cache server as a DaemonSet intended to handle DNS resolution for the orchestrated containers. The configs mentioned in this thread run a DNS cache server in a pod for this purpose; the problem arises when one tries to bind to port 53 on all addresses, since this conflicts with the existing host DNS resolver. Ideally, you shouldn't need to think about systemd-resolved at all on Bottlerocket. Nonetheless, https://www.freedesktop.org/software/systemd/man/latest/systemd-resolved.service.html is a good place to start to understand how systemd-resolved works.

yeazelm avatar Jun 07 '24 23:06 yeazelm

@ami-descope What Helm chart are you using? There is no official one on the project GitHub here, just the YAML here. If you are using a Helm chart from any source, you might create a PR to make the bind address configurable, with the link-local address 169.254.20.10 as a good IPv4 default plus the CoreDNS service "kube-dns" IP.

youwalther65 avatar Jun 28 '24 09:06 youwalther65

It seems to work for me after using the most popular (but unofficial) Helm chart for node-local-dns and setting the value bindIp: true.

This value was introduced because of this issue. See this PR: https://github.com/deliveryhero/helm-charts/pull/563

simon-wessel avatar Jun 28 '24 14:06 simon-wessel