
DNS Resolution Issue in Kubernetes

Open srheaume opened this issue 5 years ago • 12 comments

Summary

Since version 7.0.0, DNS resolution no longer works properly in Kubernetes. For example, a resource, such as git, that refers to an external URL is unable to read from a remote repository.

Solving the problem as proposed in the documentation does not solve it completely: scripts in pipelines that refer to URLs local to Kubernetes can no longer be resolved (e.g. "http://concourse-web.concourse.svc.cluster.local:8080").

In order for the new version of Concourse to be as easy to use in a Kubernetes environment as it was before, it would be nice if workers could inherit the DNS configuration of their hosts, giving them the same DNS resolution as in previous versions of Concourse.

This is related to the following issue: https://github.com/concourse/concourse/issues/6544

Steps to reproduce

Here are the details on the environment in which the problem can be reproduced:

  • Minikube Version: 1.17.1 (using the VirtualBox driver)
  • Kubernetes Version: 1.20.2
  • Concourse Helm-Chart Version: 14.6.0
  • Concourse Version: 7.0.0

Note: Default values with the Helm Chart will reproduce the issue.
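For reference, a minimal setup matching the environment above might look like the following (the Helm release name is arbitrary, and the chart repo URL is assumed to be the official Concourse charts bucket):

```shell
# Start minikube with the VirtualBox driver -- the detail that turns out to matter
minikube start --driver=virtualbox --kubernetes-version=v1.20.2

# Install the Concourse chart with default values
helm repo add concourse https://concourse-charts.storage.googleapis.com
helm install concourse concourse/concourse --version 14.6.0
```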

Test Pipeline

resources:
- name: concourse-docs-git
  type: git
  icon: github
  source:
    uri: git@github.com:concourse/docs.git
    branch: master
    git_config:
    - name: core.sshCommand
      value: "ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no"
    private_key: |
      -----BEGIN OPENSSH PRIVATE KEY-----
      ......
      -----END OPENSSH PRIVATE KEY-----

jobs:
- name: job
  public: true
  plan:
  - get: concourse-docs-git
    trigger: true
  - task: list-files
    config:
      inputs:
        - name: concourse-docs-git
      platform: linux
      image_resource:
        type: registry-image
        source: { repository: busybox }
      run:
        path: ls
        args: ["-la", "./concourse-docs-git"]

Triaging info

  • Concourse version: 7.0.0
  • Did this used to work? yes

srheaume avatar Feb 18 '21 19:02 srheaume

Just got around to trying to reproduce this and I still can't!

Followed your setup; minikube version v1.17.1 and k8s 1.20.2 with chart 14.6.2 using default values.yml.

  • I modified the pipeline to use https instead of the ssh uri for the repo initially because I didn't want to fetch my private key
  • With https the check and get worked. Okay, maybe it's just an ssh problem?
  • I then changed the pipeline back to what you had; ssh uri with a private key for auth. Both check and get steps worked. No errors.
  • I hijacked onto the check container and ran the following command to test the cluster DNS
    curl test-web.default.svc.cluster.local:8080
    
    and it worked, I got back the html from the Concourse dashboard.
    • I could also ping to confirm that the DNS was resolving to the web service's IP
  • Checked /etc/resolv.conf and it contains the worker's address. Not sure if this is expected, would need to go through the code.
    # cat /etc/resolv.conf
    nameserver 172.17.0.6
    

Have you tried enabling DNS proxy for the workers? If you look for dnsProxyEnable in the values.yaml, you'll get two hits: one for garden and a second for containerd. Could you try enabling that and see if it fixes the problem for you? Looking at our topgun/k8s tests, in-cluster DNS resolution should work, in most circumstances, with and without the dns proxy.

https://github.com/concourse/concourse/blob/4c5121cb96c4299ee4330114ec70fdc3dfb7e102/topgun/k8s/dns_proxy_test.go#L89-L120
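For reference, enabling the proxy via Helm could look like this (the exact value paths are assumptions; verify them against the two `dnsProxyEnable` hits in the chart's values.yaml):

```shell
# Hypothetical overrides -- confirm the paths in values.yaml before using
helm upgrade concourse concourse/concourse \
  --set concourse.worker.garden.dnsProxyEnable=true \
  --set concourse.worker.containerd.dnsProxyEnable=true
```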

taylorsilva avatar Mar 02 '21 16:03 taylorsilva

I have tried setting the dnsProxyEnable property. No success. If you're willing, we can organize a Slack call. It might be easier.

srheaume avatar Mar 03 '21 21:03 srheaume

Let's do it! I'm good with slack, though if you're on discord you could also join the Concourse discord server. The entire team is always hanging out there: https://discord.gg/MeRxXKW

taylorsilva avatar Mar 03 '21 21:03 taylorsilva

Was able to reproduce it; the difference was --driver virtualbox when starting minikube. Missed that note when I first tried reproducing.

taylorsilva avatar Mar 15 '21 20:03 taylorsilva

  • fly hijack onto the check container
  • contents of /etc/resolv.conf
    • nameserver 172.17.0.6
  • probably install tools like dig, ping, curl
  • try reaching hostnames and IPs directly (http://github.com/ vs 140.82.113.4)
    • ping by default tries to use IPv6
    • using IPv4 resulted in a successful DNS lookup
# ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: sit0@NONE: <NOARP> mtu 1480 qdisc noop state DOWN qlen 1000
    link/sit 0.0.0.0 brd 0.0.0.0
4: wr78739ktjel-1@if5: <BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN> mtu 1500 qdisc noqueue state UP qlen 1
    link/ether da:40:91:13:d3:c3 brd ff:ff:ff:ff:ff:ff
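The checklist above, as concrete commands (the fly target and pipeline name are placeholders; the web service hostname matches the earlier test and assumes a release named "test" in the default namespace):

```shell
# Hijack into the check container for the git resource
fly -t <target> hijack --check <pipeline>/concourse-docs-git -- sh

# Inside the container:
cat /etc/resolv.conf                                 # shows nameserver 172.17.0.6 (the worker)
ping -4 -c 1 github.com                              # force IPv4; IPv6 is tried first by default
curl http://test-web.default.svc.cluster.local:8080  # in-cluster DNS check
curl http://140.82.113.4/                            # bypass DNS entirely with a raw IP
```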

  • kubectl exec onto the concourse worker
  • contents of /etc/resolv.conf
    • nameserver 10.96.0.10 search default.svc.cluster.local svc.cluster.local cluster.local options ndots:5
  • probably install tools like dig, ping, curl
  • try reaching hostnames and IPs directly (http://github.com/ vs 140.82.113.4)
  • check IPtable rules
    • capture IP tables rules for comparison later to a docker-compose concourse
# ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: sit0@NONE: <NOARP> mtu 1480 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/sit 0.0.0.0 brd 0.0.0.0
3: wbrdg-0a500000: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default
    link/ether 3e:20:6a:32:0a:d8 brd ff:ff:ff:ff:ff:ff
5: wr78739ktjel-0@if4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master wbrdg-0a500000 state UP mode DEFAULT group default qlen 1
    link/ether 9e:69:49:f9:03:f4 brd ff:ff:ff:ff:ff:ff link-netnsid 1
14: eth0@if15: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default
    link/ether 02:42:ac:11:00:06 brd ff:ff:ff:ff:ff:ff link-netnsid 0
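The worker-side checks, as commands (pod name assumed from the chart's default StatefulSet naming):

```shell
# Exec into the Concourse worker pod
kubectl exec -it concourse-worker-0 -- bash

# Inside the worker:
cat /etc/resolv.conf                         # cluster DNS: nameserver 10.96.0.10 plus cluster search domains
iptables-save > /tmp/k8s-worker-rules.txt    # capture rules to diff later against a docker-compose Concourse
```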

  • Did this stop working at a particular k8s version? Currently we're on 1.20.
    • Does DNS lookup work on earlier versions of k8s with Concourse 7.x?
    • k8s 1.19.8: dns lookup fails
    • k8s 1.16.15: dns lookup fails
    • at this point we realize that the problem is likely with a change in Concourse, not k8s
    • The team's prod k8s cluster is on 1.18 and we're not seeing this issue with Concourse 7.x, though we're using containerd. We do use in-cluster dns when running the topgun/k8s tests and those are passing.
    • user reported that containerd does work for them, except for local in-cluster DNS resolution
    • tried initial test pipeline on k8s 1.20.2 using Concourse 6.7.5 and dns resolution works!

So the problem is in Concourse, introduced between versions 6.7.x and 7.x.

  • gdn and runc versions are the same in the latest 6.7.x and 7.0.x and 7.1.x - gdn version is 1.19.18, runc is v1.0.0-rc92
  • not sure what else we could have changed that would cause this issue between these versions of Concourse
  • Could it be something that changed in the Concourse chart? Did we add something to the worker templates? Maybe try using an older version of the chart with Concourse 7.x?
    • Tried this and got the same DNS error. It's something we changed in 7.x.
  • Review commits made in the worker/ dir between 6.7.x and 7.0.x
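A sketch of that review (the tag names are assumptions for the relevant releases):

```shell
git clone https://github.com/concourse/concourse
cd concourse
# List commits touching worker/ between the last-known-good 6.7.x release and 7.0.0
git log --oneline v6.7.5..v7.0.0 -- worker/
```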

EstebanFS avatar Mar 16 '21 21:03 EstebanFS

With Concourse 7.0.0, on the Concourse worker container, we looked at the veth device that Concourse created vs. the one that k8s created:

# concourse veth device
root@concourse-worker-1:/# ip addr show wr9aqgv7r5b4-0
5: wr9aqgv7r5b4-0@if4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master wbrdg-0a500000 state UP group default qlen 1
    link/ether be:14:2c:7f:de:66 brd ff:ff:ff:ff:ff:ff link-netnsid 1

# k8s veth device
root@concourse-worker-1:/# ip addr show eth0
52: eth0@if53: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
    link/ether 02:42:ac:11:00:07 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 172.17.0.7/16 brd 172.17.255.255 scope global eth0
       valid_lft forever preferred_lft forever

We thought the bridge device was not being created, but we were wrong. It is being created; we just didn't notice it until we looked at 6.7.5.


With Concourse 6.7.5, on the Concourse worker container, we get the following, which looks the same as 7.0.0:

root@concourse-worker-0:/# ip addr show wr9c9b9chlma-0
8: wr9c9b9chlma-0@if7: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master wbrdg-0a500000 state UP group default qlen 1
    link/ether 32:d7:ff:bf:4c:02 brd ff:ff:ff:ff:ff:ff link-netnsid 1

root@concourse-worker-0:/# ip addr show eth0
76: eth0@if77: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
    link/ether 02:42:ac:11:00:07 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 172.17.0.7/16 brd 172.17.255.255 scope global eth0
       valid_lft forever preferred_lft forever

root@concourse-worker-0:/# ip addr show wbrdg-0a500000
6: wbrdg-0a500000: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
    link/ether b6:79:6d:80:be:79 brd ff:ff:ff:ff:ff:ff
    inet 10.80.0.1/30 brd 10.80.0.3 scope global wbrdg-0a500000
       valid_lft forever preferred_lft forever

EstebanFS avatar Mar 17 '21 15:03 EstebanFS

@chenbh mentioned this looked similar to https://github.com/concourse/concourse/issues/6038

We updated our pipeline to use the ubuntu based version of the git-resource:

resource_types:
- name: git
  type: registry-image
  source:
    repository: concourse/git-resource
    tag: ubuntu

resources:
- name: concourse-docs-git
  type: git
  check_every: 99h
  icon: github
  source:
    uri: https://github.com/concourse/docs.git

jobs:
- name: job
  public: true
  plan:
  - get: concourse-docs-git
    trigger: true
  - task: list-files
    config:
      inputs:
        - name: concourse-docs-git
      platform: linux
      image_resource:
        type: registry-image
        source: { repository: busybox }
      run:
        path: ls
        args: ["-la", "./concourse-docs-git"]

and it worked!

So something in the alpine image our resources are based on is causing dns lookup to fail in k8s.


Decided to see what happens if we run the git-resource as a pod directly on k8s:

$ kubectl run git-alpine -it --image=concourse/git-resource:1.12 -- sh
# ping github.com
ping: bad address 'github.com'
...

It fails!
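Running the same one-off pod test with the ubuntu-tagged image (the tag is taken from the pipeline fix above) makes the alpine/ubuntu split easy to confirm side by side:

```shell
# alpine-based image: DNS lookup fails under this minikube/VirtualBox setup
kubectl run git-alpine -it --rm --image=concourse/git-resource:1.12 -- ping -c 1 github.com

# ubuntu-based image: resolves fine in the same cluster
kubectl run git-ubuntu -it --rm --image=concourse/git-resource:ubuntu -- ping -c 1 github.com
```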

taylorsilva avatar Mar 17 '21 18:03 taylorsilva

That would also explain why 6.7.5 works: the base resource types it comes with use an older image, which also means an older library.

Can you try changing the VM's DNS resolver to --natdnsproxy as suggested by https://gitlab.alpinelinux.org/alpine/aports/-/issues/11879#note_111445?

Shakes fist at VirtualBox + Mac

chenbh avatar Mar 17 '21 19:03 chenbh

@srheaume Besides virtualbox, what other k8s environments are you seeing this issue in? GKE? EKS? other cloud providers? See Daniel's comment above^


I tried reproducing on GKE v1.18.12-gke.1210 but could not.

$ kubectl run git-alpine -it --image=concourse/git-resource:1.12 -- sh
If you don't see a command prompt, try pressing enter.
/usr/lib # ping github.com
PING github.com (140.82.114.3): 56 data bytes
64 bytes from 140.82.114.3: seq=0 ttl=50 time=26.893 ms
64 bytes from 140.82.114.3: seq=1 ttl=50 time=27.110 ms
^C
--- github.com ping statistics ---
2 packets transmitted, 2 packets received, 0% packet loss
round-trip min/avg/max = 26.893/27.001/27.110 ms

Stopped my minikube vm and ran VBoxManage modifyvm minikube --natdnsproxy1 on as per Daniel's comment. That didn't fix the problem though. Tried VBoxManage modifyvm minikube --natdnsproxy2 on and still got the error. 🤷

taylorsilva avatar Mar 17 '21 20:03 taylorsilva

@taylorsilva Sorry for the delay, I had a crazy week. I successfully deployed the latest version of Concourse on OpenShift. The problem seems to be related to VirtualBox.

srheaume avatar Mar 24 '21 14:03 srheaume

Thanks for verifying. I'm going to de-prioritize this for the team but leave it open.

To recap, the issue is with alpine-based container images running on k8s on VirtualBox. A workaround is to use the ubuntu-based images instead, which one can get using the ubuntu tag on any of the concourse/<resource> resources.

taylorsilva avatar Mar 24 '21 19:03 taylorsilva

I am running into this issue with Concourse v7.4.0 on EKS 1.20

laurenty avatar Jun 14 '22 21:06 laurenty