
Unreliable DNS during Container Builds

Open tpanum opened this issue 2 years ago • 13 comments

I have a GitHub Actions pipeline that roughly proceeds like so:

  1. Connect to a private Tailscale network using this GitHub Action.
  2. Start a Docker build of a multi-stage Dockerfile in which an internal Python package index[1] is accessed to pull dependencies.

[1]: This package index is available within the Tailscale network and is resolved via an internal DNS server configured in Tailscale.

Step 2 occasionally fails because DNS lookups for the package index fail, i.e. DNS resolution is not working reliably.
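
For reference, the workflow looks roughly like this (the secret names, tags, and image name below are placeholders, not the exact pipeline):

- name: Connect to the tailnet
  uses: tailscale/github-action@v2
  with:
    oauth-client-id: ${{ secrets.TS_OAUTH_CLIENT_ID }}
    oauth-secret: ${{ secrets.TS_OAUTH_SECRET }}
    tags: tag:ci

- name: Build image (pulls from the internal package index)
  run: docker build -t my-image .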

tpanum avatar Dec 11 '23 12:12 tpanum

Are you using buildx? I was having this issue, but it seems to be resolved by setting network=host in driver-opts
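
In workflow terms, that suggestion is something like:

- name: Set up Docker Buildx
  uses: docker/setup-buildx-action@v3
  with:
    driver-opts: |
      network=host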

ivorisoutdoors avatar Jan 12 '24 17:01 ivorisoutdoors

I have been using buildx, yeah. I have tried network=host in the past, but it did not make things more reliable in my experience. I ended up doing dig example.com > $IP and then using --add-host.
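
Roughly, that workaround as a step (pypi.internal.example.com stands in for the real index hostname):

- name: Pin the package index IP and build
  run: |
    # Resolve the internal index over Tailscale DNS once, then bypass DNS inside the build
    INDEX_IP="$(dig +short pypi.internal.example.com | head -n1)"
    docker build --add-host "pypi.internal.example.com:${INDEX_IP}" -t my-image .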

tpanum avatar Jan 13 '24 16:01 tpanum

I was dealing with basically the same issue (container builds that rely on internal resources failing to resolve DNS), and the thing that finally fixed it in my case was to configure buildx with the internal DNS nameservers:

- name: Setup buildx with internal DNS
  uses: docker/setup-buildx-action@v3
  with:
    config-inline: |
      [dns]
        nameservers="<comma separated list>"

henworth avatar Mar 15 '24 13:03 henworth

Even that is inconsistent; it works sometimes, but mostly doesn't.

For some reason, it's using 168.63.129.16 for DNS resolution.
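
A quick throwaway diagnostic to confirm which resolver a BuildKit build actually sees (not part of the real pipeline):

- name: Show DNS inside a BuildKit build
  run: |
    printf 'FROM alpine\nRUN cat /etc/resolv.conf\n' \
      | docker buildx build --no-cache --progress=plain -f - .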

kdpuvvadi avatar Mar 26 '24 11:03 kdpuvvadi

Following @henworth's example, I had to change it slightly to get it working:

with:
  buildkitd-config-inline: |
    [dns]
      nameservers=["..."]

tpanum avatar Mar 27 '24 10:03 tpanum

Still the same though, @tpanum.

Mine looks like this:

- name: Set up Docker Buildx
  uses: docker/setup-buildx-action@v3
  with:
    buildkitd-config-inline: |
      [dns]
        nameservers=["100.78.xx.xx","100.78.xx.xx"]

And it errors out:

#45 ERROR: failed to push git.local.puvvadi.net/***/blog:e234247: failed to do request: Head "https://git.local.puvvadi.net/v2/***/blog/blobs/sha256:bbba97e7b63ba8e2a28aa20a0a10b6ba491f29a395e8f7cc5bdf6ff4fe783000": dial tcp: lookup git.local.puvvadi.net on 168.63.129.16:53: no such host

kdpuvvadi avatar Mar 27 '24 14:03 kdpuvvadi

We've also seen this same issue on docker run GitHub Actions steps, so it's not only a docker build issue. For example, a job containing this bash step:

runner@fv-az1797-395:~/work/repo/repo$ docker run -it --rm \
                --env ENV_VAR \
                <aws_account_id>.dkr.ecr.us-east-2.amazonaws.com/repo:latest /bin/bash
root@35f3013f9caf:/opt/code# cat /etc/resolv.conf
# This is /run/systemd/resolve/resolv.conf managed by man:systemd-resolved(8).
# Do not edit.
#
# This file might be symlinked as /etc/resolv.conf. If you're looking at
# /etc/resolv.conf and seeing this text, you have followed the symlink.
#
# This is a dynamic resolv.conf file for connecting local clients directly to
# all known uplink DNS servers. This file lists all configured search domains.
#
# Third party programs should typically not access this file directly, but only
# through the symlink at /etc/resolv.conf. To manage man:resolv.conf(5) in a
# different way, replace this symlink by a static file or a different symlink.
#
# See man:systemd-resolved.service(8) for details about the supported modes of
# operation for /etc/resolv.conf.

nameserver 168.63.129.16
nameserver 100.100.100.100
search vnw05d5vvvpeplv1mpaxmbipab.bx.internal.cloudapp.net tail1abc1.ts.net
root@35f3013f9caf:/opt/code#

With this configuration we had many issues resolving Tailscale hosts inside the container. Fortunately, we managed to fix it using the --dns docker run option:

runner@fv-az1797-395:~/work/repo/repo$ docker run --dns=100.100.100.100 -it --rm \
                --env ENV_VAR \
                <aws_account_id>.dkr.ecr.us-east-2.amazonaws.com/repo:latest /bin/bash
root@35f3013f9caf:/opt/code# cat /etc/resolv.conf
# This is /run/systemd/resolve/resolv.conf managed by man:systemd-resolved(8).
# Do not edit.
#
# This file might be symlinked as /etc/resolv.conf. If you're looking at
# /etc/resolv.conf and seeing this text, you have followed the symlink.
#
# This is a dynamic resolv.conf file for connecting local clients directly to
# all known uplink DNS servers. This file lists all configured search domains.
#
# Third party programs should typically not access this file directly, but only
# through the symlink at /etc/resolv.conf. To manage man:resolv.conf(5) in a
# different way, replace this symlink by a static file or a different symlink.
#
# See man:systemd-resolved.service(8) for details about the supported modes of
# operation for /etc/resolv.conf.

nameserver 100.100.100.100
search vnw05d5vvvpeplv1mpaxmbipab.bx.internal.cloudapp.net tail1abc1.ts.net
root@35f3013f9caf:/opt/code#

Regards

marcelofernandez avatar Apr 30 '24 20:04 marcelofernandez

This also creates a scenario where any action with

runs:
  using: docker

is unreliable.
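
For illustration, even a trivial container action like this hypothetical one can fail to resolve tailnet hosts:

# action.yml of a hypothetical container action
name: example-container-action
description: Hypothetical container action
runs:
  using: docker
  image: Dockerfile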

jaxxstorm avatar Jul 07 '24 19:07 jaxxstorm

I think I've discovered why this happened, at least in my case.

Docker containers on the GitHub Actions runners are on 172.17.0.0/16, the default Docker bridge network:

===== Network Interfaces =====
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host proto kernel_lo 
       valid_lft forever preferred_lft forever
4: eth0@if5: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default 
    link/ether 02:42:ac:11:00:02 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 172.17.0.2/16 brd 172.17.255.255 scope global eth0
       valid_lft forever preferred_lft forever
===== Routing Table =====
default via 172.17.0.1 dev eth0 
172.17.0.0/16 dev eth0 proto kernel scope link src 172.17.0.2 

I have a subnet router that was advertising that same CIDR. As soon as Tailscale connected via the action, any Docker-based workload could no longer reach the DNS server from inside the runner:

Status: Downloaded newer image for nicolaka/netshoot:latest
===== DNS Configuration =====
# Generated by Docker Engine.
# This file can be edited; Docker Engine will not make further changes once it
# has been modified.

nameserver 168.63.129.16
search grvplcbrxqwulonopmolb0o12f.dx.internal.cloudapp.net tail9e93b.ts.net

# Based on host file: '/run/systemd/resolve/resolv.conf' (legacy)
# Overrides: []
===== Network Interfaces =====
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host proto kernel_lo 
       valid_lft forever preferred_lft forever
5: eth0@if6: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default 
    link/ether 02:42:ac:11:00:02 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 172.17.0.2/16 brd 172.17.255.255 scope global eth0
       valid_lft forever preferred_lft forever
===== Routing Table =====
default via 172.17.0.1 dev eth0 
172.17.0.0/16 dev eth0 proto kernel scope link src 172.17.0.2 
===== DNS Resolution Test =====
;; communications error to 168.63.129.16#53: timed out
;; communications error to 168.63.129.16#53: timed out
;; communications error to 168.63.129.16#53: timed out
;; no servers could be reached


;; communications error to 168.63.129.16#53: timed out
;; communications error to 168.63.129.16#53: timed out
;; communications error to 168.63.129.16#53: timed out
;; no servers could be reached
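
If you can't stop advertising 172.17.0.0/16 from the subnet router, one possible workaround (a sketch only, assuming there is no existing /etc/docker/daemon.json on the runner) is to move Docker's default bridge off that range before the Tailscale step:

- name: Move the Docker bridge off 172.17.0.0/16
  run: |
    # "bip" sets the default bridge address; pick a range that doesn't overlap the tailnet routes
    echo '{"bip": "192.168.200.1/24"}' | sudo tee /etc/docker/daemon.json
    sudo systemctl restart docker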

jaxxstorm avatar Jul 07 '24 19:07 jaxxstorm

@kdpuvvadi I had a similar issue and the solution was to set network=host on the Docker BuildX setup step:

      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3
        with:
          driver-opts: |
            network=host

youp-augur avatar Jan 17 '25 16:01 youp-augur

This is not reliable at all; hit or miss. I switched to self-hosted runners and local DNS always worked.

kdpuvvadi avatar Jan 18 '25 04:01 kdpuvvadi

I'm having a similar issue trying to connect to a database from a GitHub Action through Tailscale. I sometimes get a "Temporary failure in name resolution" error, which indicates DNS issues, and sometimes I don't. It's frustrating and is holding up our CI/CD pipeline.
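
One mitigation, assuming the flakiness is tailnet DNS not being ready when the step starts, is to wait for the hostname to resolve before connecting (db.example.ts.net is a placeholder):

- name: Wait for the database host to resolve
  run: |
    for i in $(seq 1 30); do
      getent hosts db.example.ts.net && exit 0
      sleep 2
    done
    echo "DNS for db.example.ts.net never resolved" >&2
    exit 1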

ndekross avatar Jan 23 '25 21:01 ndekross

For buildx builds we had to put:

{"dns": ["100.100.100.100"]}

in:

/etc/docker/daemon.json

This is because buildx itself runs in a container, so it uses whatever DNS Docker is configured with.
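
In a workflow, that might look something like the step below, run before the buildx setup (this sketch assumes the runner has no existing daemon.json worth preserving):

- name: Point Docker at Tailscale MagicDNS
  run: |
    echo '{"dns": ["100.100.100.100"]}' | sudo tee /etc/docker/daemon.json
    sudo systemctl restart docker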

adamcharnock avatar Oct 20 '25 13:10 adamcharnock