[v0.21] custom DNS nameservers require IP addresses

Open dancysoft opened this issue 7 months ago • 12 comments

Contributing guidelines and issue reporting guide

Well-formed report checklist

  • [x] I have found a bug that the documentation does not mention anything about my problem
  • [x] I have found a bug that there are no open or closed issues that are related to my problem
  • [x] I have provided version/information about my environment and done my best to provide a reproducer

Description of bug

Bug description

After upgrading from buildkitd 0.20.0 to 0.21.0, some of our buildkitd installations always fail to build:

$ buildctl ... \
  --frontend=gateway.v0 \
  --opt source=docker-registry.wikimedia.org/repos/releng/blubber/buildkit:v0.23.0 \
  --opt filename=.pipeline/blubber.yaml \
  --opt target=test \
  --local context=. \
  --local dockerfile=.
2025-05-29 15:05:23,279 Using build frontend docker-registry.wikimedia.org/repos/releng/blubber/buildkit:v0.23.0
#1 resolve image config for docker-image://docker-registry.wikimedia.org/repos/releng/blubber/buildkit:v0.23.0
#1 DONE 0.2s
#2 docker-image://docker-registry.wikimedia.org/repos/releng/blubber/buildkit:v0.23.0@sha256:6b1535a39497bb6c5e0a733595721a91cee33dba99ab59d8323d077665073a53
#2 resolve docker-registry.wikimedia.org/repos/releng/blubber/buildkit:v0.23.0@sha256:6b1535a39497bb6c5e0a733595721a91cee33dba99ab59d8323d077665073a53 0.0s done
#2 CACHED
error: failed to solve: ParseAddr("ns-recursor.openstack.eqiad1.wikimediacloud.org"): unexpected character (at "ns-recursor.openstack.eqiad1.wikimediacloud.org")

The problem goes away when buildkitd is downgraded to 0.20.0.

We have two different clusters of buildkitd instances. Both clusters were upgraded to 0.21.0 at the same time; one of them builds fine and the other has this problem.

Notes:

  • ParseAddr is expecting an IP address string, not a domain name.
  • buildkitd.toml has:
[dns]
nameservers = ["ns-recursor.openstack.eqiad1.wikimediacloud.org"]
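
The error text matches the format produced by Go's `net/netip.ParseAddr`, which accepts only literal IPv4 or IPv6 addresses. A minimal sketch of that behavior follows; the helper name `isIPLiteral` and the two documentation-range addresses are made up for illustration and are not taken from BuildKit:

```go
package main

import (
	"fmt"
	"net/netip"
)

// isIPLiteral reports whether s parses as a literal IPv4 or IPv6 address.
// Hypothetical helper for illustration; not part of BuildKit.
func isIPLiteral(s string) bool {
	_, err := netip.ParseAddr(s)
	return err == nil
}

func main() {
	candidates := []string{
		"ns-recursor.openstack.eqiad1.wikimediacloud.org", // hostname from the report
		"192.0.2.53",   // placeholder IPv4 literal (documentation range)
		"2001:db8::53", // placeholder IPv6 literal (documentation range)
	}
	for _, ns := range candidates {
		_, err := netip.ParseAddr(ns)
		fmt.Printf("%q literal=%v err=%v\n", ns, isIPLiteral(ns), err)
	}
}
```

Only the literal addresses parse; the hostname is rejected before any DNS lookup is attempted.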

Example Job log: https://gitlab.wikimedia.org/dancy/deleteme/-/jobs/522422

Version information

This started happening with buildkitd 0.21.0 and persists in 0.22.0. Version 0.20.0 does not exhibit this behavior.

dancysoft avatar May 29 '25 20:05 dancysoft

If someone can provide advice on how to collect a full stack trace (from buildkitd, not buildctl) when this error occurs, I'll try it out.

dancysoft avatar May 29 '25 20:05 dancysoft

I ended up finding the following in buildkitd's config file (not sure why I didn't look there first):

[dns]
nameservers = ["ns-recursor.openstack.eqiad1.wikimediacloud.org"]

dancysoft avatar May 29 '25 22:05 dancysoft

Oops, I didn't mean to close this issue. Anyway, there is a change in behavior between the aforementioned buildkitd versions. I don't know if that's something y'all want to fix. In the meantime I'll make sure we use an IP address string.

dancysoft avatar May 29 '25 22:05 dancysoft

Thanks @thaJeztah

dancysoft avatar May 29 '25 23:05 dancysoft

@thaJeztah Seems to be regression from https://github.com/moby/moby/commit/00bd916203d01831bea2173ead6cd6736b53a877#diff-a8ba0929fdc2848f37b485f438bcd8923d63d07db6979f108aabc611aa8707f4R7-R131

cc @robmry

tonistiigi avatar May 30 '25 22:05 tonistiigi

A hostname can't be a nameserver: a nameserver address is needed before a name can be resolved into an address. So I guess DNS from build containers running on that host just didn't work before? (Or an IP address got picked up from somewhere else?)

Failing on misconfiguration seems better than silently ignoring it. But, perhaps it could be reported when validating the config file, so the error message can be clearer about where the problem is?

(Or, perhaps the hostname should be resolved by buildkit, to get an address to use in the container's resolv.conf?)

robmry avatar May 31 '25 09:05 robmry

(Catching up) Ah! I see the issue now (didn't read in-depth when I reopened). Yes, it does look like it would silently ignore the invalid configuration before. So either BuildKit is missing a validation step, or, if it was intentional to allow a domain name to be specified, I guess BuildKit should somehow resolve the domain before passing it to the code that writes the resolv.conf 🤔

(We should probably also check whether the resolvconf code can make the error message more informative.)
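
For the "missing validation step" option, a check at config-load time could name the offending key so the failure points at buildkitd.toml rather than surfacing mid-build. A rough sketch, assuming a hypothetical `validateNameservers` function and an error wording invented here:

```go
package main

import (
	"fmt"
	"net/netip"
)

// validateNameservers checks the values of [dns] nameservers and returns an
// error that names the config key and index of the first non-IP entry.
// Hypothetical function and message wording; not BuildKit's implementation.
func validateNameservers(nameservers []string) error {
	for i, ns := range nameservers {
		if _, err := netip.ParseAddr(ns); err != nil {
			return fmt.Errorf("buildkitd.toml: dns.nameservers[%d] = %q is not an IP address: %w", i, ns, err)
		}
	}
	return nil
}

func main() {
	if err := validateNameservers([]string{"one.one.one.one"}); err != nil {
		fmt.Println(err)
	}
}
```

Wrapping with `%w` keeps the underlying parse error available to callers while adding where the bad value came from.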

thaJeztah avatar Jun 02 '25 08:06 thaJeztah

I would have expected it to resolve it first in the daemon scope, as this was reported as a regression. I don't see where it would happen in https://github.com/moby/buildkit/issues/6001#issuecomment-2923597403 though. If it never worked then of course no need to add a special feature for it.

@dancysoft can you confirm whether this is a regression, or an invalid config that now receives an error?

tonistiigi avatar Jun 02 '25 16:06 tonistiigi

Sorry for the late reply. I was off for a week.

I re-tested everything today with the following buildkitd.toml:

# Use CNI to isolate each build container network namespace
networkMode = "cni"

# Pre-allocate a pool of network namespaces
cniPoolSize = 20

[dns]
nameservers = ["one.one.one.one"]

Buildkitd is started like so:

docker network create deleteme
version=v0.20.0
docker run -d --name buildkitd --privileged \
       -v ./buildkitd.toml:/etc/buildkit/buildkitd.toml:ro \
       -p 1234:1234 \
       --network deleteme \
       moby/buildkit:$version \
       --addr tcp://0.0.0.0:1234 \
       --config \
       /etc/buildkit/buildkitd.toml

This configuration works fine when version=v0.20.0. I can successfully build buildkitd with it using the following command:

(The current directory is a git clone of the buildkit repo)

docker run --rm -it \
       --network deleteme \
       --entrypoint buildctl \
       -v .:/src:ro \
       moby/buildkit:v0.20.0 \
       --addr tcp://buildkitd:1234 build \
       --frontend dockerfile.v0 \
       --local context=/src \
       --local dockerfile=src \
       --progress=plain

If I change to version=v0.21.0 and restart the buildkitd container, the build fails:

dancy@base:~/src/wmf/buildkit$ ./test-build
#1 [internal] load build definition from Dockerfile
#1 transferring dockerfile: 18.32kB done
#1 DONE 0.1s

#2 resolve image config for docker-image://docker.io/docker/dockerfile-upstream:master
#2 DONE 1.3s

#3 docker-image://docker.io/docker/dockerfile-upstream:master@sha256:7a6acb5d355f1fdfa63b5930b6a03c1370ebd425c50d6c6c0861004fe4e247d6
#3 resolve docker.io/docker/dockerfile-upstream:master@sha256:7a6acb5d355f1fdfa63b5930b6a03c1370ebd425c50d6c6c0861004fe4e247d6 0.0s done
#3 sha256:93ad73e33b81ab605ab21198d4fe790d80d17e766ad941ecc527d61b9e22252d 0B / 14.08MB 0.2s
#3 sha256:93ad73e33b81ab605ab21198d4fe790d80d17e766ad941ecc527d61b9e22252d 6.29MB / 14.08MB 0.3s
#3 sha256:93ad73e33b81ab605ab21198d4fe790d80d17e766ad941ecc527d61b9e22252d 14.08MB / 14.08MB 0.4s done
#3 extracting sha256:93ad73e33b81ab605ab21198d4fe790d80d17e766ad941ecc527d61b9e22252d 0.1s done
#3 DONE 0.6s
Dockerfile:1
--------------------
   1 | >>> # syntax=docker/dockerfile-upstream:master
   2 |
   3 |     ARG RUNC_VERSION=v1.2.5
--------------------
error: failed to solve: ParseAddr("one.one.one.one"): unexpected character (at "one.one.one.one")

dancysoft avatar Jun 10 '25 16:06 dancysoft

Thanks @dancysoft - the error is definitely new, and its message will be more helpful in the next release (https://github.com/moby/moby/pull/50124).

But, we think a non-IP-address nameserver would have been silently ignored in older releases (not treated as a hostname and resolved, or anything like that). So, the build container wouldn't have been using the expected nameserver.

robmry avatar Jun 10 '25 17:06 robmry

> Thanks @dancysoft - the error is definitely new, and its message will be more helpful in the next release (moby/moby#50124).
>
> But, we think a non-IP-address nameserver would have been silently ignored in older releases (not treated as a hostname and resolved, or anything like that). So, the build container wouldn't have been using the expected nameserver.

I see. And I presume the following (previously overlooked) buildkitd log message is evidence of this?

time="2025-06-10T19:00:55Z" level=info msg="No non-localhost DNS nameservers are left in resolv.conf. Using default external servers"
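
The behavior described above (a non-IP entry being dropped, followed by a fallback to defaults) can be modeled roughly as below. This is a simplified model inferred from the thread and the log line, not the actual older code; `usableNameservers` is a hypothetical name and the fallback addresses are documentation-range placeholders, not the real defaults.

```go
package main

import (
	"fmt"
	"net/netip"
)

// fallbackServers stands in for the "default external servers" mentioned in
// the log message. The addresses are placeholders, not the real defaults.
var fallbackServers = []netip.Addr{
	netip.MustParseAddr("192.0.2.1"),
	netip.MustParseAddr("192.0.2.2"),
}

// usableNameservers models the older, lenient behavior: entries that are not
// IP literals, or that are loopback addresses, are skipped without an error.
// If nothing remains, the fallback list is returned and usedFallback is true.
func usableNameservers(entries []string) (addrs []netip.Addr, usedFallback bool) {
	for _, e := range entries {
		addr, err := netip.ParseAddr(e)
		if err != nil || addr.IsLoopback() {
			continue // silently ignored, which is what hid the misconfiguration
		}
		addrs = append(addrs, addr)
	}
	if len(addrs) == 0 {
		return fallbackServers, true
	}
	return addrs, false
}

func main() {
	addrs, usedFallback := usableNameservers([]string{"one.one.one.one"})
	fmt.Println(addrs, usedFallback)
}
```

Under this model a hostname-only list ends up identical to an empty one, which is consistent with builds succeeding on 0.20.0 while not using the configured nameserver.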

dancysoft avatar Jun 10 '25 19:06 dancysoft

Ah, yes - exactly! Thank you.

robmry avatar Jun 10 '25 19:06 robmry