
Container doesn't populate resolv.conf properly

Open burnacid opened this issue 3 years ago • 22 comments

🐛 Bug Report

I'd like to supply my own DNS server to the container using the --dns option, but it is not correctly picked up and inserted into /etc/resolv.conf. This makes it impossible to run the container in bridge networking mode.

To Reproduce

Steps to reproduce the behavior:

  • Run the container using --dns 8.8.8.8 (an example invocation follows below)
  • The container will not run properly, as it cannot resolve the DNS for foundryvtt.com
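For reference, the kind of invocation I mean looks roughly like this (credentials and the volume path are placeholders):

# placeholder credentials and volume path; the important part is --dns
docker run -it --rm \
  --dns 8.8.8.8 \
  --env FOUNDRY_USERNAME='user' \
  --env FOUNDRY_PASSWORD='pass' \
  --volume /path/to/data:/data \
  --publish 30000:30000 \
  felddy/foundryvtt:0.7.8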

Expected behavior

I'm expecting the DNS servers to be populated into /etc/resolv.conf. I don't know why they aren't; this works fine for all my other containers.
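Given the dns: list in my compose file below, I'd expect the container's /etc/resolv.conf to end up looking roughly like:

nameserver 192.168.1.2
nameserver 8.8.8.8
nameserver 8.8.4.4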

Any helpful log output

I use docker-compose

version: "3.3"

secrets:
  config_json:
    file: /share/Container/foundryvtt-secrets.json

services:
  foundry:
    image: felddy/foundryvtt:0.7.8
    hostname: foundryvtt
    mac_address: 24:5E:BE:00:00:F6
    dns:
    - 192.168.1.2
    - 8.8.8.8
    - 8.8.4.4
    networks:
      qnet-static-eth0-79e6cc:
        ipv4_address: 192.168.1.246
    volumes:
      - type: bind
        source: /share/Container/foundryvtt
        target: /data
    environment:
      - FOUNDRY_LICENSE_KEY=*
      - CONTAINER_CACHE=/data/container_cache
      - CONTAINER_PATCHES=/data/container_patches
    secrets:
      - source: config_json
        target: config.json
        
networks:
  qnet-static-eth0-79e6cc:
    external: true

Paste the results here:

Entrypoint | 2020-12-25 09:30:07 | [info] Starting felddy/foundryvtt container v0.7.8                                                                        
Entrypoint | 2020-12-25 09:30:07 | [info] Reading configured secrets from: /run/secrets/config.json                                                          
Entrypoint | 2020-12-25 09:30:09 | [info] No Foundry Virtual Tabletop installation detected.                                                                 
Entrypoint | 2020-12-25 09:30:09 | [info] Using FOUNDRY_USERNAME and FOUNDRY_PASSWORD to authenticate.                                                       
Authenticate | 2020-12-25 09:30:14 | [info] Requesting CSRF tokens from https://foundryvtt.com                                                               
Authenticate | 2020-12-25 09:30:19 | [error] Unable to authenticate: request to https://foundryvtt.com/ failed, reason: getaddrinfo EAI_AGAIN foundryvtt.com

The /etc/resolv.conf

nameserver 127.0.0.11
options ndots:0 

burnacid avatar Dec 25 '20 09:12 burnacid

I'm seeing the same issue running in Kubernetes. Might be related to this bug in Alpine. Edit: scratch that. I rebuilt using node:12-alpine3.10 and still had the problem.

jdmarble avatar Jan 03 '21 00:01 jdmarble

I ported to node:12-slim to work around the problem successfully. I'm running into a lot of DNS issues with Alpine-based images; not sure if it's my k8s cluster's configuration or what.
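The port was essentially just the base-image swap at the top of the Dockerfile; roughly (the rest of the Dockerfile stays as upstream has it):

# swap the upstream Alpine base for the Debian slim variant (sketch; exact upstream tag may differ)
FROM node:12-slim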

jdmarble avatar Jan 03 '21 21:01 jdmarble

Thanks for the research on this. I'm not entirely against switching the base image from Alpine to Debian. I'd like to give upstream a bit of time to resolve this before jumping ship.

@jdmarble what was the impact to the image size using Debian-slim?

felddy avatar Jan 06 '21 17:01 felddy

I expected the Debian (even slim) based image to be larger than the Alpine one. I was surprised, although I'm not sure I can trust the results because I don't understand them. I'm getting different numbers depending on the source.

$ podman image ls
REPOSITORY                                      TAG            IMAGE ID      CREATED      SIZE
registry.gitlab.com/jdmarble/foundryvtt-docker  develop        6ad53b690aeb  3 days ago   106 MB
docker.io/felddy/foundryvtt                     latest         e3706094d2a7  2 weeks ago  111 MB

The GitLab repo reports 32.56 MiB (edit: 34.14 MB) for my "slim" spin. Your image size badge reports 34 MB. Docker Hub reports the compressed image size for felddy/foundryvtt as 33.92 MB. I tried pushing my image to Docker Hub to get an apples-to-apples comparison, but it's taking a while to show up.

Maybe podman is reporting uncompressed size?

Regardless, I wouldn't suggest something as drastic as a base image change only to fix this type of problem, but a slightly smaller image size (if it's real) might be an interesting bonus. :)

jdmarble avatar Jan 07 '21 03:01 jdmarble

I think I'm being affected by this issue too, but in the weirdest way I could imagine. I've spent the last 4 hours debugging and searching lol. I'm spinning this up in Kubernetes.

Started when I got errors talking about rejected certs during the download process. I managed to get a shell into a container, and voila!

(all four commands ran in quick succession) [screenshot of the curl output]

The 404s are from my own public-facing traefik instance, and then it eventually curls correctly, seemingly at random. The next request was back to the 404s.

I'm going to try building the image myself from different bases like @jdmarble did, but this is just an impact report I guess

Edit: Bless you, jdmarble, you forked and pushed your port. May the coding gods smile upon you.

adam8797 avatar Feb 27 '21 06:02 adam8797

Update: Looks like that was unsuccessful. I was able to build the image successfully, but I still have the same problem. Sorry for the noise. Considering this may be unrelated, I can move my information to another ticket if you prefer.

adam8797 avatar Feb 27 '21 06:02 adam8797

I also have this networking issue in my k3s cluster. @jdmarble's repo worked :D

annonch avatar Mar 14 '21 19:03 annonch

I had hoped upstream would have fixed this issue in busybox, but that doesn't seem to be happening. Also, this is starting to affect more people.

I have started a branch using the node:14-slim base image: https://github.com/felddy/foundryvtt-docker/tree/improvement/debian

I'm a little concerned about the size increase (but it is not a show stopper):

❱ docker images | grep foundry
felddy/foundryvtt                       0.7.9-slim        ce29f9a2bc03   44 minutes ago      195MB
felddy/foundryvtt                       0.8.0             f676a803cfcb   3 weeks ago         126MB
felddy/foundryvtt                       release           e3706094d2a7   2 months ago        103MB
felddy/foundryvtt                       release-0.7.9     38a78b0459a4   2 months ago        103MB

The bigger issue that I need to resolve is that only half of the architectures supported by Alpine are offered by Debian:

os/arch node:14-alpine node:14-slim
linux/amd64
linux/arm/v6
linux/arm/v7
linux/arm64/v8
linux/ppc64le
linux/s390x

I don't have any idea how many users this would impact. I'd guess that loss of arm/v6 would be the biggest impact. I know a good number of people run Foundry on Raspberry Pis and this would remove support for the RPi 1 B and RPi 1 B+.

In any case, if you'd like to test the image from this branch it is available to be pulled as felddy/foundryvtt:improvement-debian. I would appreciate any feedback from the folks on this issue since I don't have a K8s cluster readily available.
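A quick smoke test along these lines should exercise the DNS path (credentials and paths are placeholders):

docker run -it --rm \
  --env FOUNDRY_USERNAME='user' \
  --env FOUNDRY_PASSWORD='pass' \
  --volume /path/to/data:/data \
  --publish 30000:30000 \
  felddy/foundryvtt:improvement-debian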

If you have any comments about the limited architectures, that would also be helpful.

felddy avatar Mar 15 '21 16:03 felddy

Could I also get folks to try running this and post the results? I'm unable to reproduce the behavior here and want to verify that it hasn't been fixed upstream:

❱ docker run -it --rm --dns 8.8.8.8 node:14-alpine nslookup foundryvtt.com
Server:		8.8.8.8
Address:	8.8.8.8:53

Non-authoritative answer:

Non-authoritative answer:
Name:	foundryvtt.com
Address: 44.234.61.225

felddy avatar Mar 16 '21 15:03 felddy

Sure thing. I'll test it this evening (or possibly tomorrow if I run out of time) and I'll post back here

adam8797 avatar Mar 16 '21 15:03 adam8797

In case this is helpful: I noticed that felddy/foundryvtt:improvement-debian worked fine; however, the following errors occurred with felddy/foundryvtt:latest

Entrypoint | 2021-03-16 16:15:16 | [debug] Timezone set to: UTC
Entrypoint | 2021-03-16 16:15:16 | [info] Starting felddy/foundryvtt container v0.7.9
Entrypoint | 2021-03-16 16:15:16 | [debug] CONTAINER_VERBOSE set.  Debug logging enabled.
Entrypoint | 2021-03-16 16:15:16 | [info] No Foundry Virtual Tabletop installation detected.
Entrypoint | 2021-03-16 16:15:16 | [info] Using FOUNDRY_USERNAME and FOUNDRY_PASSWORD to authenticate.
Authenticate | 2021-03-16 16:15:16 | [debug] Saving cookies to: cookiejar.json
Authenticate | 2021-03-16 16:15:16 | [info] Requesting CSRF tokens from https://foundryvtt.com
Authenticate | 2021-03-16 16:15:16 | [debug] Fetching: https://foundryvtt.com
Authenticate | 2021-03-16 16:15:16 | [error] Unable to authenticate: request to https://foundryvtt.com/ failed, reason: getaddrinfo ENOTFOUND foundryvtt.com

Results Locally

Unable to find image 'node:14-alpine' locally
14-alpine: Pulling from library/node
e95f33c60a64: Pull complete 
0f691a8bb887: Pull complete 
daf9b71c0a0d: Pull complete 
d92a928c7b7d: Pull complete 
Digest: sha256:a75f7cc536062f9266f602d49047bc249826581406f8bc5a6605c76f9ed18e98
Status: Downloaded newer image for node:14-alpine
Server:         8.8.8.8
Address:        8.8.8.8:53

Non-authoritative answer:
Name:   foundryvtt.com
Address: 44.234.61.225

Non-authoritative answer:

Inside k3s (YAML included; this also worked when setting the DNS server to 8.8.8.8):

apiVersion: batch/v1
kind: Job
metadata:
  name: hello
spec:
  template:
    # This is the pod template
    spec:
      containers:
      - name: dns-test
        image: node:14-alpine
        command: ['nslookup', 'foundryvtt.com']
      restartPolicy: OnFailure
---

Server:         10.43.0.10
Address:        10.43.0.10:53

Non-authoritative answer:

Non-authoritative answer:
Name:   foundryvtt.com
Address: 44.234.61.225

annonch avatar Mar 16 '21 16:03 annonch

@annonch Those are promising results.

When you get a chance could you check if the nightly build is exhibiting the same behavior as the last release: felddy/foundryvtt:nightly

If node:14-alpine is working, I'd expect that felddy/foundryvtt:nightly should work as well.

🤞

felddy avatar Mar 18 '21 15:03 felddy

Unfortunately I can't provide such good results. I'm running these in kubernetes

I just curled the Foundry website to test resolution. Here I used wc to condense the output: the 4-word result is the bad DNS resolution, the 698-word result is the proper web page.

I tried this against improvement-debian but the behavior is still there:

[screenshot: weird_dns_behavior_1]

And against nightly it was all 4s. I didn't get a single good hit to the Foundry website.

Now, if I'm the only one here, I'm willing to concede that it's just my setup, this may be unrelated, and I'm just making noise 😆

I can work around it by setting my DNS policy to None and manually assigning DNS servers.
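For anyone else hitting this, that workaround looks roughly like this in the pod spec (the nameservers are placeholders):

spec:
  dnsPolicy: "None"
  dnsConfig:
    nameservers:
      - 8.8.8.8
      - 8.8.4.4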

adam8797 avatar Mar 19 '21 01:03 adam8797

@adam8797 how are you running your K8s? I never had an issue with DNS resolution using the Alpine container.

I did 1000 requests in a row using @felddy's example command and they all came out clean.

I know that with K8s, policies or security groups (if you are running in AWS) can sometimes result in inconsistent DNS resolution. I'm running Foundry today on an RPi 4 with k3s, locally with Compose, and on a server with kind and k8s for development and testing.

If you guys have any other set of tests that I could run please let me know.

hugoprudente avatar Apr 04 '21 18:04 hugoprudente

I am also having this issue on a k8s cluster setup via kubeadm. This is the only container exhibiting the behavior and does so on both nightly and release-0.7.9. Not sure if it matters, but my k8s cluster is using CoreDNS and not kube-dns.

aetaric avatar Apr 13 '21 05:04 aetaric

Hi @aetaric, how is the network on your clusters configured? I've seen problems with k8s and CoreDNS name resolution caused by security groups and firewall rules between the nodes.

I've never had issues in any of my environments, and my K8s development cluster, which runs on AWS with EKS, also uses CoreDNS and doesn't have the problem.

hugoprudente avatar Apr 13 '21 12:04 hugoprudente

Well, I am using flannel as the backing network fabric, so no network policy antics should be going on. I am running in vxlan mode for communication between nodes, so that might have something to do with it?

As for physical and logical networking, all k8s nodes are same VLAN, same ToR switch, same subnet.

As I mentioned before, other containers are able to resolve DNS without issue, and improvement-debian does seem to work, if not perfectly, at least well enough for the container to pull the app distribution and license info.

aetaric avatar Apr 13 '21 20:04 aetaric

So I might have some insight into what the container is doing weird here. I was reviewing my DNS query logs and it seems the container is appending the search domain from the DHCP options to the Foundry address.

Got query for foundryvtt.com.k8s.domain.tld|A from 192.168.9.254:8517, relayed to 192.168.8.100:53
Got query for foundryvtt.com.k8s.domain.tld|AAAA from 192.168.9.254:62140, relayed to 192.168.8.100:53
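You can see the search list a pod is actually using with something like this (the pod name is a placeholder):

kubectl exec -it foundryvtt-pod -- cat /etc/resolv.conf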

aetaric avatar Apr 14 '21 04:04 aetaric

I've resolved the DNS issue I've been having while running this and other Alpine based images in Kubernetes clusters on my network.

Short answer: I turned off DNSSEC for my domain name managed by Cloudflare and everything started working.

Read on for details.

Some information about my setup:

  • I use Cloudflare DNS to set up DNS TXT entries for Let's Encrypt so that my internal-only servers can have browser-trusted certificates.
  • I don't use Cloudflare DNS for normal (A, AAAA, etc...) DNS records for my internal domain. I have an internal, Unbound DNS service for that.
  • Crucially, I had DNSSEC enabled for my internal domain in the Cloudflare DNS settings. I must have enabled it when I had different plans for that domain.

Some general information about what causes the problem for me (and possibly for you):

  • When Kubernetes starts a container, it adds search domains and options ndots:5 to /etc/resolv.conf inside the container
    • It copies the search domains from the host (my local domain, say, mylocaldomain.tld in my case) and adds a bunch of Kubernetes-specific ones like cluster.local and svc.cluster.local.
    • This resolv.conf configuration has to do with looking up local services inside the cluster.
    • Aside: you can also override ndots to be "1" in each pod spec to solve the problem in another way (see the sketch after this list)
  • Now, when a DNS lookup for, say, foundryvtt.com is performed inside of a container, all of those search domains are checked first. For example, foundryvtt.com.svc.cluster.local, then foundryvtt.com.cluster.local, and foundryvtt.com.mylocaldomain.tld. Finally, if none of those other domains "resolve", then foundryvtt.com is checked.
    • The ...cluster.local domains are rejected by CoreDNS inside of the cluster, I guess. No beef with those.
    • foundryvtt.com.mylocaldomain.tld escapes the cluster and gets to my internal Unbound DNS server.
    • Unbound doesn't recognize it, so passes it, transparently, to another DNS server (8.8.8.8, Google's public DNS in my case).
      • Maybe I should configure Unbound to reject anything with that base domain that it doesn't recognize?
    • That DNS server recognizes the mylocaldomain.tld part and asks Cloudflare how to resolve it because Cloudflare is the authority on that particular domain.
    • Cloudflare would normally respond with NXDOMAIN, which, I guess (not a DNS expert here) means "doesn't exist". Instead, because I had DNSSEC enabled, it responds with NOERROR, but doesn't respond with an actual IP address. This is something like "I can neither confirm nor deny the existence of that or related domains". Read here about how Cloudflare justifies that response.
    • That "no comment" response winds its way back to the original requestor. Any non-musl-based DNS client library would then shrug and continue looking through the search domains until it got to the implied '.' and tried 'foundryvtt.com', with a happy ending. musl will stop looking after receiving a NOERROR. Read here about how musl justifies that response.
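For completeness, the ndots override mentioned in the aside above looks roughly like this in a pod spec (sketch):

spec:
  dnsConfig:
    options:
      - name: ndots
        value: "1"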

Here are some links that helped me figure this out:

I could verify that this was a problem and that my fix worked using alpine/git and dig.

Before fix:

[jdmarble@jdmarble-desktop ~]$ kubectl run alpine-git --image=alpine/git --restart=Never -it --rm clone https://github.com/octocat/Spoon-Knife.git
fatal: unable to access 'https://github.com/octocat/Spoon-Knife.git/': Could not resolve host: github.com
...

(note that github.com did not resolve inside an Alpine based container inside of the cluster)

[jdmarble@jdmarble-desktop ~]$ dig github.com.mylocaldomain.tld
...
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 26637
...
;; AUTHORITY SECTION:
mylocaldomain.tld.		1720	IN	SOA	cleo.ns.cloudflare.com. dns.cloudflare.com. ...
...

(note the NOERROR response)

After fix:

[jdmarble@jdmarble-desktop ~]$ kubectl run alpine-git --image=alpine/git --restart=Never -it --rm clone https://github.com/octocat/Spoon-Knife.git
Cloning into 'Spoon-Knife'...
remote: Enumerating objects: 16, done.
remote: Total 16 (delta 0), reused 0 (delta 0), pack-reused 16
Receiving objects: 100% (16/16), done.
Resolving deltas: 100% (3/3), done.

(note that github.com resolved inside an Alpine based container inside of the cluster)

[jdmarble@jdmarble-desktop ~]$ dig github.com.mylocaldomain.tld
...
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 56469
...
;; AUTHORITY SECTION:
mylocaldomain.tld.		1044	IN	SOA	cleo.ns.cloudflare.com. dns.cloudflare.com. ...
...

(note the NXDOMAIN response)

In my case, it was an easy decision to disable DNSSEC because the domain is only used internally and I'm not using Cloudflare for normal records. If you want to keep DNSSEC on, you may have to get creative or switch away from Cloudflare.

jdmarble avatar Jun 13 '21 05:06 jdmarble

I have upgraded my K8s cluster to 1.22 and got this error for the first time.

Just to leave the fix registered here: for me it was ensuring that CoreDNS forwards resolution to an external resolver, by adding 8.8.8.8 to the ConfigMap.

Data
====
Corefile:
----
.:53 {
    errors
    health {
       lameduck 5s
    }
    ready
    kubernetes cluster.local in-addr.arpa ip6.arpa {
       pods insecure
       fallthrough in-addr.arpa ip6.arpa
       ttl 30
    }
    prometheus :9153
    forward . 8.8.8.8 /etc/resolv.conf {
       max_concurrent 1000
    }
    cache 30
    loop
    reload
    loadbalance
}
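For anyone wanting to do the same, I edited the ConfigMap and restarted CoreDNS with commands along these lines:

kubectl -n kube-system edit configmap coredns
kubectl -n kube-system rollout restart deployment coredns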

hugoprudente avatar Aug 21 '21 19:08 hugoprudente

I have not been able to fix this yet, but I suspect this may be an issue with CoreDNS.

Lookups for foundryvtt.com appear to be failing because passthrough to the upstream resolver does not seem to be working.

From the CoreDNS logs:

[INFO] 10.1.182.28:51321 - 64102 "A IN foundryvtt.com.svc.cluster.local. udp 50 false 512" NXDOMAIN qr,aa,rd 143 0.000390493s
[INFO] 10.1.182.28:51321 - 41623 "A IN foundryvtt.com.cluster.local. udp 46 false 512" NXDOMAIN qr,aa,rd 139 0.000535954s
[INFO] 10.1.182.28:51321 - 17998 "A IN foundryvtt.com.local. udp 38 false 512" SERVFAIL qr,rd,ra 113 0.03611267s

No lookups for the bare foundryvtt.com, though.
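As a quick check, a fully qualified lookup (with a trailing dot) from inside the pod should bypass the search list entirely:

# the trailing dot marks the name as fully qualified, so the search suffixes are skipped
nslookup foundryvtt.com.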

BitRacer avatar Jan 12 '22 19:01 BitRacer

I'll test this again on my 3 k8s clusters with the Alpine image (my default), and update here and in the other thread too. I still have the 8.8.8.8 in my CoreDNS, so I'll try both and edit this post.

My 3 clusters currently run this K8s version:

Client Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.1", GitCommit:"632ed300f2c34f6d6d15ca4cef3d3c7073412212", GitTreeState:"clean", BuildDate:"2021-08-19T15:45:37Z", GoVersion:"go1.16.7", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.0", GitCommit:"c2b5237ccd9c0f1d600d3072634ca66cefdf272f", GitTreeState:"clean", BuildDate:"2021-08-04T20:01:24Z", GoVersion:"go1.16.6", Compiler:"gc", Platform:"linux/amd64"}

Running CoreDNS k8s.gcr.io/coredns/coredns:v1.8.4

➜ k describe replicaset coredns-78fcd69978 -n kube-system
Name:           coredns-78fcd69978
Namespace:      kube-system
Selector:       k8s-app=kube-dns,pod-template-hash=78fcd69978
Labels:         k8s-app=kube-dns
                pod-template-hash=78fcd69978
Annotations:    deployment.kubernetes.io/desired-replicas: 2
                deployment.kubernetes.io/max-replicas: 3
                deployment.kubernetes.io/revision: 1
Controlled By:  Deployment/coredns
Replicas:       2 current / 2 desired
Pods Status:    2 Running / 0 Waiting / 0 Succeeded / 0 Failed
Pod Template:
  Labels:           k8s-app=kube-dns
                    pod-template-hash=78fcd69978
  Service Account:  coredns
  Containers:
   coredns:
    Image:       k8s.gcr.io/coredns/coredns:v1.8.4
    Ports:       53/UDP, 53/TCP, 9153/TCP
    Host Ports:  0/UDP, 0/TCP, 0/TCP
    Args:
      -conf
      /etc/coredns/Corefile
    Limits:
      memory:  170Mi
    Requests:
      cpu:        100m
      memory:     70Mi
    Liveness:     http-get http://:8080/health delay=60s timeout=5s period=10s #success=1 #failure=5
    Readiness:    http-get http://:8181/ready delay=0s timeout=1s period=10s #success=1 #failure=3
    Environment:  <none>
    Mounts:
      /etc/coredns from config-volume (ro)
  Volumes:
   config-volume:
    Type:               ConfigMap (a volume populated by a ConfigMap)
    Name:               coredns
    Optional:           false
  Priority Class Name:  system-cluster-critical
Events:                 <none>

Confirmed, with the same error:

Authenticate | 2022-01-24 19:52:07 | [error] Unable to authenticate: request to https://foundryvtt.com/auth/login/ failed, reason: getaddrinfo EAI_AGAIN foundryvtt.com

I have found something interesting that may solve the issue.

From the Node.js docs: "Though the call to dns.lookup() will be asynchronous from JavaScript's perspective, it is implemented as a synchronous call to getaddrinfo(3) that runs on libuv's threadpool. This can have surprising negative performance implications for some applications, see the UV_THREADPOOL_SIZE documentation for more information."

See https://nodejs.org/api/cli.html#cli_uv_threadpool_size_size and, for more detail, https://medium.com/@amirilovic/how-to-fix-node-dns-issues-5d4ec2e12e95

This solved my issue running 200 deployments.
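For reference, it's just an environment variable on the container; in a compose file it would look something like this (the value is illustrative and needs tuning):

    environment:
      - UV_THREADPOOL_SIZE=8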

hugoprudente avatar Jan 17 '22 12:01 hugoprudente

This issue has been automatically marked as stale because it has been inactive for 28 days. To reactivate the issue, simply post a comment with the requested information to help us diagnose this issue. If this issue remains inactive for another 7 days, it will be automatically closed.

github-actions[bot] avatar Sep 21 '22 10:09 github-actions[bot]

This issue has been automatically closed due to inactivity. If you are still experiencing problems, please open a new issue.

github-actions[bot] avatar Sep 28 '22 10:09 github-actions[bot]