unbound-docker

HEALTHCHECK passes even when unbound can't resolve addresses

Open tnyeanderson opened this issue 1 year ago • 3 comments

Describe the bug
The health check does not actually ensure that unbound is able to resolve domain names. If the domain for the check (default cloudflare.com) has previously been queried successfully, the health check will be successful even if the container has no network access (and therefore is unable to perform as a recursive nameserver).

To Reproduce
Steps to reproduce the behavior:

  1. Ensure the container will be able to reach the internet, then start it:
     docker run -d --rm --name=unbound-test mvance/unbound:latest
  2. Make sure the healthcheck completes at least once (use CTRL+C to stop watching):
     watch "docker inspect unbound-test | jq '.[0].State.Health'"
  3. Cut internet access for the container, either by unplugging the cable on the host, using a firewall, or any other method.
  4. Try resolving any non-cached domain using unbound. It won't work:
     docker exec unbound-test drill @127.0.0.1 duckduckgo.com
  5. Continue watching the healthcheck. Because the value is already cached by unbound, the check returns no error, as if everything were fine even though it is not.
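
The steps above can be scripted roughly as follows. This is only a sketch: it assumes the container sits on Docker's default network named bridge, so that docker network disconnect is enough to cut its access, and the 60-second wait is a guess at how long the next healthcheck takes to run. The function names are made up for illustration.

```shell
#!/bin/sh
# Sketch of the reproduction steps. The container name matches the steps above;
# the network name "bridge" is an assumption (Docker's default network).
CONTAINER="${CONTAINER:-unbound-test}"

health_status() {
    # Prints the health state reported by Docker: starting, healthy, or unhealthy.
    docker inspect --format '{{.State.Health.Status}}' "$CONTAINER"
}

wait_until_healthy() {
    # Poll until the first healthcheck has passed, giving up after about 5 minutes.
    tries=0
    while [ "$(health_status)" != "healthy" ]; do
        tries=$((tries + 1))
        [ "$tries" -ge 60 ] && return 1
        sleep 5
    done
}

reproduce() {
    docker run -d --rm --name="$CONTAINER" mvance/unbound:latest || return 1
    wait_until_healthy || return 1
    # Cut network access for this container only.
    docker network disconnect bridge "$CONTAINER" || return 1
    # A non-cached name should now fail to resolve...
    docker exec "$CONTAINER" drill @127.0.0.1 duckduckgo.com
    # ...while the reported health is expected to stay "healthy".
    # 60 seconds is a guess; wait for at least one more healthcheck run.
    sleep 60
    health_status
}
```

Calling reproduce should print the failed drill query followed by a health status that is still healthy, if the bug is present.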

Expected behavior
If the recursive nameserver can't resolve names, the HEALTHCHECK should fail.

Error messages
Exactly... there are none! :)

Additional context
With these as the default values, it could take a while for the healthcheck to realize it should fail:

cache-max-ttl: 86400
cache-min-ttl: 300

So far I see two ways to confront the problem. 1) Create another forward-zone: entry for the healthcheck domain with the cache disabled, maybe something like below:

forward-zone:
  name: "cloudflare.com."
  forward-no-cache: yes
  forward-addr: 1.1.1.1@853#cloudflare-dns.com
  forward-addr: 1.0.0.1@853#cloudflare-dns.com

The downside is that generating the extra forward-zone at runtime is a little more annoying, since the user might have changed the upstream forward-addr or even the healthcheck domain per #111. It also disables all caching for the healthcheck domain, which unnecessarily increases load when users actually query that domain during normal operation. And this is basically a "change the config for the test" situation, which can easily drift into inaccuracy. One benefit is that it would fix the problem even for people who have overridden the healthcheck command: it doesn't change the existing health check, it just makes it accurate.
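
To give an idea of what generating that stanza at runtime might involve, here is a minimal sketch. The function name and its arguments are hypothetical; how the image actually learns the healthcheck domain and upstream addresses is not shown here.

```shell
#!/bin/sh
# Sketch: print a cache-bypassing forward-zone for the healthcheck domain.
# Usage: emit_healthcheck_zone DOMAIN ADDR [ADDR...]
# (hypothetical helper, not part of the image)
emit_healthcheck_zone() {
    domain="$1"
    shift
    printf 'forward-zone:\n'
    # Strip one trailing dot if present, then add exactly one back.
    printf '  name: "%s."\n' "${domain%.}"
    printf '  forward-no-cache: yes\n'
    # If the addresses use DNS-over-TLS (port 853), the zone would also
    # need forward-tls-upstream: yes; that is left out here, as in the
    # snippet above.
    for addr in "$@"; do
        printf '  forward-addr: %s\n' "$addr"
    done
}
```

For example, emit_healthcheck_zone cloudflare.com '1.1.1.1@853#cloudflare-dns.com' '1.0.0.1@853#cloudflare-dns.com' prints the same stanza as above, and the output could be appended to the config before unbound starts.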

The other option is to 2) create a healthcheck script which clears the cache before performing the check, such as with unbound-control flush cloudflare.com.

The downside of this option is that remote-control: has to be enabled, and right now it is explicitly disabled in the default config. I also think that with this method the check should run less frequently, perhaps every minute or two. I will have to test the performance impact of regularly clearing a zone from the cache like that. It also will not work for people who have already overridden the HEALTHCHECK on their own. I still think I prefer this option.
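
A rough sketch of what such a script could look like is below. It assumes remote-control is enabled so that unbound-control can reach the server. The function name and the UNBOUND_CONTROL and DRILL override variables are made up here (the overrides exist only to make the script easy to test); checking the printed rcode instead of the exit status is also a choice made for this sketch.

```shell
#!/bin/sh
# Hypothetical healthcheck script: flush the cached entry, then resolve it again.
# Assumes remote-control is enabled in the unbound config.
# The command names can be overridden through the environment; that is an
# addition made for this sketch and is not part of the image.
UNBOUND_CONTROL="${UNBOUND_CONTROL:-unbound-control}"
DRILL="${DRILL:-drill}"

flush_then_check() {
    domain="${1:-cloudflare.com}"
    # Drop the cached records for the domain so the next query must go upstream.
    "$UNBOUND_CONTROL" flush "$domain" >/dev/null || return 1
    # Treat anything other than a NOERROR response as unhealthy.
    "$DRILL" @127.0.0.1 "$domain" | grep -q 'rcode: NOERROR'
}
```

The healthcheck command would then be something like flush_then_check cloudflare.com || exit 1. To run it less often, docker run accepts --health-interval (for example --health-interval=2m).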

I am in the process of experimenting with this, but don't have a concrete answer yet. Any ideas?

tnyeanderson, Sep 08 '22 03:09

Working on implementing option 2 from above here

It works for the first healthcheck after internet is cut off for the container, but then after that it serves an empty result with no failure code :(

First check with no internet:

Error: error sending query: Could not send or receive, because of network error

Second check with no internet (looks like an empty result is cached and is never flushed for some reason):

;; ->>HEADER<<- opcode: QUERY, rcode: SERVFAIL, id: 56901
;; flags: qr rd ra ; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0
;; QUESTION SECTION:
;; duckduckgo.com.  IN  A

;; ANSWER SECTION:

;; AUTHORITY SECTION:

;; ADDITIONAL SECTION:

;; Query time: 0 msec
;; SERVER: 127.0.0.1
;; WHEN: Thu Sep  8 04:35:51 2022
;; MSG SIZE  rcvd: 32
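
The second output suggests that drill still exits successfully when it receives a response, even if that response is SERVFAIL with no answers, so a check based only on the exit status would pass. One possible workaround, sketched here under that assumption, is to inspect the printed header instead. The function name is hypothetical.

```shell
#!/bin/sh
# Sketch: decide health from drill's printed header rather than its exit status.
# check_drill_output reads drill output on stdin and succeeds only when the
# rcode is NOERROR and the ANSWER count is at least 1. Empty input fails too.
check_drill_output() {
    awk '
        /->>HEADER<<-/ && /rcode: NOERROR/ { rcode_ok = 1 }
        /ANSWER: [0-9]+/ {
            # Extract the number that follows "ANSWER: " on the flags line.
            match($0, /ANSWER: [0-9]+/)
            answers = substr($0, RSTART + 8, RLENGTH - 8) + 0
        }
        END { exit !(rcode_ok && answers > 0) }
    '
}
```

It could be used as drill @127.0.0.1 cloudflare.com | check_drill_output || exit 1, which would reject the SERVFAIL response shown above.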

tnyeanderson, Sep 08 '22 16:09

Thanks for working on this. The default config, as set up, doesn't support unbound-control, which is likely why the cache doesn't flush.

MatthewVance, Sep 13 '22 00:09

Sorry, I should've explicitly mentioned that the example above is with remote-control enabled, which is why it works the first time the cache needs to be cleared. In other words: a successful healthcheck caches the upstream result, then the internet cuts out, then the next check clears the cache and fails to resolve (correctly failing the healthcheck), then the check after that should clear the cache again but instead returns an empty (cached?) result (falsely passing the healthcheck).

EDIT: to get a clearer diff, use diff -urN 1.16.{2,3}

tnyeanderson, Sep 13 '22 05:09