unbound icon indicating copy to clipboard operation
unbound copied to clipboard

unbound becoming unreliable and stops responding under certain conditions

Open haarp opened this issue 3 years ago • 1 comments

Hello

Describe the bug

I'm frequently seeing unbound enter a state in which it will not respond reliably anymore. Once in this state, it will not recover on its own. This seems to happen directly after a restart or config reload with an element of chance.

I suspect this might be related to using either forward-tls-upstream, or having around 50k local-data entries for blocking malware domains. I can observe this issue on two separate but fairly similar OPNsense installs.

Symptoms once this bug is triggered:

  • The same query (even for a local-data record!) is rarely answered immediately, and usually only after 20 seconds or not at all in a seemingly random pattern.
  • unbound-control hangs indefinitely.
  • The unbound process is using 0-2% CPU while constantly in the sbwait state, as seen by top. (a properly functioning unbound is seen in the kqread state
  • unbound will not react to SIGTERM

To reproduce Steps to reproduce the behavior:

Starting unbound with the config below seems to be enough. There's a roughly 1 in 10 chance it'll trigger the bug after starting (but of course, never when you try to reproduce it :( ). The random nature also make a process of eliminiation difficult.

Expected behavior Fast and reliable responses to queries and unbound-control

System:

  • Unbound version: 1.16.2
  • OS: OPNsense 22.7 (FreeBSD 13.1)
  • unbound -V output:
Version 1.16.2

Configure line: --with-libexpat=/usr/local --with-ssl=/usr/local --enable-dnscrypt --disable-dnstap --with-libnghttp2 --enable-ecdsa --disable-event-api --enable-gost --with-libevent --with-pythonmodule=yes --with-pyunbound=yes ac_cv_path_SWIG=/usr/local/bin/swig LDFLAGS=-L/usr/local/lib --disable-subnet --disable-tfo-client --disable-tfo-server --with-pthreads --prefix=/usr/local --localstatedir=/var --mandir=/usr/local/man --infodir=/usr/local/share/info/ --build=amd64-portbld-freebsd13.1
Linked libs: libevent 2.1.12-stable (it uses kqueue), OpenSSL 1.1.1q  5 Jul 2022
Linked modules: dns64 python respip validator iterator
DNSCrypt feature available

Additional information

Unbound config

unbound.conf:##########################
unbound.conf:# Unbound Configuration
unbound.conf:##########################
unbound.conf:##
unbound.conf:# Server configuration
unbound.conf:##
unbound.conf:server:
unbound.conf:chroot: /var/unbound
unbound.conf:username: unbound
unbound.conf:directory: /var/unbound
unbound.conf:pidfile: /var/run/unbound.pid
unbound.conf:root-hints: /var/unbound/root.hints
unbound.conf:use-syslog: yes
unbound.conf:port: 53
unbound.conf:verbosity: 1
unbound.conf:extended-statistics: no
unbound.conf:log-queries: no
unbound.conf:hide-identity: yes
unbound.conf:hide-version: yes
unbound.conf:harden-referral-path: no
unbound.conf:do-ip4: yes
unbound.conf:do-ip6: yes
unbound.conf:do-udp: yes
unbound.conf:do-tcp: yes
unbound.conf:do-daemonize: yes
unbound.conf:so-reuseport: yes
unbound.conf:module-config: "validator iterator"
unbound.conf:cache-max-ttl: 86400
unbound.conf:cache-min-ttl: 180
unbound.conf:harden-dnssec-stripped: no
unbound.conf:serve-expired: no
unbound.conf:outgoing-num-tcp: 10
unbound.conf:incoming-num-tcp: 10
unbound.conf:num-queries-per-thread: 4096
unbound.conf:outgoing-range: 8192
unbound.conf:infra-host-ttl: 900
unbound.conf:infra-cache-numhosts: 50000
unbound.conf:unwanted-reply-threshold: 0
unbound.conf:jostle-timeout: 200
unbound.conf:msg-cache-size: 250m
unbound.conf:rrset-cache-size: 500m
unbound.conf:num-threads: 4
unbound.conf:msg-cache-slabs: 8
unbound.conf:rrset-cache-slabs: 8
unbound.conf:infra-cache-slabs: 8
unbound.conf:key-cache-slabs: 8
unbound.conf:auto-trust-anchor-file: /var/unbound/root.key
unbound.conf:prefetch: yes
unbound.conf:prefetch-key: yes
unbound.conf:# Interface IP(s) to bind to
unbound.conf:interface: 0.0.0.0
unbound.conf:interface: ::
unbound.conf:interface-automatic: yes
unbound.conf:# DNS Rebinding
unbound.conf:# For DNS Rebinding prevention
unbound.conf:#
unbound.conf:# All these addresses are either private or should not be routable in the global IPv4 or IPv6 internet.
unbound.conf:#
unbound.conf:# IPv4 Addresses
unbound.conf:#
unbound.conf:private-address: 0.0.0.0/8       # Broadcast address
unbound.conf:private-address: 10.0.0.0/8
unbound.conf:private-address: 100.64.0.0/10
unbound.conf:private-address: 127.0.0.0/8     # Loopback Localhost
unbound.conf:private-address: 169.254.0.0/16
unbound.conf:private-address: 172.16.0.0/12
unbound.conf:private-address: 192.0.2.0/24    # Documentation network TEST-NET
unbound.conf:private-address: 192.168.0.0/16
unbound.conf:private-address: 198.18.0.0/15   # Used for testing inter-network communications
unbound.conf:private-address: 198.51.100.0/24 # Documentation network TEST-NET-2
unbound.conf:private-address: 203.0.113.0/24  # Documentation network TEST-NET-3
unbound.conf:private-address: 233.252.0.0/24  # Documentation network MCAST-TEST-NET
unbound.conf:#
unbound.conf:# IPv6 Addresses
unbound.conf:#
unbound.conf:private-address: ::1/128         # Loopback Localhost
unbound.conf:private-address: 2001:db8::/32   # Documentation network IPv6
unbound.conf:private-address: fc00::/8        # Unique local address (ULA) part of "fc00::/7", not defined yet
unbound.conf:private-address: fd00::/8        # Unique local address (ULA) part of "fc00::/7", "/48" prefix group
unbound.conf:private-address: fe80::/10       # Link-local address (LLA)
unbound.conf:# Private domains (DNS Rebinding)
unbound.conf:include: /var/unbound/private_domains.conf
unbound.conf:# Access lists
unbound.conf:include: /var/unbound/access_lists.conf
unbound.conf:# Static host entries
unbound.conf:include: /var/unbound/host_entries.conf
unbound.conf:# DHCP leases (if configured)
unbound.conf:include: /var/unbound/dhcpleases.conf
unbound.conf:# Custom includes
unbound.conf:include: /var/unbound/etc/*.conf
unbound.conf:remote-control:
unbound.conf:    control-enable: yes
unbound.conf:    control-interface: 127.0.0.1
unbound.conf:    control-port: 953
unbound.conf:    server-key-file: /var/unbound/unbound_server.key
unbound.conf:    server-cert-file: /var/unbound/unbound_server.pem
unbound.conf:    control-key-file: /var/unbound/unbound_control.key
unbound.conf:    control-cert-file: /var/unbound/unbound_control.pem
access_lists.conf:access-control: 127.0.0.1/8 allow
access_lists.conf:access-control: ::1/64 allow
access_lists.conf:access-control: 192.168.1.1/24 allow
access_lists.conf:access-control: 2001:*censored*/64 allow
access_lists.conf:access-control: fe80::*censored*/64 allow
access_lists.conf:access-control: 192.168.2.20/24 allow
access_lists.conf:access-control: 10.0.8.1/24 allow
access_lists.conf:access-control: *censored*/32 allow
access_lists.conf:access-control: *censored*/32 allow
access_lists.conf:access-control: 2001:*censored*/64 allow
dhcpleases.conf:local-data-ptr: "192.168.1.177 foo.domain.net"
dhcpleases.conf:local-data: "foo.domain.net IN A 192.168.1.177"
dhcpleases.conf:local-data-ptr: "192.168.1.100 bar.domain.net"
dhcpleases.conf:local-data: "bar.domain.net IN A 192.168.1.100"
dhcpleases.conf:local-data-ptr: "192.168.1.178 baz.domain.net"
dhcpleases.conf:local-data: "baz.domain.net IN A 192.168.1.178"
host_entries.conf:local-zone: "domain.net" transparent
host_entries.conf:local-data-ptr: "127.0.0.1 localhost"
host_entries.conf:local-data: "localhost A 127.0.0.1"
host_entries.conf:local-data: "localhost.domain.net A 127.0.0.1"
host_entries.conf:local-data-ptr: "::1 localhost"
host_entries.conf:local-data: "localhost AAAA ::1"
host_entries.conf:local-data: "localhost.domain.net AAAA ::1"
host_entries.conf:local-data-ptr: "192.168.1.1 OPNsense.domain.net"
host_entries.conf:local-data: "OPNsense.domain.net A 192.168.1.1"
host_entries.conf:local-data: "OPNsense A 192.168.1.1"
host_entries.conf:local-data-ptr: "2001:*censored* OPNsense.domain.net"
host_entries.conf:local-data: "OPNsense.domain.net AAAA 2001:*censored*"
host_entries.conf:local-data: "OPNsense AAAA 2001:*censored*"
host_entries.conf:local-data: "OPNsense.domain.net A 192.168.2.20"
host_entries.conf:local-data: "OPNsense A 192.168.2.20"
host_entries.conf:local-data: "OPNsense.domain.net A 10.0.8.1"
host_entries.conf:local-data: "OPNsense A 10.0.8.1"
host_entries.conf:local-data: "OPNsense.domain.net A *censored*"
host_entries.conf:local-data: "OPNsense A *censored*"
host_entries.conf:local-data: "OPNsense.domain.net A *censored*"
host_entries.conf:local-data: "OPNsense A *censored*"
host_entries.conf:local-data: "OPNsense.domain.net AAAA 2001:*censored*"
host_entries.conf:local-data: "OPNsense AAAA 2001:*censored*"
host_entries.conf:local-data-ptr: "151.106.9.30 archive.is"
host_entries.conf:local-data: "archive.is IN A 151.106.9.30"
host_entries.conf:local-data: "home.domain.net IN A 192.168.1.200"
host_entries.conf:local-data: "nextcloud.domain.net IN A 192.168.1.200"
host_entries.conf:local-data-ptr: "192.168.1.200 mach.domain.net"
host_entries.conf:local-data: "mach.domain.net IN A 192.168.1.200"
private_domains.conf:# Set private domains in case authoritative name server returns a Private IP address
etc/dot.conf:# Forward zones over TLS
etc/dot.conf:server:
etc/dot.conf:  tls-cert-bundle: /etc/ssl/cert.pem
etc/dot.conf:forward-zone:
etc/dot.conf:  name: "."
etc/dot.conf:  forward-tls-upstream: yes
etc/dot.conf:  forward-addr: 9.9.9.10@853#dns.quad9.net
etc/dot.conf:  forward-addr: 149.112.112.10@853#dns.quad9.net
etc/dot.conf:  forward-addr: 2620:fe::10@853#dns.quad9.net
etc/dot.conf:  forward-addr: 2620:fe::fe:10@853#dns.quad9.net
etc/miscellaneous.conf:server:
etc/dnsbl.conf:local-data: "scam.com A 0.0.0.0"
etc/dnsbl.conf:local-data: "malware.com A 0.0.0.0"
etc/dnsbl.conf:....
etc/dnsbl.conf:.... 50k more lines like this
Unbound log at log verbosity 1

Starting from when unbound started into a buggy condition. First timeouts were observed around 05:02, but might've been earlier too. Not much suspicious in here.

<164>1 2022-09-24T05:00:12+02:00 OPNsense.domain.net unbound 22595 - [meta sequenceId="1"] PTR record already exists for home.domain.net(192.168.1.200)
<164>1 2022-09-24T05:00:12+02:00 OPNsense.domain.net unbound 22595 - [meta sequenceId="2"] PTR record already exists for nextcloud.domain.net(192.168.1.200)
<30>1 2022-09-24T05:00:13+02:00 OPNsense.domain.net unbound 22675 - [meta sequenceId="3"] [22675:0] info: service stopped (unbound 1.16.2).
<30>1 2022-09-24T05:00:13+02:00 OPNsense.domain.net unbound 22675 - [meta sequenceId="4"] [22675:0] info: server stats for thread 0: 6171 queries, 4385 answers from cache, 1786 recursions, 172 prefetch, 0 rejected by ip ratelimiting
<30>1 2022-09-24T05:00:13+02:00 OPNsense.domain.net unbound 22675 - [meta sequenceId="5"] [22675:0] info: server stats for thread 0: requestlist max 11 avg 0.773749 exceeded 0 jostled 0
<30>1 2022-09-24T05:00:13+02:00 OPNsense.domain.net unbound 22675 - [meta sequenceId="6"] [22675:0] info: average recursion processing time 0.192220 sec
<30>1 2022-09-24T05:00:13+02:00 OPNsense.domain.net unbound 22675 - [meta sequenceId="7"] [22675:0] info: histogram of recursion processing times
<30>1 2022-09-24T05:00:13+02:00 OPNsense.domain.net unbound 22675 - [meta sequenceId="8"] [22675:0] info: [25%]=0.0485053 median[50%]=0.108102 [75%]=0.217228
<30>1 2022-09-24T05:00:13+02:00 OPNsense.domain.net unbound 22675 - [meta sequenceId="9"] [22675:0] info: lower(secs) upper(secs) recursions
<30>1 2022-09-24T05:00:13+02:00 OPNsense.domain.net unbound 22675 - [meta sequenceId="10"] [22675:0] info:    0.000000    0.000001 163
<30>1 2022-09-24T05:00:13+02:00 OPNsense.domain.net unbound 22675 - [meta sequenceId="11"] [22675:0] info:    0.001024    0.002048 1
<30>1 2022-09-24T05:00:13+02:00 OPNsense.domain.net unbound 22675 - [meta sequenceId="12"] [22675:0] info:    0.004096    0.008192 2
<30>1 2022-09-24T05:00:13+02:00 OPNsense.domain.net unbound 22675 - [meta sequenceId="13"] [22675:0] info:    0.008192    0.016384 5
<30>1 2022-09-24T05:00:13+02:00 OPNsense.domain.net unbound 22675 - [meta sequenceId="14"] [22675:0] info:    0.016384    0.032768 166
<30>1 2022-09-24T05:00:13+02:00 OPNsense.domain.net unbound 22675 - [meta sequenceId="15"] [22675:0] info:    0.032768    0.065536 228
<30>1 2022-09-24T05:00:13+02:00 OPNsense.domain.net unbound 22675 - [meta sequenceId="16"] [22675:0] info:    0.065536    0.131072 505
<30>1 2022-09-24T05:00:13+02:00 OPNsense.domain.net unbound 22675 - [meta sequenceId="17"] [22675:0] info:    0.131072    0.262144 410
<30>1 2022-09-24T05:00:13+02:00 OPNsense.domain.net unbound 22675 - [meta sequenceId="18"] [22675:0] info:    0.262144    0.524288 202
<30>1 2022-09-24T05:00:13+02:00 OPNsense.domain.net unbound 22675 - [meta sequenceId="19"] [22675:0] info:    0.524288    1.000000 66
<30>1 2022-09-24T05:00:13+02:00 OPNsense.domain.net unbound 22675 - [meta sequenceId="20"] [22675:0] info:    1.000000    2.000000 15
<30>1 2022-09-24T05:00:13+02:00 OPNsense.domain.net unbound 22675 - [meta sequenceId="21"] [22675:0] info:    2.000000    4.000000 23
<30>1 2022-09-24T05:00:13+02:00 OPNsense.domain.net unbound 22675 - [meta sequenceId="22"] [22675:0] info: server stats for thread 1: 6309 queries, 4507 answers from cache, 1802 recursions, 201 prefetch, 0 rejected by ip ratelimiting
<30>1 2022-09-24T05:00:13+02:00 OPNsense.domain.net unbound 22675 - [meta sequenceId="23"] [22675:0] info: server stats for thread 1: requestlist max 9 avg 0.651523 exceeded 0 jostled 0
<30>1 2022-09-24T05:00:13+02:00 OPNsense.domain.net unbound 22675 - [meta sequenceId="24"] [22675:0] info: average recursion processing time 0.179294 sec
<30>1 2022-09-24T05:00:13+02:00 OPNsense.domain.net unbound 22675 - [meta sequenceId="25"] [22675:0] info: histogram of recursion processing times
<30>1 2022-09-24T05:00:13+02:00 OPNsense.domain.net unbound 22675 - [meta sequenceId="26"] [22675:0] info: [25%]=0.0454248 median[50%]=0.105229 [75%]=0.211412
<30>1 2022-09-24T05:00:13+02:00 OPNsense.domain.net unbound 22675 - [meta sequenceId="27"] [22675:0] info: lower(secs) upper(secs) recursions
<30>1 2022-09-24T05:00:13+02:00 OPNsense.domain.net unbound 22675 - [meta sequenceId="28"] [22675:0] info:    0.000000    0.000001 156
<30>1 2022-09-24T05:00:13+02:00 OPNsense.domain.net unbound 22675 - [meta sequenceId="29"] [22675:0] info:    0.002048    0.004096 1
<30>1 2022-09-24T05:00:13+02:00 OPNsense.domain.net unbound 22675 - [meta sequenceId="30"] [22675:0] info:    0.004096    0.008192 4
<30>1 2022-09-24T05:00:13+02:00 OPNsense.domain.net unbound 22675 - [meta sequenceId="31"] [22675:0] info:    0.008192    0.016384 3
<30>1 2022-09-24T05:00:13+02:00 OPNsense.domain.net unbound 22675 - [meta sequenceId="32"] [22675:0] info:    0.016384    0.032768 205
<30>1 2022-09-24T05:00:13+02:00 OPNsense.domain.net unbound 22675 - [meta sequenceId="33"] [22675:0] info:    0.032768    0.065536 211
<30>1 2022-09-24T05:00:13+02:00 OPNsense.domain.net unbound 22675 - [meta sequenceId="34"] [22675:0] info:    0.065536    0.131072 530
<30>1 2022-09-24T05:00:13+02:00 OPNsense.domain.net unbound 22675 - [meta sequenceId="35"] [22675:0] info:    0.131072    0.262144 394
<30>1 2022-09-24T05:00:13+02:00 OPNsense.domain.net unbound 22675 - [meta sequenceId="36"] [22675:0] info:    0.262144    0.524288 223
<30>1 2022-09-24T05:00:13+02:00 OPNsense.domain.net unbound 22675 - [meta sequenceId="37"] [22675:0] info:    0.524288    1.000000 47
<30>1 2022-09-24T05:00:13+02:00 OPNsense.domain.net unbound 22675 - [meta sequenceId="38"] [22675:0] info:    1.000000    2.000000 9
<30>1 2022-09-24T05:00:13+02:00 OPNsense.domain.net unbound 22675 - [meta sequenceId="39"] [22675:0] info:    2.000000    4.000000 19
<30>1 2022-09-24T05:00:13+02:00 OPNsense.domain.net unbound 22675 - [meta sequenceId="40"] [22675:0] info: server stats for thread 2: 6199 queries, 4477 answers from cache, 1722 recursions, 244 prefetch, 0 rejected by ip ratelimiting
<30>1 2022-09-24T05:00:13+02:00 OPNsense.domain.net unbound 22675 - [meta sequenceId="41"] [22675:0] info: server stats for thread 2: requestlist max 10 avg 0.630214 exceeded 0 jostled 0
<30>1 2022-09-24T05:00:13+02:00 OPNsense.domain.net unbound 22675 - [meta sequenceId="42"] [22675:0] info: average recursion processing time 0.173002 sec
<30>1 2022-09-24T05:00:13+02:00 OPNsense.domain.net unbound 22675 - [meta sequenceId="43"] [22675:0] info: histogram of recursion processing times
<30>1 2022-09-24T05:00:13+02:00 OPNsense.domain.net unbound 22675 - [meta sequenceId="44"] [22675:0] info: [25%]=0.0458928 median[50%]=0.107432 [75%]=0.215045
<30>1 2022-09-24T05:00:13+02:00 OPNsense.domain.net unbound 22675 - [meta sequenceId="45"] [22675:0] info: lower(secs) upper(secs) recursions
<30>1 2022-09-24T05:00:13+02:00 OPNsense.domain.net unbound 22675 - [meta sequenceId="46"] [22675:0] info:    0.000000    0.000001 166
<30>1 2022-09-24T05:00:13+02:00 OPNsense.domain.net unbound 22675 - [meta sequenceId="47"] [22675:0] info:    0.001024    0.002048 1
<30>1 2022-09-24T05:00:13+02:00 OPNsense.domain.net unbound 22675 - [meta sequenceId="48"] [22675:0] info:    0.008192    0.016384 5
<30>1 2022-09-24T05:00:13+02:00 OPNsense.domain.net unbound 22675 - [meta sequenceId="49"] [22675:0] info:    0.016384    0.032768 184
<30>1 2022-09-24T05:00:13+02:00 OPNsense.domain.net unbound 22675 - [meta sequenceId="50"] [22675:0] info:    0.032768    0.065536 186
<30>1 2022-09-24T05:00:13+02:00 OPNsense.domain.net unbound 22675 - [meta sequenceId="51"] [22675:0] info:    0.065536    0.131072 499
<30>1 2022-09-24T05:00:13+02:00 OPNsense.domain.net unbound 22675 - [meta sequenceId="52"] [22675:0] info:    0.131072    0.262144 391
<30>1 2022-09-24T05:00:13+02:00 OPNsense.domain.net unbound 22675 - [meta sequenceId="53"] [22675:0] info:    0.262144    0.524288 225
<30>1 2022-09-24T05:00:13+02:00 OPNsense.domain.net unbound 22675 - [meta sequenceId="54"] [22675:0] info:    0.524288    1.000000 41
<30>1 2022-09-24T05:00:13+02:00 OPNsense.domain.net unbound 22675 - [meta sequenceId="55"] [22675:0] info:    1.000000    2.000000 10
<30>1 2022-09-24T05:00:13+02:00 OPNsense.domain.net unbound 22675 - [meta sequenceId="56"] [22675:0] info:    2.000000    4.000000 14
<30>1 2022-09-24T05:00:13+02:00 OPNsense.domain.net unbound 22675 - [meta sequenceId="57"] [22675:0] info: server stats for thread 3: 6212 queries, 4441 answers from cache, 1771 recursions, 224 prefetch, 0 rejected by ip ratelimiting
<30>1 2022-09-24T05:00:13+02:00 OPNsense.domain.net unbound 22675 - [meta sequenceId="58"] [22675:0] info: server stats for thread 3: requestlist max 9 avg 0.703759 exceeded 0 jostled 0
<30>1 2022-09-24T05:00:13+02:00 OPNsense.domain.net unbound 22675 - [meta sequenceId="59"] [22675:0] info: average recursion processing time 0.199373 sec
<30>1 2022-09-24T05:00:13+02:00 OPNsense.domain.net unbound 22675 - [meta sequenceId="60"] [22675:0] info: histogram of recursion processing times
<30>1 2022-09-24T05:00:13+02:00 OPNsense.domain.net unbound 22675 - [meta sequenceId="61"] [22675:0] info: [25%]=0.0517788 median[50%]=0.111778 [75%]=0.229123
<30>1 2022-09-24T05:00:13+02:00 OPNsense.domain.net unbound 22675 - [meta sequenceId="62"] [22675:0] info: lower(secs) upper(secs) recursions
<30>1 2022-09-24T05:00:13+02:00 OPNsense.domain.net unbound 22675 - [meta sequenceId="63"] [22675:0] info:    0.000000    0.000001 153
<30>1 2022-09-24T05:00:13+02:00 OPNsense.domain.net unbound 22675 - [meta sequenceId="64"] [22675:0] info:    0.001024    0.002048 1
<30>1 2022-09-24T05:00:13+02:00 OPNsense.domain.net unbound 22675 - [meta sequenceId="65"] [22675:0] info:    0.008192    0.016384 6
<30>1 2022-09-24T05:00:13+02:00 OPNsense.domain.net unbound 22675 - [meta sequenceId="66"] [22675:0] info:    0.016384    0.032768 176
<30>1 2022-09-24T05:00:13+02:00 OPNsense.domain.net unbound 22675 - [meta sequenceId="67"] [22675:0] info:    0.032768    0.065536 184
<30>1 2022-09-24T05:00:13+02:00 OPNsense.domain.net unbound 22675 - [meta sequenceId="68"] [22675:0] info:    0.065536    0.131072 518
<30>1 2022-09-24T05:00:13+02:00 OPNsense.domain.net unbound 22675 - [meta sequenceId="69"] [22675:0] info:    0.131072    0.262144 388
<30>1 2022-09-24T05:00:13+02:00 OPNsense.domain.net unbound 22675 - [meta sequenceId="70"] [22675:0] info:    0.262144    0.524288 241
<30>1 2022-09-24T05:00:13+02:00 OPNsense.domain.net unbound 22675 - [meta sequenceId="71"] [22675:0] info:    0.524288    1.000000 68
<30>1 2022-09-24T05:00:13+02:00 OPNsense.domain.net unbound 22675 - [meta sequenceId="72"] [22675:0] info:    1.000000    2.000000 17
<30>1 2022-09-24T05:00:13+02:00 OPNsense.domain.net unbound 22675 - [meta sequenceId="73"] [22675:0] info:    2.000000    4.000000 17
<30>1 2022-09-24T05:00:13+02:00 OPNsense.domain.net unbound 22675 - [meta sequenceId="74"] [22675:0] info:    4.000000    8.000000 2
<29>1 2022-09-24T05:00:20+02:00 OPNsense.domain.net unbound 79082 - [meta sequenceId="75"] [79082:0] notice: init module 0: validator
<29>1 2022-09-24T05:00:20+02:00 OPNsense.domain.net unbound 79082 - [meta sequenceId="76"] [79082:0] notice: init module 1: iterator
<30>1 2022-09-24T05:00:20+02:00 OPNsense.domain.net unbound 79082 - [meta sequenceId="77"] [79082:0] info: start of service (unbound 1.16.2).
<30>1 2022-09-24T05:00:20+02:00 OPNsense.domain.net unbound 79082 - [meta sequenceId="78"] [79082:3] info: generate keytag query _ta-4f66. NULL IN
<30>1 2022-09-24T05:00:20+02:00 OPNsense.domain.net unbound 79082 - [meta sequenceId="79"] [79082:1] info: generate keytag query _ta-4f66. NULL IN
<164>1 2022-09-24T05:00:21+02:00 OPNsense.domain.net unbound 93326 - [meta sequenceId="80"] PTR record already exists for home.domain.net(192.168.1.200)
<164>1 2022-09-24T05:00:21+02:00 OPNsense.domain.net unbound 93326 - [meta sequenceId="81"] PTR record already exists for nextcloud.domain.net(192.168.1.200)
<164>1 2022-09-24T05:02:19+02:00 OPNsense.domain.net unbound 73584 - [meta sequenceId="1"] PTR record already exists for home.domain.net(192.168.1.200)
<164>1 2022-09-24T05:02:19+02:00 OPNsense.domain.net unbound 73584 - [meta sequenceId="2"] PTR record already exists for nextcloud.domain.net(192.168.1.200)
<164>1 2022-09-24T05:02:39+02:00 OPNsense.domain.net unbound 89805 - [meta sequenceId="3"] PTR record already exists for home.domain.net(192.168.1.200)
<164>1 2022-09-24T05:02:39+02:00 OPNsense.domain.net unbound 89805 - [meta sequenceId="4"] PTR record already exists for nextcloud.domain.net(192.168.1.200)
<30>1 2022-09-24T10:04:54+02:00 OPNsense.domain.net unbound 79082 - [meta sequenceId="1"] [79082:3] info: generate keytag query _ta-4f66. NULL IN

I will next try starting into higher log verbosities. Right now it's sitll running in the buggy state, so if anyone wants me to check something, let me know asap.

Also see opnsense/core#6041

Thanks!

haarp avatar Sep 24 '22 14:09 haarp

I can't pinpoint to anything at this moment but this does sound like something that was addressed recently (blocking state of sockets). Could you try using the current master branch and see if it is still happening?

gthess avatar Sep 26 '22 08:09 gthess

Thanks, @gthess . unbound as of e93c75a has been running stable so far, however due the random nature of this bug this doesn't mean anything. I'll have to observe it a few more weeks to be sure.

haarp avatar Sep 30 '22 11:09 haarp

Thanks, keep us posted!

gthess avatar Oct 03 '22 08:10 gthess

Today the bug triggered again on e93c75a. So it's still present :(

Is there anything I can do to help diagnose once it triggers? Even at log verbosity 3 I'm not seeing anything useful.

haarp avatar Oct 06 '22 10:10 haarp

Verbosity 4 logs almost everything. Could you try? It would impact performance though if that is a concern.

gthess avatar Oct 07 '22 14:10 gthess

I finally managed to to reach the bugged state during a log verbosity = 4 run and collected some logs. I don't have an exact timestamp for when it started bugging out, and the full log is way too large to search for anomalies by hand. So I'm attaching a small excerpt instead.

Conditions:

  • Log verbosity 4
  • Bug was triggered roughly 15 minutes before the excerpt was taken
  • For diagnostics purposes, I was querying the same domain 10x per second, waiting for a reply inbetween (that's these frequent info: receive_udp on interface: 192.168.1.1). In the bugged state, unbound will answer a few of those, then stall for a while, then answer a few again, etc.

The log shows no reaction for over 10 seconds at a time, e.g. between line 4381 and 4382, despite my query waiting for a reply. Occasionally, some errors can be seen, such as tcp error for address ip4 149.112.112.10 port 853 (len 16). I can't make much of this log. Maybe you can see more?

unbound-bugged.log.gz

Cheers!

haarp avatar Oct 12 '22 23:10 haarp

From a first look I don't see anything Unbound-related standing out. The tcp error for address ... messages seem fine to me; probably the upstream is closing the connection and Unbound takes appropriate action.

What stands out to me though is that you mention you issue 10x the same query per second. When Unbound sees that you can see that it replies straight from the cache by the following consecutive log lines:

info: receive_udp on interface: 192.168.1.1  # This is the incoming query
debug: using localzone . transparent         # There is a localzone configuration
info: send_udp over interface: 192.168.1.1   # The answer is sent back with no further explicit information; cache hit.

Three things of notice:

  1. I don't see 10x received queries in Unbound each second, even during times of Unbound "idleness" (not logging anything else), and
  2. The more than 10 second delay you mention on line 4381, Unbound does not seem to be doing anything in particular before the gap, and
  3. Queries are arriving on the 192.168.1.1 interface; not local.

These make me believe (for lack of noticing something more specific) that there may be a problem with the network from the testing machine to Unbound.

Can I suggest that you repeat your test exactly the same way but also have a client running on the same Unbound machine that will ask the same questions on 127.0.0.1? Then if the problem is still with Unbound both those test clients should see the same unresponsiveness. Does this makes sense? Or I misjudged your environment completely?

gthess avatar Oct 14 '22 19:10 gthess

Thanks for having a look!

  1. I don't see 10x received queries in Unbound each second, even during times of Unbound "idleness" (not logging anything else)
  2. The more than 10 second delay you mention on line 4381, Unbound does not seem to be doing anything in particular before the gap, and

This is correct, as my query tool (just dig in a loop) will wait for a response before sending the next query. unbound sometimes instantly answers several queries in a row, and sometimes stalls. dig times out after 15 seconds, so at worst there are 15s between each submitted queries, and at best 0.1s.

unbound was indeed not doing anything during the log gap, i.e. was somehow completely stalled.

These make me believe (for lack of noticing something more specific) that there may be a problem with the network from the testing machine to Unbound.

No, I don't think this is the case, because:

  • Communication between both machines is responsive, I am SSHed into the unbound machine to retrieve the logs for example. Only DNS queries are affected.
  • In the bugged state, unbound will not react to SIGTERM or to unbound-control (from within the local machine), which indicates that the problem is within the unbound process. The process also mostly stays in the sbwait instead of kqread state.
  • Local queries from within the same machine show the same behavior

haarp avatar Oct 15 '22 20:10 haarp

I'm also seeing unbound randomly taking way too long to reply to queries, and sometimes it straight up fails and needs a restart.

I'm running unbound + stubby on Archlinux, and there's a huge list of local-data entries to block ad domains. This only started happening after I added the ad domains blocklist.

txtsd avatar Oct 22 '22 00:10 txtsd

and there's a huge list of local-data entries to block ad domains. This only started happening after I added the ad domains blocklist.

Be sure to give unbound time to load the list during start. This can take a while. The startup behavior (queries answered slowly/time out) is very similar to this bug, but unlike this bug, unbound will recover after it finishes loading the list.

haarp avatar Oct 22 '22 09:10 haarp

It doesn't tend to happen right after a restart, but at least 30 minutes later. The only solution to get it working again is for me to restart it.

txtsd avatar Oct 22 '22 10:10 txtsd

Interesting. Does it react to unbound-control in this state? Does SIGTERM work or do you need to SIGKILL it?

haarp avatar Oct 22 '22 10:10 haarp

Interesting. Does it react to unbound-control in this state? Does SIGTERM work or do you need to SIGKILL it?

I will try when it happens again. I usually just systemctl restart unbound.service.

txtsd avatar Oct 22 '22 12:10 txtsd

Same here on opnsense 22.7.6 with unbound 1.16.3 on FreeBSD:13:amd64. It was hanging a couple of weeks ago, but I just restarted the system. Now unbound is not reacting again. unbound-control seems to hang as I don't see any stats in the opnsense webui and several unbound-control processes seem to hang:

ps -aux | grep unbound
root             4517   0.0  0.0   13504  2040  -  I    07:04       0:00.03 /bin/sh /usr/local/opnsense/scripts/unbound/cache.sh dump
root             5069   0.0  0.3   23144 12204  -  I    07:04       0:00.03 /usr/local/sbin/unbound-control -c /var/unbound/unbound.conf dump_cache
root             6546   0.0  0.0   13504  2032  -  I    Fri07       0:00.03 /bin/sh /usr/local/opnsense/scripts/unbound/cache.sh dump
root             7211   0.0  0.3   23144 12204  -  I    Fri07       0:00.03 /usr/local/sbin/unbound-control -c /var/unbound/unbound.conf dump_cache
root            12031   0.0  0.1   13504  2124  -  I    13:37       0:00.04 /bin/sh /usr/local/opnsense/scripts/unbound/cache.sh dump
root            12785   0.0  0.3   23144 12368  -  I    13:37       0:00.03 /usr/local/sbin/unbound-control -c /var/unbound/unbound.conf dump_cache
root            25724   0.0  0.3   24000 12640  -  I    13:33       0:00.53 /usr/local/bin/python3 /usr/local/opnsense/scripts/unbound/wrapper.py -s (python3.9)
root            29603   0.0  0.3   23144 12368  -  I    13:33       0:00.04 /usr/local/sbin/unbound-control -c /var/unbound/unbound.conf stats_noreset
root            35130   0.0  0.3   24000 12640  -  I    13:34       0:00.47 /usr/local/bin/python3 /usr/local/opnsense/scripts/unbound/wrapper.py -s (python3.9)
root            37134   0.0  0.3   23144 12368  -  I    13:34       0:00.04 /usr/local/sbin/unbound-control -c /var/unbound/unbound.conf stats_noreset
root            50210   0.0  0.0   13504  2032  -  I    07:02       0:00.02 /bin/sh /usr/local/opnsense/scripts/unbound/cache.sh dump
root            51037   0.0  0.3   23144 12204  -  I    07:02       0:00.03 /usr/local/sbin/unbound-control -c /var/unbound/unbound.conf dump_cache
root            77647   0.0  0.0   13504  2032  -  I    07:04       0:00.04 /bin/sh /usr/local/opnsense/scripts/unbound/cache.sh dump
root            78238   0.0  0.3   23144 12204  -  I    07:04       0:00.03 /usr/local/sbin/unbound-control -c /var/unbound/unbound.conf dump_cache
root            78702   0.0  0.3   24000 12640  -  I    13:34       0:00.58 /usr/local/bin/python3 /usr/local/opnsense/scripts/unbound/wrapper.py -s (python3.9)
root            82675   0.0  0.0   12900  1904  -  Is   Fri07       0:00.01 /usr/local/bin/flock -n -E 0 -o /tmp/unbound_start.lock /usr/local/opnsense/scripts/unbound/start.s
root            82714   0.0  0.0   13504  2036  -  I    Fri07       0:00.01 /bin/sh /usr/local/opnsense/scripts/unbound/start.sh
unbound         84767   0.0  1.2   90188 50712  -  Is   Fri07       0:57.31 /usr/local/sbin/unbound -c /var/unbound/unbound.conf
root            84889   0.0  0.0   13504  2032  -  I    Fri07       0:00.01 /bin/sh /usr/local/opnsense/scripts/unbound/cache.sh load
root            86870   0.0  0.3   23144 12344  -  I    Fri07       0:00.13 /usr/local/sbin/unbound-control -c /var/unbound/unbound.conf load_cache
root            87685   0.0  0.3   23144 12368  -  I    13:34       0:00.04 /usr/local/sbin/unbound-control -c /var/unbound/unbound.conf stats_noreset
root            89081   0.0  0.0   13504  2032  -  I    Fri07       0:00.03 /bin/sh /usr/local/opnsense/scripts/unbound/cache.sh dump
root            90404   0.0  0.3   23144 12204  -  I    Fri07       0:00.03 /usr/local/sbin/unbound-control -c /var/unbound/unbound.conf dump_cache
root            27236   0.0  0.0   12748  1952  0  S+   13:54       0:00.01 grep unbound

I have 38 local-data configured (so not as many as @haarp ) and the first time it failed I had 3 TLS forwarders configured. I switched this to a local dnscrypt running on the same system.

I tried to SIGTERM it, waited over 5 minutes and then SIGKILLed it:

 # kill -15 84767
 # ps -aux | grep 84767
unbound         84767   0.0  1.2   90188 50712  -  Is   Fri07       0:57.35 /usr/local/sbin/unbound -c /var/unbound/unbound.conf
root            13326   0.0  0.0   12748  1952  0  S+   13:57       0:00.01 grep 84767
 # ps -aux | grep 84767
unbound         84767   0.0  1.2   90188 50724  -  Is   Fri07       0:57.60 /usr/local/sbin/unbound -c /var/unbound/unbound.conf
root            60583   0.0  0.0   12748  1952  0  S+   14:03       0:00.01 grep 84767
 # kill -9 84767
 # ps -aux | grep 84767
root             2805   0.0  0.0   12748  1952  0  S+   14:04       0:00.01 grep 84767
 # ps -aux | grep unbound
unbound          5687   3.5  1.1   90188 44984  -  Ss   14:04       0:00.52 /usr/local/sbin/unbound -c /var/unbound/unbound.conf
root            78973   0.0  0.0   12748  1952  0  S+   14:04       0:00.01 grep unbound

All the unbound-control processes went away and the new process responds normal again. Let me know if/how I can gather more info now or when the next hang occurs (might take a couple of weeks again)

dicer avatar Oct 29 '22 12:10 dicer

Interesting. Call it a hunch, but I suspect TLS forwarding might be the culprit.

haarp avatar Nov 03 '22 20:11 haarp

Interesting. Does it react to unbound-control in this state? Does SIGTERM work or do you need to SIGKILL it?

unbound-control does reply in this state. Although loading cache entries into unbound is much slower when it happens. I have to dump cache (which is not slow), restart unbound, then load cache, and it functions normally again.

txtsd avatar Nov 04 '22 04:11 txtsd

unbound-control does reply in this state.

Hmm, seems like your case is different from this bug then.

haarp avatar Nov 04 '22 07:11 haarp

Hi @haarp, I am afraid I cannot reproduce it here. Wild guess, is OPNsense talking to Unbound? Via remote-control for example? If so, can you disable that and try again?

gthess avatar Nov 28 '22 10:11 gthess

Yes, they do communicate a lot. Opnsense starts unbound and interacts with it through unbound-control. It's possible that some of this contributes to whatever triggers this bug. There's no way of disabling it unfortunately.

I've been trying to reconstruct the startup process to see if I can trigger this bug and get a repro. It's tough. Is there a way to have unbound log the full control cmd? Currently it only shows e.g. info: control cmd: local_datas_remove but not the value.

haarp avatar Nov 29 '22 00:11 haarp

Unbound normally logs the full command. The local_datas_remove command is special along with (local_zones, local_zones_remove, local_datas, view_local_datas, view_local_datas_remove) which wait to read data from stdin. That is why there is no more logging about the data of the command afterwards. This is what may be hanging Unbound in your case. You could try and issue an unbound-control local_datas_remove command manually and see if you reproduce the same behavior you are describing. Of course the manually issued command in this case will just hang waiting for stdin.

gthess avatar Mar 17 '23 14:03 gthess

I have not seen this bug anymore in recent months on my firewall. opnSense 23.1.1, unbound 1.17.1

This might mean it's fixed, or that I or the distro have changed the conditions that triggered it. I any case, all is quiet for now. Feel free to close this at your discretion. Thanks for your time! :)

haarp avatar Mar 21 '23 14:03 haarp

Thanks for confirming! I am closing this but if you stumble upon it in the future feel free to reopen/tag me here.

gthess avatar Mar 21 '23 15:03 gthess

I'm seeing something very similar to this on opnsense 23.1.4_1 with unbound version 1.17.1_2. Specifically I have a trunk of 5 different vlans: DMZ, Management, GeneralInternet, IoT, and Guest.

Everything was working fine w/ just DMZ, Management and GeneralInternet and continues to stabilize if I limit myself to only those 3. As soon as I add either of the other 2 vlans I see frequent intermittent slow or failed DNS queries. It seems to happen on any vlan I am connected on, but the newly introduced networks seem to be worse.

I never see the failed queries show up in the unbound logs w/ level 3 debugging. I also see no indication that unbound has failed or anything.

helmespc avatar Mar 23 '23 01:03 helmespc

@helmespc Check for these 3 telltale signs:

  • unbound spends almost all its time in the sbwait state instead of kqread, as seen by top
  • unbound-control (e.g. unbound-control -c /var/unbound/unbound.conf status) never finishes/times out
  • unbound doesn't react to SIGTERM

haarp avatar Mar 23 '23 08:03 haarp

@helmespc Check for these 3 telltale signs:

* unbound spends almost all its time in the `sbwait` state instead of `kqread`, as seen by top

* unbound-control (e.g. `unbound-control -c /var/unbound/unbound.conf status`) never finishes/times out

* unbound doesn't react to SIGTERM

Thank you for the response, I tried all 3 of those items and everything seems to be nominal. Looks like whatever I'm seeing is something different...

helmespc avatar Mar 23 '23 17:03 helmespc