unbound
unbound copied to clipboard
unbound kind of "loops" (100% CPU time) and is no longer reponsive
Describe the bug Since version 1.20.0 unbound starts using 100% CPU time after a few hours, sometimes even 200% and more, and the DNS answers take a looong time. After some further time DNS requests don't get answered at all any more. A regular "kill" does not stop the process, it needs to be "kill -9". Attached you can see the output of "top" and also from "perf top -p <-pid-of-unbound>".
The first "unbound" entries in "perf top" show "lruhash_lookup" and "rbtree_find_less_equal".
The amount of DNS requests is limited using iptables and hashlimit (150 per/minute with a burst of 45).
This behaviour is also with the current master source, 1.20.1 right now.
To reproduce Steps to reproduce the behavior:
- see above.
Expected behavior Unbound should be responsive all the time, and not looping after a few hours (like every 6 to 12 hours).
System:
- Unbound version: 1.20.0 and 1.20.1 (current master)
- OS: Debian 12 on ARM64
unbound -Voutput:
Version 1.20.1
Configure line: --with-libevent Linked libs: libevent 2.1.12-stable (it uses epoll), OpenSSL 3.0.13 30 Jan 2024 Linked modules: dns64 respip validator iterator
Additional information Add any other information that you may have gathered about the issue here.
It should not loop like that; I would like to know what unbound is looping over. The perf says this is entirely within libevent and some anonymous functions, I assume inlined in libevent.
Is it possible to get an ordinary stack trace, with like gstack <pid> , maybe several times to catch different parts of the loop? Likely the lruhash lookup and rbtree find results are from the other threads, perhaps that have ordinary cache responses and lookups. It would be nice to be able to reproduce the issue, but I have no clue what is the cause of it.
Hi,
after I had a very similar situation with Prosody (XMPP server) running on the very same server box using 100% CPU time after a while, printing "too many open files" into the error log, I increased the "nofiles" entry in /etc/security/limit.conf, and unbound as well as Prosody are running fine since (ulimit was showing 1024, increased it to 50.000).
Cannot tell whether this this is related, but there does seem to be a connection somehow.
Kind regards, mistersixt.
A similliar issue, with Flame Graph attached
Hi,
after I had a very similar situation with Prosody (XMPP server) running on the very same server box using 100% CPU time after a while, printing "too many open files" into the error log, I increased the "nofiles" entry in /etc/security/limit.conf, and unbound as well as Prosody are running fine since (ulimit was showing 1024, increased it to 50.000).
Cannot tell whether this this is related, but there does seem to be a connection somehow.
Kind regards, mistersixt.
Increasing the ulimit did not help by the way, I still see the looping unbound from time to time. Anything you need me to do while the process is looping?
Is this maybe related to the issue some people are facing?
https://nlnetlabs.nl/projects/unbound/security-advisories/
I don't think this is CVE-2024-8508 related. The looping would happen in Unbound's own name compression functions and not libevent.
We encountered this in our staging environment after an upgrade to 1.21.1. The stats come out of unbound-control and are transformed into prometheus metrics via unbound_exporter. During the 100% CPU utilization windows, the affected unbound instances were unable to service most requests from our application.
Related, we received three segfaults. The hosts had sufficient memory available and no OOMs were detected.
2024-10-02T09:12:51.158852+00:00 host1 6 kernel: [10678360.758037] unbound[2303964]: segfault at 0 ip 00007f5bc644c1fe sp 00007f5bc55046d0 error 4 in libc-2.31.so[7f5bc63d4000+178000]
2024-10-03T09:22:42.548707+00:00 host2 6 kernel: [10765210.843670] unbound[4111983]: segfault at 1b0 ip 00007f503330715b sp 00007ffcc09e95c0 error 4 in libevent-2.1.so.7.0.0[7f50332f5000+34000]
2024-10-04T11:42:07.683781+00:00 host3 6 kernel: [10859577.868995] unbound[2419270]: segfault at 1b0 ip 00007fbcce7e615b sp 00007ffcf0d76cc0 error 4 in libevent-2.1.so.7.0.0[7fbcce7d4000+34000]
Build information:
./configure --prefix="${UNBOUND_PKG_DIR}/artifacts" \
--enable-static-exe \
--disable-shared \
--with-libevent \
--with-libnghttp2 \
--enable-tfo-client \
--enable-tfo-server
$ dpkg -l | grep libevent
ii libevent-2.1-7:amd64 2.1.11-stable-1 amd64 Asynchronous event notification library
ii libevent-core-2.1-7:amd64 2.1.11-stable-1 amd64 Asynchronous event notification library (core)
ii libevent-dev 2.1.11-stable-1 amd64 Asynchronous event notification library (development files)
ii libevent-extra-2.1-7:amd64 2.1.11-stable-1 amd64 Asynchronous event notification library (extra)
ii libevent-openssl-2.1-7:amd64 2.1.11-stable-1 amd64 Asynchronous event notification library (openssl)
ii libevent-pthreads-2.1-7:amd64 2.1.11-stable-1 amd64 Asynchronous event notification library (pthreads)
$ dpkg -l | grep openssl
ii libevent-openssl-2.1-7:amd64 2.1.11-stable-1 amd64 Asynchronous event notification library (openssl)
ii openssl 1.1.1f-1ubuntu2.23 amd64 Secure Sockets Layer toolkit - cryptographic utility
I also encountered a similar problem. Unbound has a worker thread hang on the evmap_io_active_ method, and after use gdb to check, struct event becomes a ring.
@wcawijngaards @gthess
unbound version: 1.17.1
libevent version:
perf:
gdb:
@wcawijngaards
Hi!
It looks like we started experiencing this issue about a week ago. We're using Unbound as a recursive DNS server on each virtual machine, which routes queries to different upstream DNS servers depending on the zone. Unfortunately, we won't be able to provide a reliable reproduction case since we have a fleet of 1,111 identical VMs, and instrumenting each one with diagnostic tools would be challenging. However, we've temporarily removed load from one of the affected VMs and can leave it untouched for the next few days. Are there any specific data points or diagnostics we could collect from the Unbound process that's stuck in this state that might help with debugging?
It would be nice to somehow have a means to reproduce the issue. So some sort of information from it that allows the problem to happen again. Like, it if was based on the query name that is looked up at that time. Or some other aspect of it.
A stack trace could confirm where it is looping. If that is also evmap_io_active_. If that is the case, and there is no reproduction, what feature or perhaps libevent library version, that can be used. Possibly using libevent, a different version of it, or 'libev' a libevent compatiable library, or maybe mini-event (if that can handle the server load), that could see if the issue is actually linked to libevent, and perhaps part of how Unbound uses it in some way, or another issue in Unbound. Since that function is in a dependency library, it may be useful to try to figure out in what way the dependency library is involved.
Are there recent log events that seem to provide information, specifically for reproducing the issue. If it happens every time, debugging is much easier, than guess work. Is there perhaps some config setting used on the VMs that promotes the problem occurrence?