unbound icon indicating copy to clipboard operation
unbound copied to clipboard

unbound kind of "loops" (100% CPU time) and is no longer reponsive

Open mistersixt opened this issue 1 year ago • 8 comments

perf-top-output top-with-threads

Describe the bug Since version 1.20.0 unbound starts using 100% CPU time after a few hours, sometimes even 200% and more, and the DNS answers take a looong time. After some further time DNS requests don't get answered at all any more. A regular "kill" does not stop the process, it needs to be "kill -9". Attached you can see the output of "top" and also from "perf top -p <-pid-of-unbound>".

The first "unbound" entries in "perf top" show "lruhash_lookup" and "rbtree_find_less_equal".

The amount of DNS requests is limited using iptables and hashlimit (150 per/minute with a burst of 45).

This behaviour is also with the current master source, 1.20.1 right now.

To reproduce Steps to reproduce the behavior:

  1. see above.

Expected behavior Unbound should be responsive all the time, and not looping after a few hours (like every 6 to 12 hours).

System:

  • Unbound version: 1.20.0 and 1.20.1 (current master)
  • OS: Debian 12 on ARM64
  • unbound -V output:

Version 1.20.1

Configure line: --with-libevent Linked libs: libevent 2.1.12-stable (it uses epoll), OpenSSL 3.0.13 30 Jan 2024 Linked modules: dns64 respip validator iterator

Additional information Add any other information that you may have gathered about the issue here.

mistersixt avatar Jul 26 '24 09:07 mistersixt

It should not loop like that; I would like to know what unbound is looping over. The perf says this is entirely within libevent and some anonymous functions, I assume inlined in libevent.

Is it possible to get an ordinary stack trace, with like gstack <pid> , maybe several times to catch different parts of the loop? Likely the lruhash lookup and rbtree find results are from the other threads, perhaps that have ordinary cache responses and lookups. It would be nice to be able to reproduce the issue, but I have no clue what is the cause of it.

wcawijngaards avatar Jul 26 '24 09:07 wcawijngaards

Hi,

after I had a very similar situation with Prosody (XMPP server) running on the very same server box using 100% CPU time after a while, printing "too many open files" into the error log, I increased the "nofiles" entry in /etc/security/limit.conf, and unbound as well as Prosody are running fine since (ulimit was showing 1024, increased it to 50.000).

Cannot tell whether this this is related, but there does seem to be a connection somehow.

Kind regards, mistersixt.

mistersixt avatar Aug 12 '24 13:08 mistersixt

A similliar issue, with Flame Graph attached unbound

wfyanmnm avatar Sep 12 '24 07:09 wfyanmnm

Hi,

after I had a very similar situation with Prosody (XMPP server) running on the very same server box using 100% CPU time after a while, printing "too many open files" into the error log, I increased the "nofiles" entry in /etc/security/limit.conf, and unbound as well as Prosody are running fine since (ulimit was showing 1024, increased it to 50.000).

Cannot tell whether this this is related, but there does seem to be a connection somehow.

Kind regards, mistersixt.

Increasing the ulimit did not help by the way, I still see the looping unbound from time to time. Anything you need me to do while the process is looping?

mistersixt avatar Sep 18 '24 06:09 mistersixt

Is this maybe related to the issue some people are facing?

https://nlnetlabs.nl/projects/unbound/security-advisories/

CVE-2024-8508

mistersixt avatar Oct 04 '24 05:10 mistersixt

I don't think this is CVE-2024-8508 related. The looping would happen in Unbound's own name compression functions and not libevent.

gthess avatar Oct 04 '24 08:10 gthess

We encountered this in our staging environment after an upgrade to 1.21.1. The stats come out of unbound-control and are transformed into prometheus metrics via unbound_exporter. During the 100% CPU utilization windows, the affected unbound instances were unable to service most requests from our application. Screenshot_2024-10-07_10-38-09 Screenshot_2024-10-07_10-38-23

Related, we received three segfaults. The hosts had sufficient memory available and no OOMs were detected.

2024-10-02T09:12:51.158852+00:00 host1 6 kernel: [10678360.758037] unbound[2303964]: segfault at 0 ip 00007f5bc644c1fe sp 00007f5bc55046d0 error 4 in libc-2.31.so[7f5bc63d4000+178000]
2024-10-03T09:22:42.548707+00:00 host2 6 kernel: [10765210.843670] unbound[4111983]: segfault at 1b0 ip 00007f503330715b sp 00007ffcc09e95c0 error 4 in libevent-2.1.so.7.0.0[7f50332f5000+34000]
2024-10-04T11:42:07.683781+00:00 host3 6 kernel: [10859577.868995] unbound[2419270]: segfault at 1b0 ip 00007fbcce7e615b sp 00007ffcf0d76cc0 error 4 in libevent-2.1.so.7.0.0[7fbcce7d4000+34000]

Build information:

./configure --prefix="${UNBOUND_PKG_DIR}/artifacts" \
  --enable-static-exe \
  --disable-shared \
  --with-libevent \
  --with-libnghttp2 \
  --enable-tfo-client \
  --enable-tfo-server


$ dpkg -l | grep libevent
ii  libevent-2.1-7:amd64                  2.1.11-stable-1                   amd64        Asynchronous event notification library
ii  libevent-core-2.1-7:amd64             2.1.11-stable-1                   amd64        Asynchronous event notification library (core)
ii  libevent-dev                          2.1.11-stable-1                   amd64        Asynchronous event notification library (development files)
ii  libevent-extra-2.1-7:amd64            2.1.11-stable-1                   amd64        Asynchronous event notification library (extra)
ii  libevent-openssl-2.1-7:amd64          2.1.11-stable-1                   amd64        Asynchronous event notification library (openssl)
ii  libevent-pthreads-2.1-7:amd64         2.1.11-stable-1                   amd64        Asynchronous event notification library (pthreads)

$ dpkg -l | grep openssl
ii  libevent-openssl-2.1-7:amd64          2.1.11-stable-1                   amd64        Asynchronous event notification library (openssl)
ii  openssl                               1.1.1f-1ubuntu2.23                amd64        Secure Sockets Layer toolkit - cryptographic utility

pgporada avatar Oct 07 '24 17:10 pgporada

I also encountered a similar problem. Unbound has a worker thread hang on the evmap_io_active_ method, and after use gdb to check, struct event becomes a ring.

@wcawijngaards @gthess

unbound version: 1.17.1 libevent version: Image

perf: Image

gdb: Image

wenxuan70 avatar May 27 '25 13:05 wenxuan70

@wcawijngaards

Hi!

It looks like we started experiencing this issue about a week ago. We're using Unbound as a recursive DNS server on each virtual machine, which routes queries to different upstream DNS servers depending on the zone. Unfortunately, we won't be able to provide a reliable reproduction case since we have a fleet of 1,111 identical VMs, and instrumenting each one with diagnostic tools would be challenging. However, we've temporarily removed load from one of the affected VMs and can leave it untouched for the next few days. Are there any specific data points or diagnostics we could collect from the Unbound process that's stuck in this state that might help with debugging?

kemko avatar Aug 03 '25 11:08 kemko

It would be nice to somehow have a means to reproduce the issue. So some sort of information from it that allows the problem to happen again. Like, it if was based on the query name that is looked up at that time. Or some other aspect of it.

A stack trace could confirm where it is looping. If that is also evmap_io_active_. If that is the case, and there is no reproduction, what feature or perhaps libevent library version, that can be used. Possibly using libevent, a different version of it, or 'libev' a libevent compatiable library, or maybe mini-event (if that can handle the server load), that could see if the issue is actually linked to libevent, and perhaps part of how Unbound uses it in some way, or another issue in Unbound. Since that function is in a dependency library, it may be useful to try to figure out in what way the dependency library is involved.

Are there recent log events that seem to provide information, specifically for reproducing the issue. If it happens every time, debugging is much easier, than guess work. Is there perhaps some config setting used on the VMs that promotes the problem occurrence?

wcawijngaards avatar Aug 04 '25 08:08 wcawijngaards