unbound icon indicating copy to clipboard operation
unbound copied to clipboard

crash due to rbtree_remove of a node never inserted

Open glebius opened this issue 3 years ago • 8 comments

Unbound crashes due to NULL pointer dereference. This definitely is associated with TCP requests, however my reading of the code says the bug is in the event library and is not specific to TCP.

#3  0x000000080115f531 in comm_point_stop_listening (c=0x801dd0000)
    at unbound/util/netevent.c:4107
#4  0x00000008011458fa in tcp_req_info_setup_listen (req=0x801c55100, req@entry=0x1)
    at unbound/services/listen_dnsport.c:1941
#5  0x0000000801145aaf in tcp_req_info_handle_readdone (req=<optimized out>)
    at unbound/services/listen_dnsport.c:2079
#6  0x00000008011619ca in tcp_callback_reader (c=c@entry=0x801dd0000)
    at unbound/util/netevent.c:1168
#7  0x000000080115f8bc in comm_point_tcp_handle_read (fd=fd@entry=0xd, c=c@entry=0x801dd0000, short_ok=short_ok@entry=0x0)
    at unbound/util/netevent.c:1758
#8  0x000000080115efe1 in comm_point_tcp_handle_callback (fd=0xd, event=<optimized out>, arg=0x801dd0000)
    at unbound/util/netevent.c:2152
#9  0x000000080115348c in handle_select (base=0x801c70780, wait=<optimized out>)
    at unbound/util/mini_event.c:220
#10 minievent_base_dispatch (base=0x801c70780) at unbound/util/mini_event.c:242
#11 0x000000000104077a in ub_event_base_dispatch (base=0x801ca10c0)
    at unbound/util/ub_event.c:280
#12 0x000000080115ce21 in comm_base_dispatch (b=<optimized out>)
    at unbound/util/netevent.c:256
#13 0x000000000104574e in worker_work (worker=<optimized out>)
    at unbound/daemon/worker.c:1940
#14 0x0000000001035563 in daemon_fork (daemon=<optimized out>, daemon@entry=0x801c1a000)
    at unbound/daemon/daemon.c:701
#15 0x0000000001040d4f in run_daemon (cfgfile=0x7fffffffeef7 "/unbound.conf", cmdline_verbose=0x0, debug_mode=0x0, need_pidfile=0x1)
    at unbound/daemon/unbound.c:736
#16 main (argc=<optimized out>, argv=<optimized out>) at unbound/daemon/unbound.c:837

Looking into the node we see it was not removed before, since removed node would have pointers set to <rbtree_null_node>.

(gdb) frame 0
(gdb) p *to_delete
$16 = {
  parent = 0x0,
  left = 0x0,
  right = 0x0,
  key = 0x801ca10c0,
  color = 0x0
}

Thus, node was never inserted into the tree.

The node insertion predicate is timeout pointer passed and EV_TIMEOUT flag set in event_add() at mini_event.c:311:

        if(tv && (ev->ev_events&EV_TIMEOUT)) {
#ifndef S_SPLINT_S
                struct timeval *now = ev->ev_base->time_tv;
                ev->ev_timeout.tv_sec = tv->tv_sec + now->tv_sec;
                ev->ev_timeout.tv_usec = tv->tv_usec + now->tv_usec;
                while(ev->ev_timeout.tv_usec >= 1000000) {
                        ev->ev_timeout.tv_usec -= 1000000;
                        ev->ev_timeout.tv_sec++;
                }
#endif
                (void)rbtree_insert(ev->ev_base->times, &ev->node);
        }

The node removal predicate is just the flag, see event_del() at mini_event.c:332.

So potentially we would create offending event if event_add() is ever called with NULL tv. This is possible in netevent.c:4165.

To reproduce We don't have a reproduce case for this. Happens not so often at a large fleet of machines.

Expected behavior Not crash!

System: OS: FreeBSD 14 local-unbound -V Version 1.13.1

Configure line: --with-ssl=/usr --with-libexpat=/usr --disable-dnscrypt --disable-dnstap --enable-ecdsa --disable-event-api --enable-gost --with-libevent --disable-subnet --disable-tfo-client --disable-tfo-server --with-pthreads--prefix=/usr --localstatedir=/var/unbound --mandir=/usr/share/man --build=freebsd Linked libs: mini-event internal (it uses select), OpenSSL 1.1.1l-freebsd 24 Aug 2021 Linked modules: dns64 respip validator iterator

BSD licensed, see LICENSE in source package for details. Report bugs to [email protected] or https://github.com/NLnetLabs/unbound/issues

glebius avatar Oct 25 '21 21:10 glebius

Hi there,

Thanks for reporting this! If this is TCP related then 1.13.2 could help since it solved some cases with random crashes on the TCP reuse code.

You mention that this is maybe related to event handling but I don't have the time to look closer atm; will do later today or tomorrow. From a quick look I see that you compiled with libevent support but unbound couldn't find it and instead uses the builtin mini-event. Another option to test is to actually use libevent and see if you could hit the same bug.

If you can reproduce it in the meantime, that would be great!

gthess avatar Oct 26 '21 11:10 gthess

I meant embedded minievent actually. Please take a look at the pull request, I think I have nailed the problem.

Can you please point at 1.13.2 commit that fixed random crashes in TCP reuse code?

glebius avatar Oct 26 '21 14:10 glebius

I will. Could you reproduce (or just suffer) the problem in the meantime and verified that the change solves that?

Sure! There were a bunch of them (some also part of 1.13.1):

  • https://github.com/NLnetLabs/unbound/commit/a8485d58ca33b9841841e552dc98230e401b5588
  • https://github.com/NLnetLabs/unbound/commit/1bdae426eee393cf6916683f17ddd9d23a747457
  • https://github.com/NLnetLabs/unbound/commit/1bdae426eee393cf6916683f17ddd9d23a747457
  • https://github.com/NLnetLabs/unbound/commit/7396eff7af10eb85bee277ad06e2858cfa6f553e
  • https://github.com/NLnetLabs/unbound/commit/ff6b527184b33ffe1e2b643db8a32fae8061fc5a
  • https://github.com/NLnetLabs/unbound/commit/f9ad139c08000b9c7ae67a9592b5e9cb505f1fab
  • https://github.com/NLnetLabs/unbound/commit/224571256d6e826d3a3609e37e0bab4137288acd
  • https://github.com/NLnetLabs/unbound/commit/5b4555954a36c11ba7568c0618175203230c74a1
  • https://github.com/NLnetLabs/unbound/commit/0c74d6abbaf6d3aee0475bc311e2c27bb2a6a171

gthess avatar Oct 26 '21 14:10 gthess

These mostly work on the reuse rbtree, and we have crash in timeout rbtree. Some, e.g. 7396eff we already have in our tree. We use the unbound that is supplied with FreeBSD.

glebius avatar Oct 26 '21 15:10 glebius

I had a closer look at this and tried to reproduce it (not successfully). Indeed this is different than the issues fixed previously. This seems to be for in the out-of-order processing for clients, whereas the previous fixes were for the stream reuse for upstreams.

Were you able to stumble upon it again or reproduce it? Could you provide frames 0-2 in your initial post and any logging at the time of the crash? Also configuration may help if you are able to post it.

Calling ub_event_add() with NULL timeval is fine. In that case the event does not have the UB_EV_TIMEOUT bit set from the caller.

I wouldn't want to use the fix in the PR since it changes the event bits in the add routine which is something unexpected to calling code. But, did it solve the issue for you?

gthess avatar Nov 01 '21 09:11 gthess

Frames 0-2. Not much info here...

(gdb) frame 0
#0  rbtree_delete (rbtree=0x801dc5880, key=key@entry=0x801ca10c0) at FreeBSD/contrib/unbound/util/rbtree.c:341
(gdb) info local
smright = 0x0
to_delete = 0x801ca10c0
child = <optimized out>
(gdb) p *to_delete
$1 = {
  parent = 0x0,
  left = 0x0,
  right = 0x0,
  key = 0x801ca10c0,
  color = 0x0
}
(gdb) frame 1
#1  0x0000000801153856 in minievent_del (ev=0x801ca10c0)at FreeBSD/contrib/unbound/util/mini_event.c:334
(gdb) info locals
No locals.
(gdb) p *ev
$3 = {
  node = {
    parent = 0x0,
    left = 0x0,
    right = 0x0,
    key = 0x801ca10c0,
    color = 0x0
  },
  added = 0x1,
  ev_base = 0x801c70780,
  ev_fd = 0xd,
  ev_events = 0x13,
  ev_timeout = {
    tv_sec = 0x61742a49,
    tv_usec = 0xeabf4
  },
  ev_callback = 0x80115eec0 <comm_point_tcp_handle_callback>,
  ev_arg = 0x801dd0000
}
(gdb) frame 2
#2  0x000000000104090a in ub_event_del (ev=0x801ca10c0) at FreeBSD/contrib/unbound/util/ub_event.c:395
(gdb) info locals
No locals.

glebius avatar Nov 01 '21 15:11 glebius

I now have only single core and don't have logs. But I've been told that within fleet several machines had unbound crashed. I yet can't confirm whether my patch helps or not. We have rather long release cycle, and patched version isn't yet widely deployed.

glebius avatar Nov 01 '21 15:11 glebius

I believe I solved the issue with the above commits. For me the problem was really solved when I fixed the potential loop in the comm_point->tcp_free list. I also included changes to clear the UB_EV_TIMEOUT bit on event creation because it is a nice to have fix.

I am closing the related PR but I am leaving this issue open; it would be good if we get feedback that these changes solve the issue for you as well.

gthess avatar Nov 05 '21 10:11 gthess

Closing this as inactive; the observed issue is already resolved.

gthess avatar Mar 17 '23 13:03 gthess