Unbound 1.14.0 crashed by SIGSEGV in pending_udp_query
Unbound 1.14.0 occasionally crashes with a backtrace like this:
Program terminated with signal SIGSEGV, Segmentation fault.
#0 pending_cmp (key1=0x13f5af500, key2=0x0) at services/outside_network.c:103
103 services/outside_network.c: No such file or directory.
[Current thread is 1 (LWP 2416582)]
gdb-peda$ bt
#0 pending_cmp (key1=0x13f5af500, key2=0x0) at services/outside_network.c:103
#1 0x000000000281c5df in rbtree_insert (rbtree=0x118cdc9e0, data=data@entry=0x13f5af500) at util/rbtree.c:241
#2 0x00000000027df48f in select_id (outnet=0x101390c00, pend=0x13f5af500, packet=0x13455fd90) at services/outside_network.c:2134
#3 randomize_and_send_udp (pend=pend@entry=0x13f5af500, packet=packet@entry=0x13455fd90, timeout=timeout@entry=0xc8) at services/outside_network.c:2288
#4 0x00000000027df36a in pending_udp_query (sq=sq@entry=0x1874e2800, packet=0x0, packet@entry=0x13455fd90, timeout=0x5aac7001, cb=0x27e0ff0 <serviced_udp_callback>, cb_arg=cb_arg@entry=0x1874e2800)
at services/outside_network.c:2379
#5 0x00000000027e16bc in serviced_udp_send (sq=sq@entry=0x1874e2800, buff=buff@entry=0x13455fd90) at services/outside_network.c:2978
#6 0x00000000027e1b57 in outnet_serviced_query (outnet=<optimized out>, outnet@entry=0x101390c00, qinfo=<optimized out>, qinfo@entry=0x1c, flags=<optimized out>, flags@entry=0x0,
dnssec=<optimized out>, dnssec@entry=0x0, want_dnssec=<optimized out>, want_dnssec@entry=0x0, nocaps=<optimized out>, nocaps@entry=0x0, tcp_upstream=<optimized out>, ssl_upstream=<optimized out>,
tls_auth_name=<optimized out>, addr=<optimized out>, addrlen=<optimized out>, zone=<optimized out>, zonelen=<optimized out>, qstate=<optimized out>, callback=<optimized out>,
callback_arg=<optimized out>, buff=<optimized out>, env=<optimized out>) at services/outside_network.c:3501
#7 0x000000000279bec4 in worker_send_query (qinfo=<optimized out>, flags=<optimized out>, dnssec=<optimized out>, want_dnssec=0x281b400, nocaps=0x667c1100, addr=<optimized out>,
addrlen=<optimized out>, zone=<optimized out>, zonelen=<optimized out>, tcp_upstream=<optimized out>, ssl_upstream=<optimized out>, tls_auth_name=<optimized out>, q=<optimized out>)
at daemon/worker.c:2144
#8 0x00000000027b17d4 in processQueryTargets (qstate=<optimized out>, iq=0x14711c498, ie=<optimized out>, id=<optimized out>) at iterator/iterator.c:2690
#9 iter_handle (qstate=<optimized out>, iq=0x14711c498, ie=<optimized out>, id=<optimized out>) at iterator/iterator.c:3770
#10 0x00000000027d90dd in mesh_run (mesh=mesh@entry=0x1013a2e00, mstate=mstate@entry=0x14711c040, ev=ev@entry=module_event_new, e=e@entry=0x0) at services/mesh.c:1742
#11 0x00000000027d8ab8 in mesh_new_client (mesh=0x1013a2e00, qinfo=0x7f5b51ce7560, qinfo@entry=0x7f5b51ce7558, cinfo=<optimized out>, cinfo@entry=0x800000000000000, qflags=0x8b14, edns=0x7f5b51ce7590,
edns@entry=0x7f5b51ce7588, rep=rep@entry=0x7f5b51ce78e0, qid=0x8b14) at services/mesh.c:591
#12 0x0000000002799ba3 in worker_handle_request (c=<optimized out>, arg=<optimized out>, error=<optimized out>, repinfo=0x7f5b51ce78e0) at daemon/worker.c:1545
#13 0x00000000028176d7 in comm_point_udp_callback (fd=fd@entry=0xcd, event=<optimized out>, arg=<optimized out>, arg@entry=0x135ee8200) at util/netevent.c:784
#14 0x000000000288d43e in event_persist_closure (base=0x1254a3200, ev=0x1364c7180) at libs/libevent/event.c:1623
#15 event_process_active_single_queue (base=0x1254a3200, activeq=0x101955d50, max_to_process=max_to_process@entry=0x7fffffff, endtime=endtime@entry=0x0) at libs/libevent/event.c:1682
#16 0x000000000288a25c in event_process_active (base=0x1254a3200) at libs/libevent/event.c:1783
#17 event_base_loop (base=0x1254a3200, flags=<optimized out>, flags@entry=0x0) at libs/libevent/event.c:2006
#18 0x0000000002889c27 in event_base_dispatch (event_base=0x13f5af500) at libs/libevent/event.c:1817
#19 0x0000000002822aa5 in ub_event_base_dispatch (base=0x13f5af500) at util/ub_event.c:280
#20 0x0000000002816cfc in comm_base_dispatch (b=<optimized out>) at util/netevent.c:256
#21 0x000000000279bef9 in worker_work (worker=worker@entry=0x106f01800) at daemon/worker.c:2056
#22 0x000000000278cab1 in thread_start (arg=0x106f01800) at daemon/daemon.c:544
#23 0x00007f5b92edb6db in __gettimeofday@plt () from /lib/x86_64-linux-gnu/libpthread.so.0
#24 0x00007f5b51ce9700 in ?? ()
#25 0x00007f5b51ce9700 in ?? ()
#26 0x3880797b27297d58 in ?? ()
#27 0x00007f5b51ce7c00 in ?? ()
#28 0x0000000000000000 in ?? ()
Unfortunately, I don't have a reliable way to reproduce this.
Does anyone have an idea what could be going wrong? I can send additional info if needed. Thanks!
This is very likely a bug that was fixed in 1.15.0. That release included a series of fixes for segfaults like this one, so I think upgrading to a newer version of Unbound will likely resolve the issue.
Hi @wcawijngaards, are you referring to this PR? https://github.com/NLnetLabs/unbound/pull/612
In what scenarios does this issue occur? Can it still occur if I configure only a single worker thread?
Yes, that is one of the fixes; the release had more bug fixes. This crash could be caused by one of those issues, and it looks similar. So I think an upgrade likely fixes it, and it avoids fixing the same thing again.
That issue was infrequent, and it could also happen with one thread.
Hi @wcawijngaards, I'm new to Unbound, and I'm working with the Unbound code because we plan to build some features on top of it in the future. I'd like to ask about this PR (https://github.com/NLnetLabs/unbound/pull/612):
- Why are serviced_udp_send/serviced_tcp_send called from the serviced_timer_cb function?
- The sq->busy flag seems to fix a race condition between randomize_and_send_udp/pending_tcp_query and serviced_delete. Why is there a data race between these functions?
Hi @JiangHeng12138,
Since I authored the PR, I'll reply to your questions without getting into too much detail:
- So that network activity happens outside of the mesh state logic, as a separate event. This avoids a race condition that may happen while waiting for network IO.
- Because serviced_delete (also called from outnet_serviced_query_stop) may delete a serviced_query while that query is still waiting for network IO, at which point it should not be deleted.
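To make the second point more concrete, here is a minimal sketch of the general "busy flag with deferred delete" pattern. It is not Unbound's actual code: the `demo_query` struct, its `to_be_deleted` field, and all `demo_*` functions are hypothetical names for illustration, and the real `serviced_query` handling in services/outside_network.c is more involved. The idea is that a delete request arriving while a send is in progress is only recorded, and the object is freed once the send has finished.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for a serviced query; NOT Unbound's struct. */
struct demo_query {
    bool busy;          /* true while a send is in progress */
    bool to_be_deleted; /* a delete was requested while busy */
    int sends;          /* number of sends attempted */
};

/* Counts how many queries were actually freed (for illustration). */
static int demo_freed = 0;

/* Allocate a zero-initialized query; returns NULL on allocation failure. */
static struct demo_query *demo_query_create(void)
{
    return calloc(1, sizeof(struct demo_query));
}

/* Request deletion. If the query is busy, only mark it and return
 * false; the owner of the busy flag frees it later. Otherwise free
 * it now and return true. */
static bool demo_query_delete(struct demo_query *q)
{
    assert(q != NULL);
    if (q->busy) {
        q->to_be_deleted = true;
        return false;
    }
    free(q);
    demo_freed++;
    return true;
}

/* Simulated send. `delete_during_send` models code that asks for the
 * query to be deleted while the send is still running. Returns true
 * if the query is still valid afterwards, false if it was freed. */
static bool demo_query_send(struct demo_query *q, bool delete_during_send)
{
    q->busy = true;
    q->sends++;
    if (delete_during_send)
        demo_query_delete(q); /* deferred: q is busy */
    q->busy = false;
    if (q->to_be_deleted) {
        demo_query_delete(q); /* safe now: no longer busy */
        return false;
    }
    return true;
}
```

In this sketch, a caller that gets `false` back from `demo_query_send` must not touch the query again, since it has already been freed. Without the flag, the delete issued during the send would free the query while the send path still holds a pointer to it.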