unbound
An Unbound worker thread hangs in the evmap_io_active_ function; inspecting it with gdb shows that the struct event list has become a ring.
Describe the bug An Unbound worker thread hangs in the evmap_io_active_ function; inspecting it with gdb shows that the struct event list has become a ring.
The event belongs to a normal TCP query.
unbound version: 1.17.1
libevent version:
perf:
gdb:
Expected behavior CPU utilization 100%
System:
- Unbound version: 1.17.1
- OS:
unbound -V output:
Since I cannot reproduce this, it comes down to either finding a way to reproduce it or gathering more information. The hang occurs inside the libevent library. Is there some way to reproduce it that I can try too? Perhaps it only happens for the issue reporter.
As for trying to fix it, somewhat at random: the Unbound and libevent versions in use are a couple of years old, and newer versions exist. I did not spot a recent change in Unbound on this topic, but there have been similar issues, and those were caused by missing initialization before use or by reuse of event structures. The libevent changelog mentions fixes for race conditions in event_active and event_del, and an endless-loop fix for evmap, though that may not be precisely the bug shown in the debug output. It may still help to try the latest libevent, 2.1.12-stable (from 2020) or 2.2.1 alpha (from 2023), and the latest Unbound version. The question is whether the bug is present there too. It is fairly likely that the bug persists, but it is good to rule this out.
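To illustrate why reuse of an event structure is a plausible suspect, here is a minimal sketch of how linking the same node into a list twice turns the list into a ring. The `struct node`, `push_head`, `count_bounded` and `demo_double_insert` names are hypothetical and hand-written for this example; this is not libevent's code, and it is not confirmed to be the cause of this report.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical minimal list node; libevent's real struct event is larger. */
struct node {
    struct node *next;
    int id;
};

/* Push n at the head of the list. There is no check that n is already linked. */
void push_head(struct node **head, struct node *n)
{
    n->next = *head;
    *head = n;
}

/* Walk the list, giving up after `limit` steps so that a ring cannot hang
 * the caller. Returns the number of nodes visited, or -1 if the limit was hit. */
int count_bounded(const struct node *head, int limit)
{
    int steps = 0;
    for (const struct node *p = head; p != NULL; p = p->next) {
        if (++steps > limit)
            return -1;
    }
    return steps;
}

/* Self-check: inserting the current head node a second time makes it
 * point at itself, so an unbounded walk would never terminate. */
void demo_double_insert(void)
{
    struct node a = { NULL, 1 }, b = { NULL, 2 };
    struct node *head = NULL;

    push_head(&head, &a);
    push_head(&head, &b);
    assert(count_bounded(head, 100) == 2);

    push_head(&head, &b);              /* b is already the head */
    assert(b.next == &b);              /* self-loop */
    assert(count_bounded(head, 100) == -1);
}
```

A loop that walks such a list without a step limit spins at 100% CPU, which would match the reported symptom, but other kinds of corruption could produce the same ring.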
Then it comes down to figuring out where the problem comes from. If the interaction in Unbound is repeatable, it would be useful to have. What sort of activity is happening that causes this? Logs could tell more.
It is possible to compile Unbound without libevent, using --without-libevent; it then uses a built-in, select-based event backend. That would no longer loop inside libevent and, because the code is different, it might expose the bug in Unbound.
There is also libev, another event library that can be used. It works like libevent but has different code inside, which is perhaps precisely the alternative that is needed. If the bug is in Unbound, this may expose it in a different way. That would tell whether the bug is in Unbound, in the library, or specific to a library version.
In https://github.com/libevent/libevent/blob/master/event.c the event_active_nolock_ routine always calls the user's callback, so it could be calling the Unbound callback routine many times. Which Unbound callback routine is that? Knowing this would help pinpoint which events are involved. If the Unbound callback is not called, via event_callback_activate_nolock_ at line 3053, the routine must be returning at line 3016 on the ev->ev_flags & EVLIST_FINALIZING condition, which has a debug comment too. Of course the list should not be an endless loop, but this explains what the code is doing in that list. Debug code for the condition at line 3016 could be helpful, as could the debug assertions from libevent.
Then in https://github.com/libevent/libevent/blob/master/evmap.c there is the evmap_io_active_ routine. It loops over the list and calls the routine for each entry. I guess the list must already have been endless before the routine started. That does not immediately reveal why the list was endless. All of this could be due to the way Unbound calls the library, or to Unbound not re-initializing memory.
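As one possible form of the debug code mentioned above, a cycle check could be run on a list before it is walked. The sketch below uses Floyd's two-pointer technique on a hypothetical `struct ev_node`; the type and the function names `list_has_cycle` and `assert_list_acyclic` are assumptions made for this example and are not part of libevent or Unbound.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a list entry; only the link matters here. */
struct ev_node {
    struct ev_node *next;
};

/* Floyd's cycle detection: advance one pointer by one step and another by
 * two steps. They can only meet again if the list loops back on itself. */
bool list_has_cycle(const struct ev_node *head)
{
    const struct ev_node *slow = head;
    const struct ev_node *fast = head;

    while (fast != NULL && fast->next != NULL) {
        slow = slow->next;
        fast = fast->next->next;
        if (slow == fast)
            return true;
    }
    return false;   /* reached the NULL terminator: no ring */
}

/* Debug helper: abort at the point where the ring is first seen,
 * instead of spinning forever in the loop that walks the list. */
void assert_list_acyclic(const struct ev_node *head)
{
    assert(!list_has_cycle(head));
}
```

Calling such a check after every list modification in a debug build would narrow down which operation creates the ring, at the cost of one extra walk of the list per call.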
(maybe related to #1113)
Thanks for your suggestions. I've added some more GDB screenshots, but currently, I still can't reproduce the issue.
After searching the code for the event used by comm_point_tcp_handle_callback, I found that it has a boolean, event_added, which tracks whether the event has been added with event_add. It is set when the event is added and unset when it is removed, which makes sure the event is not added twice or deleted twice. So I believe this event is not added or deleted twice. If the event_add and event_del calls for the event are not the source of the problem, it has to be something else. The code performs a clean, zeroed malloc before the struct event is used. A look through the code did not find the issue. For reproduction I would want to test the latest code and then enable debugging. The TCP handler event has a timeout as well as read and write events, whereas other event elements, for example with UDP-based communication, have only the fd or only timeouts. It also flips the timeouts and the read and write flags for the TCP communication of DNS datagrams over TCP. That is different from other event elements, but I see no obvious failure.
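The event_added guard described above can be sketched as follows. This is a simplified illustration, not Unbound's actual code: `struct guarded_event`, `guarded_add`, `guarded_del` and `guarded_check` are hypothetical names, and the calls to event_add and event_del are replaced by counters so that the example stands alone.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the structure that owns the event. */
struct guarded_event {
    bool event_added;   /* mirrors the boolean described above */
    int backend_adds;   /* how often the stubbed event_add ran */
    int backend_dels;   /* how often the stubbed event_del ran */
};

/* Register the event unless it is already registered. */
void guarded_add(struct guarded_event *g)
{
    if (g->event_added)
        return;             /* already added: never add twice */
    g->backend_adds++;      /* real code would call event_add() here */
    g->event_added = true;
}

/* Unregister the event unless it is already unregistered. */
void guarded_del(struct guarded_event *g)
{
    if (!g->event_added)
        return;             /* not added: never delete twice */
    g->backend_dels++;      /* real code would call event_del() here */
    g->event_added = false;
}

/* Invariant: adds exceed deletes by one while registered, by zero otherwise. */
void guarded_check(const struct guarded_event *g)
{
    assert(g->backend_adds - g->backend_dels == (g->event_added ? 1 : 0));
}
```

The guard only protects calls that go through these two functions. Any code path that reaches the event library directly, or that reuses the memory while the flag is still set, would bypass it.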