
pdns recursor crashes on startup; the probability of a crash is relatively high

Open zjs604381586 opened this issue 1 year ago • 4 comments

  • system version: Debian 9

  • kernel version: Linux 4.19.117.bsk.10-amd64

  • pdns recursor version: 4.7.1

  • gcc/g++ version: 6.3.0

  • crash infos: Thread 8 "pdns-r/distr" received signal SIGSEGV, Segmentation fault.

  • crash code position:

    • FileName: syncres.cc
    • in class: nsspeeds_t
    • line code: 227 (image)
  • problem causes: The lambda function sometimes uses the now variable after it has been released. The SyncRes sr object may already have been destroyed, and its member d_now is destroyed with it, so the now variable seen inside the lambda function refers to invalid memory and the program crashes.

  • correct code: Modify line 227 as follows, so that the lambda function captures now by value (a copy):

    ind.modify(it, [now](DecayingEwmaCollection& d) { d.d_lastget = now; });
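
To illustrate the general C++ rule this proposal relies on, here is a minimal sketch outside the recursor code base. Entry, makeUpdaterByValue and runAfterOwnerIsGone are hypothetical names invented for this example, and the timestamp is arbitrary. It only shows that a lambda capturing by value keeps its own copy after the original variable is gone; it does not show that the recursor actually runs the lambda that late.

```cpp
#include <ctime>
#include <functional>

// Hypothetical stand-in for the collection entry touched on line 227.
struct Entry {
    std::time_t d_lastget = 0;
};

// Builds an updater that owns a copy of `now` (capture by value).
// The returned callable stays valid after the caller's `now` is gone.
std::function<void(Entry&)> makeUpdaterByValue(const std::time_t& now) {
    return [now](Entry& e) { e.d_lastget = now; };
}

// Creates the updater while a short-lived timestamp exists, runs it
// after that timestamp has been destroyed, and returns what was written.
std::time_t runAfterOwnerIsGone() {
    std::function<void(Entry&)> updater;
    {
        std::time_t now = 1663340940; // arbitrary example timestamp
        updater = makeUpdaterByValue(now);
    } // `now` is destroyed here; a [&now] capture would dangle from this point
    Entry e;
    updater(e); // safe: the lambda reads its own copy
    return e.d_lastget;
}
```

A by-reference capture is only a problem in the deferred case sketched here; if the callable runs while the original variable is still alive, both capture modes behave the same.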
    

zjs604381586 avatar Sep 16 '22 15:09 zjs604381586

Thanks for the report. It's not clear to me yet what sequence of events could cause the scenario you describe. I'm also wondering why you are seeing this and others do not. Do you have a backtrace, perhaps? The configuration file would also be nice. Before fixing this I would like to fully understand it and perhaps write a regression test for it.

omoerbeek avatar Sep 17 '22 14:09 omoerbeek

bt:

    #8 0x00005555559419ef in SyncRes::shuffleInSpeedOrder (this=this@entry=0x7fffd405cec0, tnameservers=std::unordered_map with 13 elements = {...}, prefix="", auth=...) at syncres.cc:1808
    #9 0x000055555591b62b in SyncRes::doResolveAt (this=this@entry=0x7fffd405cec0, nameservers=..., auth=..., flawedNSSet=, flawedNSSet@entry=false, qname=..., qtype=..., ret=..., depth=, beenthere=..., state=, stopAtDelegation=) at syncres.cc:3749
    #10 0x000055555591dd76 in SyncRes::doResolveNoQNameMinimization (this=this@entry=0x7fffd405cec0, qname=..., qtype=..., ret=..., depth=0, beenthere=..., state=, fromCache=, stopAtDelegation=) at syncres.cc:934
    #11 0x000055555591fce4 in SyncRes::doResolve (this=this@entry=0x7fffd405cec0, qname=..., qtype=..., ret=std::vector of length 0, capacity 0, depth=depth@entry=0, beenthere=std::set with 1 elements = {...}, state=) at syncres.cc:767
    #12 0x0000555555920e74 in SyncRes::beginResolve (this=this@entry=0x7fffd405cec0, qname=..., qtype=..., qclass=qclass@entry=1, ret=std::vector of length 0, capacity 0) at syncres.cc:164
    #13 0x0000555555921568 in SyncRes::getRootNS(timeval, std::function<int (ComboAddress const&, DNSName const&, int, bool, bool, int, timeval*, boost::optional<Netmask>&, boost::optional<ResolveContext const&>, LWResult*, bool*)>) (now=..., asyncCallback=...) at syncres.cc:4042
    #14 0x0000555555842df4 in houseKeeping () at pdns_recursor.cc:3005
    #15 0x000055555586756b in MTasker<PacketID, std::__cxx11::basic_string<char, std::char_traits, std::allocator > >::makeThread(void ()(void), void*)::{lambda()#1}::operator()() const (__closure=0x7fffd405d5a8) at mtasker.cc:284
    #16 boost::detail::function::void_function_obj_invoker0<MTasker<PacketID, std::__cxx11::basic_string<char, std::char_traits, std::allocator > >::makeThread(void ()(void), void*)::{lambda()#1}, void>::invoke(boost::detail::function::function_buffer&) (function_obj_ptr=...) at /usr/include/boost/function/function_template.hpp:159
    #17 0x000055555581b479 in boost::function0::operator() (this=0x7fffd405d5a0) at /usr/include/boost/function/function_template.hpp:771
    #18 threadWrapper (t=...) at mtasker_fcontext.cc:144
    #19 0x00007ffff773ae6b in make_fcontext () from /usr/lib/x86_64-linux-gnu/libboost_context.so.1.62.0
    #20 0x0000000000000000 in ?? ()

code: image

describe: You don't need to look at the stack; the problem shows up when analysing the code logic. When the sr.beginResolve call ends, the getRootNS function also returns, and at that point the sr object is destructed. If the lambda function used by the fastest function in nsspeeds_t has not been executed by then, it accesses invalid data and the process crashes.

zjs604381586 avatar Sep 17 '22 15:09 zjs604381586

Is this a backtrace of the crash you are referring to? I see shuffleInSpeedOrder being executed, but getRootNS and beginResolve are on the stack. So the SyncRes object is still alive.

At this moment I still have trouble seeing how fastest and the lambda could be executed while the corresponding SyncRes has gone out of scope. beginResolve is a synchronous function: it returns only after the work is done (even though the name would suggest async execution).
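
The reasoning above can be sketched in isolation. This is a minimal illustration, not code from the recursor: Resolver and modifyNow are hypothetical stand-ins for SyncRes and for a modify-style call that runs its functor before returning. Under that assumption, a by-reference capture of a member is safe as long as the owning object is still on the stack when the call is made.

```cpp
#include <functional>

// Hypothetical stand-in for SyncRes: owns the timestamp member.
struct Resolver {
    long d_now = 42; // arbitrary example value
};

// Hypothetical modify-style helper: invokes the functor immediately,
// so the functor has finished by the time this function returns.
void modifyNow(long& slot, const std::function<void(long&)>& f) {
    f(slot);
}

// The resolver lives on this function's stack for the whole call,
// so the reference captured by the lambda never outlives it.
long resolveOnce() {
    Resolver sr;
    long lastget = 0;
    const long& now = sr.d_now;
    modifyNow(lastget, [&now](long& v) { v = now; }); // runs synchronously
    return lastget; // sr is destroyed only after this point
}
```

The crash scenario from the report would need the functor to be stored and run after resolveOnce has returned, which this synchronous shape does not allow.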

omoerbeek avatar Sep 17 '22 15:09 omoerbeek

I have thought about this a bit more but still have trouble seeing how the circumstances you describe could happen: SyncRes being out of scope while fastest is being executed.

I really would appreciate both a config file and a full backtrace (not leaving out the topmost frames) of an actual crash you observed.

omoerbeek avatar Sep 19 '22 06:09 omoerbeek

Hello @zjs604381586, it has been a week since my questions. Do you have answers?

omoerbeek avatar Sep 26 '22 06:09 omoerbeek

I also looked at the logic again. Maybe my analysis is wrong, sorry.

zjs604381586 avatar Sep 26 '22 07:09 zjs604381586