rtmfp-cpp icon indicating copy to clipboard operation
rtmfp-cpp copied to clipboard

REDIRECTOR CRASH AS CLIENTS CALLED CLOSE.

Open dangvohoanganhvn opened this issue 1 year ago • 12 comments

Hi Thornburgh, I've tested your redirector code in your example on CentOS7 and met a crash as many redirector's client disconnect (close) at the same time with the stack trace below. to specific, This is my test:

  • Run 30 relay connect to redirector and kill relay on CentOS7. rep rate : 7/10
  • This crash not happen in MacOS.

E20240911 14:01:58.894667 45021 ServerApp.cpp:248] *** SIGABRT (@0x3e90000afd5) received by PID 45013 (TID 0x7fe515295700) from PID 45013; stack trace: *** E20240911 14:01:58.894979 45021 ServerApp.cpp:248] @ 0x7fe518019630 (unknown) E20240911 14:01:58.895383 45021 ServerApp.cpp:248] @ 0x7fe516e30387 __GI_raise E20240911 14:01:58.895720 45021 ServerApp.cpp:248] @ 0x7fe516e31a78 __GI_abort E20240911 14:01:58.896059 45021 ServerApp.cpp:248] @ 0x7fe516e72ed7 __libc_message E20240911 14:01:58.896251 45021 ServerApp.cpp:248] @ 0x7fe516e7b299 _int_free E20240911 14:01:58.899547 45021 ServerApp.cpp:248] @ 0x6ee63a com::zenomt::List<>::~List() E20240911 14:01:58.902670 45021 ServerApp.cpp:248] @ 0x6eb4a2 com::zenomt::rtmfp::RecvFlow::~RecvFlow() E20240911 14:01:58.906158 45021 ServerApp.cpp:248] @ 0x6eb629 com::zenomt::rtmfp::RecvFlow::~RecvFlow() E20240911 14:01:58.908824 45021 ServerApp.cpp:248] @ 0x6ef4ba _ZNSt19_Sp_counted_deleterIPN3com6zenomt5rtmfp8RecvFlowEZNS1_9share_refIS3_EESt10shared_ptrIT_EPS7_bEUlPNS1_6ObjectEE_SaIvELN9__gnu_cxx12_Lock_policyE2EE10_M_disposeEv E20240911 14:01:58.911636 45021 ServerApp.cpp:248] @ 0x702fd7 com::zenomt::rtmfp::Session::abortFlowsAndTimers() E20240911 14:01:58.914172 45021 ServerApp.cpp:248] @ 0x7022b0 com::zenomt::rtmfp::Session::abort() E20240911 14:01:58.917240 45021 ServerApp.cpp:248] @ 0x70bbb3 com::zenomt::rtmfp::Session::onCloseAckChunk() E20240911 14:01:58.920346 45021 ServerApp.cpp:248] @ 0x70b155 com::zenomt::rtmfp::Session::onChunk() E20240911 14:01:58.923573 45021 ServerApp.cpp:248] @ 0x6fde94 com::zenomt::rtmfp::ISession::onReceivePacket() E20240911 14:01:58.926578 45021 ServerApp.cpp:248] @ 0x70a4d6 com::zenomt::rtmfp::Session::onReceivePacket() E20240911 14:01:58.929595 45021 ServerApp.cpp:248] @ 0x6dfbcd com::zenomt::rtmfp::RTMFP::onReceivePacket() E20240911 14:01:58.932370 45021 ServerApp.cpp:248] @ 0x6db518 com::zenomt::rtmfp::PosixPlatformAdapter::receiveOnePacket() E20240911 14:01:58.932555 45021 ServerApp.cpp:248] @ 0x6db5cc std::_Function_handler<>::_M_invoke() E20240911 14:01:58.935199 45021 ServerApp.cpp:248] @ 0x6c764d com::zenomt::EPollRunLoop::processActivatedItems() E20240911 14:01:58.938537 45021 ServerApp.cpp:248] @ 0x6c72f0 com::zenomt::EPollRunLoop::run() Do you have any insight for this crash ?

Thanks.

dangvohoanganhvn avatar Sep 11 '24 08:09 dangvohoanganhvn

thank you. i'll look into this.

zenomt avatar Sep 11 '24 14:09 zenomt

the closest i have to CentOS is a Fedora 40 VM. i'm not having any luck reproducing there or on Ubuntu. i'll try to find something with a code stare.

zenomt avatar Sep 11 '24 20:09 zenomt

if you can recompile with debug symbols and get a traceback with file+line numbers, that would also be helpful.

zenomt avatar Sep 11 '24 20:09 zenomt

Thanks, I'll try to reproduce it with your original code.

dangvohoanganhvn avatar Sep 12 '24 01:09 dangvohoanganhvn

the usual kinds of errors that result in a crash like that are either: ".cpp out of sync with .hpp" (if you're using the Makefile and you change a .hpp file, you should do a "make clean" and rebuild everything because header file dependencies are not tracked in the Makefiles); or a "use after free" programming error. i'm not seeing any obvious places in redirector.cpp where use-after-free would happen especially with a RecvFlow -- it looks like the shared_ptrs should be keeping everything around long enough.

Thanks, I'll try to reproduce it with your original code.

are you using redirector.cpp as-is, or have you modified it or incorporated it into your code?

EDIT: note that header files could've changed if you did a "git pull" but didn't then "make clean" before doing "make" again.

zenomt avatar Sep 12 '24 03:09 zenomt

I'm still using the full code of redirector and defined some USER_DATA in RedirectorClient.hpp like enum { USER_DATA_PUBLISH = 0x0d, USER_DATA_UNPUBLISH = 0x0a, USER_DATA_LIVESTREAM_URL = 0x0f, };

And just added a thread to handle another task not related to rtmfp. All functions and run loop platform have not changed, similar to your example. Thanks for your recommendation. I will re-check again

dangvohoanganhvn avatar Sep 12 '24 03:09 dangvohoanganhvn

to avoid conflicts with upstream (my code), i'd avoid making any changes to library files (headers or cpp). "user data" for the Redirector protocol is application-specific, and should probably go in your own application header files separate from RTMFP library files. see https://github.com/zenomt/rtmfp-cpp/blob/0fdd268430f9f453ebc64bb353c0af36a239cc8e/include/rtmfp/namespace.ttl#L138

zenomt avatar Sep 12 '24 03:09 zenomt

have you been able to reproduce with debug symbols? does it still reproduce after a clean and rebuild?

zenomt avatar Sep 13 '24 02:09 zenomt

Thanks for your suggestion. I found that the crash was related to call onDraining and close when client disconnetcs to redirector. If clients only call close without onDraining, everything will be fine. What does onDraining mean on your code?

dangvohoanganhvn avatar Sep 13 '24 02:09 dangvohoanganhvn

is the redirector crashing, or is the RedirectorClient (the "relay" i assume) crashing?

again, a backtrace with the debugging symbols (to get line numbers and no omitted steps from the optimizer) would be very very helpful.

zenomt avatar Sep 13 '24 03:09 zenomt

The redirector is crashing. For the backtrace, I'll get it for you as soon as possible..

dangvohoanganhvn avatar Sep 13 '24 03:09 dangvohoanganhvn

btw "onDraining" is called in the redirector when the client says it is "inactive" but still connected. being inactive removes you from consideration from being redirected to. it's called "draining" because the idea is you'd go inactive to not accept new connections, but you could continue running for a while so your current clients can "drain" naturally.

zenomt avatar Sep 13 '24 03:09 zenomt

The problem seems to be fixed as call m_ackFlows.clear() in Session::abortFlowsAndTimers(). m_ackFlows may be invalid after disconnect/draining...

dangvohoanganhvn avatar Sep 17 '24 02:09 dangvohoanganhvn

there's definitely a problem with not clearing m_ackFlows during abortFlowsAndTimers() since the RecvFlow and the Session have shared_ptr references to each other. it seems like if a RecvFlow was in its session's m_ackFlows when the session moved to closing/closed (because a message came in on a flow right before a Session Close Ack), the RecvFlow could get stuck there and neither would be freed. i don't see how that could lead to the backtrace that you posted at the top of this Issue though. i'll think through it some more in case i'm not seeing something.

zenomt avatar Sep 18 '24 05:09 zenomt

in case i wasn't clear: there's definitely a bug in Session::abortFlowsAndTimers() where there's a potential to leave circular references between m_ackFlows and the RecvFlows in it. thank you for finding that. i have a fix for that prepared (that calls m_ackFlows.clear() at the end and is more conservative with retaining the session), but i don't want to push it to GitHub until i've had more time to test it, and i won't be able to do that before the weekend.

i'm worried because i don't yet see how this bug could cause the backtrace you posted, though.

zenomt avatar Sep 18 '24 06:09 zenomt

For more details, I used the glog library to capture and print the crash backtrace, but I don't know why the backtrace looks like that... however as I try to call m_ackFlows.clear() in Session::abortFlowsAndTimers(), I can't reproduce the crash anymore

dangvohoanganhvn avatar Sep 18 '24 06:09 dangvohoanganhvn

commit https://github.com/zenomt/rtmfp-cpp/commit/2630f3d04b3a582252fe4213637fcc542e000bd8 breaks the reference loop in Session::abortFlowsAndTimers(), and for extra safety also keeps a reference to the Session during the entire packet processing in case the session is deleted and deallocated in the middle of processing a packet.

please let me know if this change resolves your crash (after reverting any local changes you've made).

zenomt avatar Sep 21 '24 19:09 zenomt

Thanks for your help, as I mentioned above ,the issue was solved as call m_ackFlows.clear().

dangvohoanganhvn avatar Sep 23 '24 09:09 dangvohoanganhvn

as I mentioned above ,the issue was solved as call m_ackFlows.clear().

you might've put m_ackFlows.clear() elsewhere in the Session::abortFlowsAndTimers() method. regardless, i'd prefer if you would confirm that this commit as-is resolves your crash, so i can close this Issue. thanks.

zenomt avatar Sep 23 '24 14:09 zenomt

I picked and tried your updated commit and it fixed the crash. Thanks

dangvohoanganhvn avatar Sep 23 '24 14:09 dangvohoanganhvn

thanks for confirming.

zenomt avatar Sep 23 '24 14:09 zenomt