WebSocket Crash on Remote Disconnect
Describe the bug I'm using the WebSocket interface and hit a use case where we send data repeatedly and, during this process, the remote side disconnects; this ultimately results in a crash inside Trantor.
To Reproduce Send data repeatedly and, during this process, have the remote side disconnect; this ultimately results in a crash inside Trantor.
Expected behavior I'd expect a detected connection problem during a send to result in a disconnect and a call to our callback.
I dug into the code a bit and see a problem with the Trantor TcpConnection code. The code does not do anything when an EPIPE or ECONNRESET occurs. No indication is passed to the caller that there was a connection error, so we continue to send, resulting in the crash. Wouldn't it be more appropriate to always treat these errors as a connection problem and call connectDestroyed()? I've tested this and our connection disconnect callback is called. I have only tested this one error (line 956), so it may not be the best solution. Any thoughts on this problem?
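To make the idea concrete, the handling I tested looks roughly like the sketch below (simplified and from memory, not the actual Trantor source; only the errno branch inside sendInLoop() is shown):

// Sketch only: simplified, not the real TcpConnectionImpl::sendInLoop() code.
// Where the error is currently just logged, also tear the connection down so the
// close callback fires and later sends are discarded instead of crashing.
if (errno == EPIPE || errno == ECONNRESET)
{
    LOG_DEBUG << "EPIPE or ECONNRESET, errno=" << errno;
    connectDestroyed();  // proposed addition: propagate the disconnect upward
    return;
}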
Screenshots / Trace:
...
20210920 16:32:29.668717 UTC 9712 TRACE [sendWsData] send 111 bytes - WebSocketConnectionImpl.cc:70
20210920 16:32:29.668989 UTC 9712 TRACE [sendWsData] send 111 bytes - WebSocketConnectionImpl.cc:70
20210920 16:32:29.669017 UTC 9712 DEBUG [sendInLoop] EPIPE or ECONNRESET, erron=104 - TcpConnectionImpl.cc:956
20210920 16:32:29.669263 UTC 9712 TRACE [sendWsData] send 111 bytes - WebSocketConnectionImpl.cc:70
Crash:
libtrantor.so.1!trantor::TcpConnectionImpl::sendInLoop(trantor::TcpConnectionImpl * const this, const void * buffer, size_t length) (drogon/trantor/trantor/net/inner/TcpConnectionImpl.cc:956)
libtrantor.so.1!trantor::TcpConnectionImpl::send(trantor::TcpConnectionImpl * const this, std::__cxx11::string && msg) (drogon/trantor/trantor/net/inner/TcpConnectionImpl.cc:1180)
libdrogon.so.1!drogon::WebSocketConnectionImpl::sendWsData(drogon::WebSocketConnectionImpl * const this, const char * msg, uint64_t len, unsigned char opcode) (drogon/lib/src/WebSocketConnectionImpl.cc:156)
libdrogon.so.1!drogon::WebSocketConnectionImpl::send(drogon::WebSocketConnectionImpl * const this, const char * msg, uint64_t len, const drogon::WebSocketMessageType type) (drogon/lib/src/WebSocketConnectionImpl.cc:63)
libdrogon.so.1!drogon::WebSocketConnectionImpl::send(drogon::WebSocketConnectionImpl * const this, const std::__cxx11::string & msg, const drogon::WebSocketMessageType type) (drogon/lib/src/WebSocketConnectionImpl.cc:161)
...
Desktop (please complete the following information):
- OS: Ubuntu 18.04 (should not matter)
@tysenmoore-xse thanks for your feedback. Which version of drogon did you use? Would you please make a simple demo to reproduce this issue?
I am at the Aug 15 commit--I have the stop() change. I will try to craft a simpler app.
I spent the day trying to write a simplified application to reproduce this crash but cannot seem to get the timing right. The basic steps for the problem are as follows:
- The Application (or APP) is set up as an HttpController
- The Tester Application (or TESTER) starts a web socket server
- The TESTER makes an HTTP request to the APP and indicates what URL to use for a web socket connection
- The APP gets the HTTP request and uses drogon::WebSocketClient::newWebSocketClient() to create the web socket connection to the server (see the sketch after this list for roughly what this step looks like)
- The TESTER gets the response and sends an HTTP request to the APP to start sending data, also indicating the number of messages to send
- The APP gets the HTTP request and starts sending that number of messages to the TESTER using:
int count = (*jReq)["params"]["count"].asInt(); // WHERE `count` is 12
while ( count > 0 )
{
    drogon::app().getLoop()->queueInLoop([&, count]() {
        if ( wsConn )
        {
            if ( !wsConn->disconnected() )
            {
                //LOG_DEBUG << "send["<< count << "]...";
                wsConn->send(TEST_DATA);
            }
            else
            {
                LOG_DEBUG << "disconnected...";
                wsClient->stop();
            }
        }
    });
    count--;
}
- The TESTER is set up to close the web socket server after receiving the third message
- The APP gets an ECONNRESET prior to completing all 12 message send operations. The next message sent after the ECONNRESET results in the APP crashing with the stack trace I sent earlier. The reason for the crash is that the code does not like trying to send to a closed socket. The TCP connection does see the ECONNRESET but does not propagate this error to any other code.
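For reference, the APP side creates the web socket client (the newWebSocketClient() step above) roughly like the sketch below; the address, path, and handler bodies are illustrative, not our real code:

#include <drogon/drogon.h>
#include <drogon/WebSocketClient.h>
#include <trantor/utils/Logger.h>

// Sketch of the client setup: create the client, register the handlers, connect.
// "ws://127.0.0.1:8848" and "/ws" stand in for the URL supplied by the TESTER.
drogon::WebSocketClientPtr makeWsClient()
{
    auto wsClient = drogon::WebSocketClient::newWebSocketClient("ws://127.0.0.1:8848");
    auto req = drogon::HttpRequest::newHttpRequest();
    req->setPath("/ws");

    wsClient->setMessageHandler([](const std::string &message,
                                   const drogon::WebSocketClientPtr &,
                                   const drogon::WebSocketMessageType &) {
        LOG_DEBUG << "received: " << message;
    });
    wsClient->setConnectionClosedHandler([](const drogon::WebSocketClientPtr &) {
        // This is the close callback we expect to fire on a remote disconnect.
        LOG_DEBUG << "web socket closed";
    });
    wsClient->connectToServer(req,
                              [](drogon::ReqResult result,
                                 const drogon::HttpResponsePtr &,
                                 const drogon::WebSocketClientPtr &) {
                                  if (result == drogon::ReqResult::Ok)
                                      LOG_DEBUG << "connected";
                              });
    return wsClient;
}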
I can reproduce this 100% of the time with my large application code and a couple of tests written to run in NodeJS. I tried to reproduce the test and application in a more compact form, but there the APP seems to send all of its messages first, and only then does the close notification come through. I am not sure why my smaller APP sends the data any differently, but there is a timing difference between the two.
When I add connectDestroyed() to the ECONNRESET handling in sendInLoop(), the upper layers (including my registered close callback) are called. Also, the state is changed to disconnected and any following send operations are properly discarded without a crash. I don't know the code well enough to know whether this is the best solution, so I defer to others. I do see other ECONNRESET errors that are ignored, so there may be other use cases that suffer the same problem. It does seem odd that the TCP handling sees the error and does not make it known to other layers.
I have attached the example code in its current state for reference, but realize the timing is not correct so the error is not reproduced. Any help or suggestions would be useful.
To run the test:
- Build, then run the websocket_server
- In the tester_app, run npm install once
- Run npm run test
I can't reproduce the issue on my macOS machine; what is your environment?

I checked your code: the wsClient and wsConn objects can be written and read from multiple threads at the same time (if you have multiple IO threads), which is not safe.
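For example, something closer to this is safer (a sketch based on your snippet; wsConn and TEST_DATA are the names from your code): copy the shared_ptr into the task so the lambda owns its own reference instead of reading the member from another thread.

// Sketch based on the example code above: value-capture a copy of the
// connection shared_ptr so the queued task does not touch the wsConn member
// from another thread and keeps the connection alive while it runs.
auto conn = wsConn;  // copy the shared_ptr up front
drogon::app().getLoop()->queueInLoop([conn]() {
    if (conn && !conn->disconnected())
        conn->send(TEST_DATA);
});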
I appreciate you looking into the code and offering suggestions. The code I presented does not (unfortunately) reproduce the problem. Our normal config file sets threads_num to one. I tried setting threads_num to 0, 1, and 2, and none of these reproduced the problem either.
Let me better describe our larger application in which we are seeing problems. We have HW that is running Drogon for command and control of our HW/SW. Our current development environment is Ubuntu 18.04, while the FW will run Linux 5.x. Unlike most users of Drogon, we have to serialize the requests; otherwise we could be configuring the HW/SW inconsistently, etc. All our controller APIs call a single method to queue up the request for processing. When a request gets processed, we call drogon::app().getLoop()->queueInLoop() to execute the handler, which will send a response to the request. So unlike most users, we really would like a single thread to simplify our design.
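In rough outline (illustrative names, not our real code), the serialization looks like this:

#include <drogon/drogon.h>
#include <functional>

// Sketch of the serialization described above; enqueueCommand is a made-up name.
// Every controller funnels its work through this single entry point, so commands
// run one at a time on the main loop and never reconfigure the HW/SW concurrently.
void enqueueCommand(std::function<void()> handler)
{
    drogon::app().getLoop()->queueInLoop(std::move(handler));
}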
Based on your last response about the threading, we changed our use of getLoop() to getIOLoop(0) since we set threads_num to one. After running our tests, the problem has gone away. I'd really like to understand the Drogon architecture a little more. I looked over some of the documentation and only see light mentions of thread usage. So some of my questions are:
- There seems to be an app thread and an IO thread (or many, depending on threads_num). I saw mention that IO threads are for "events/actions". Can you explain why mixing the thread usage may be a problem (locks, etc.)? (My current mental model of the loop usage is sketched after these questions.)
- We were doing all the locking, etc. to process the request later and execute it on the app thread, but this was allowing an ECONNRESET to occur (no other problems), whereas responding on the original request thread does not suffer this problem. Until we moved to web socket support we had no problems with this design (right or wrong). Can you explain what Drogon is doing that might cause this?
- What is the use of the app thread, since all IO must be done on IO threads? I see it says "timers and other tasks", but what "other tasks"?
- What prevents the ECONNRESET during a send operation from happening? You have code to check for it, but it is only logged and not passed to other layers of the SW. Why? Even though the problem is now gone for our use case, this still seems like a possible issue, no?
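For context, my current mental model of the loop usage (referenced in the first question above) is roughly the following; this is my assumption, not something from the docs:

// My assumption of how work should be confined to a single loop.
auto sendNow = []() { /* perform the wsConn->send(...) here */ };
auto *ioLoop = drogon::app().getIOLoop(0);  // the only IO loop when threads_num is 1
if (ioLoop->isInLoopThread())
    sendNow();                   // already on that loop's thread, run directly
else
    ioLoop->runInLoop(sendNow);  // otherwise hand the work off to that loop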
Any information you can provide would be very useful in filling some gaps in my understanding of Drogon.
Thanks so much for the time and support.
Turns out I am now seeing crashes in other places of the code. I changed the simple app I sent you so that drogon::app().getLoop() becomes drogon::app().getIOLoop(0). However, when I run the tests (npm run test) several times in a row, even the simple app I wrote crashes.
20210923 13:00:03.402606 UTC 28929 TRACE [createNonblockingSocketOrDie] sock=9 - Socket.h:46
20210923 13:00:03.402652 UTC 28929 TRACE [connect] connecting - Connector.cc:67
20210923 13:00:03.402664 UTC 28929 TRACE [connecting] connecting:9 - Connector.cc:125
20210923 13:00:03.402750 UTC 28929 TRACE [~Socket] Socket deconstructed:15 - Socket.cc:241
20210923 13:00:03.402821 UTC 28929 TRACE [~TcpClient] TcpClient::~TcpClient[httpClient] - connector - TcpClient.cc:79
20210923 13:00:03.402889 UTC 28929 TRACE [handleError] SO_ERROR = 111 Connection refused - Connector.cc:217
20210923 13:00:03.402928 UTC 28929 TRACE [operator()] error connecting to server - WebSocketClientImpl.cc:94
Call stack: (of the app thread)
[Unknown/Just-In-Time compiled code] (Unknown Source:0)
libdrogon.so.1!std::function<void (drogon::ReqResult, std::shared_ptr<drogon::HttpResponse> const&, std::shared_ptr<drogon::WebSocketClient> const&)>::operator()(drogon::ReqResult, std::shared_ptr<drogon::HttpResponse> const&, std::shared_ptr<drogon::WebSocketClient> const&) const(const std::function<void(drogon::ReqResult, const std::shared_ptr<drogon::HttpResponse>&, const std::shared_ptr<drogon::WebSocketClient>&)> * const this, __args#0, __args#1, __args#2) (/usr/include/c++/7/bits/std_function.h:706)
libdrogon.so.1!drogon::WebSocketClientImpl::<lambda()>::operator()(void) const(const drogon::WebSocketClientImpl::<lambda()> * const __closure) (drogon/lib/src/WebSocketClientImpl.cc:95)
libdrogon.so.1!std::_Function_handler<void(), drogon::WebSocketClientImpl::createTcpClient()::<lambda()> >::_M_invoke(const std::_Any_data &)(const std::_Any_data & __functor) (/usr/include/c++/7/bits/std_function.h:316)
libdrogon.so.1!std::function<void ()>::operator()() const(const std::function<void()> * const this) (/usr/include/c++/7/bits/std_function.h:706)
libtrantor.so.1!trantor::TcpClient::<lambda()>::operator()(void) const(const trantor::TcpClient::<lambda()> * const __closure) (drogon/trantor/trantor/net/TcpClient.cc:71)
libtrantor.so.1!std::_Function_handler<void(), trantor::TcpClient::TcpClient(trantor::EventLoop*, const trantor::InetAddress&, const string&)::<lambda()> >::_M_invoke(const std::_Any_data &)(const std::_Any_data & __functor) (/usr/include/c++/7/bits/std_function.h:316)
libdrogon.so.1!std::function<void ()>::operator()() const(const std::function<void()> * const this) (/usr/include/c++/7/bits/std_function.h:706)
libtrantor.so.1!trantor::Connector::handleError(trantor::Connector * const this) (drogon/trantor/trantor/net/inner/Connector.cc:232)
libtrantor.so.1!std::__invoke_impl<void, void (trantor::Connector::*&)(), std::shared_ptr<trantor::Connector>&>(void (trantor::Connector::*&)(trantor::Connector * const) __f, std::shared_ptr<trantor::Connector> & __t) (/usr/include/c++/7/bits/invoke.h:73)
libtrantor.so.1!std::__invoke<void (trantor::Connector::*&)(), std::shared_ptr<trantor::Connector>&>(void (trantor::Connector::*&)(trantor::Connector * const) __fn) (/usr/include/c++/7/bits/invoke.h:95)
libtrantor.so.1!std::_Bind<void (trantor::Connector::*(std::shared_ptr<trantor::Connector>))()>::__call<void, , 0ul>(std::tuple<>&&, std::_Index_tuple<0ul>)(std::_Bind<void (trantor::Connector::*(std::shared_ptr<trantor::Connector>))()> * const this, std::tuple<> && __args) (/usr/include/c++/7/functional:467)
libtrantor.so.1!std::_Bind<void (trantor::Connector::*(std::shared_ptr<trantor::Connector>))()>::operator()<, void>()(std::_Bind<void (trantor::Connector::*(std::shared_ptr<trantor::Connector>))()> * const this) (/usr/include/c++/7/functional:551)
libtrantor.so.1!std::_Function_handler<void (), std::_Bind<void (trantor::Connector::*(std::shared_ptr<trantor::Connector>))()> >::_M_invoke(std::_Any_data const&)(const std::_Any_data & __functor) (/usr/include/c++/7/bits/std_function.h:316)
libdrogon.so.1!std::function<void ()>::operator()() const(const std::function<void()> * const this) (/usr/include/c++/7/bits/std_function.h:706)
libtrantor.so.1!trantor::Channel::handleEventSafely(trantor::Channel * const this) (drogon/trantor/trantor/net/Channel.cc:80)
libtrantor.so.1!trantor::Channel::handleEvent(trantor::Channel * const this) (drogon/trantor/trantor/net/Channel.cc:66)
libtrantor.so.1!trantor::EventLoop::loop(trantor::EventLoop * const this) (drogon/trantor/trantor/net/EventLoop.cc:190)
libdrogon.so.1!drogon::HttpAppFrameworkImpl::run(drogon::HttpAppFrameworkImpl * const this) (drogon/lib/src/HttpAppFrameworkImpl.cc:598)
main() (drogon/examples/websocket_server/WebSocketServer.cc:135)
I see this a little differently in my larger app. I am still working to track this down further, but I see:
Call Stack: (app thread)
libtrantor.so.1!trantor::Channel::remove(trantor::Channel * const this) (/home/tmoore/work/devel/a2b/core-service/repos/drogon/trantor/trantor/net/Channel.cc:43)
libtrantor.so.1!trantor::TcpConnectionImpl::connectDestroyed(trantor::TcpConnectionImpl * const this) (/home/tmoore/work/devel/a2b/core-service/repos/drogon/trantor/trantor/net/inner/TcpConnectionImpl.cc:905)
libtrantor.so.1!std::__invoke_impl<void, void (trantor::TcpConnectionImpl::*&)(), std::shared_ptr<trantor::TcpConnectionImpl>&>(void (trantor::TcpConnectionImpl::*&)(trantor::TcpConnectionImpl * const) __f, std::shared_ptr<trantor::TcpConnectionImpl> & __t) (/usr/include/c++/7/bits/invoke.h:73)
libtrantor.so.1!std::__invoke<void (trantor::TcpConnectionImpl::*&)(), std::shared_ptr<trantor::TcpConnectionImpl>&>(void (trantor::TcpConnectionImpl::*&)(trantor::TcpConnectionImpl * const) __fn) (/usr/include/c++/7/bits/invoke.h:95)
libtrantor.so.1!std::_Bind<void (trantor::TcpConnectionImpl::*(std::shared_ptr<trantor::TcpConnectionImpl>))()>::__call<void, , 0ul>(std::tuple<>&&, std::_Index_tuple<0ul>)(std::_Bind<void (trantor::TcpConnectionImpl::*(std::shared_ptr<trantor::TcpConnectionImpl>))()> * const this, std::tuple<> && __args) (/usr/include/c++/7/functional:467)
libtrantor.so.1!std::_Bind<void (trantor::TcpConnectionImpl::*(std::shared_ptr<trantor::TcpConnectionImpl>))()>::operator()<, void>()(std::_Bind<void (trantor::TcpConnectionImpl::*(std::shared_ptr<trantor::TcpConnectionImpl>))()> * const this) (/usr/include/c++/7/functional:551)
libtrantor.so.1!std::_Function_handler<void (), std::_Bind<void (trantor::TcpConnectionImpl::*(std::shared_ptr<trantor::TcpConnectionImpl>))()> >::_M_invoke(std::_Any_data const&)(const std::_Any_data & __functor) (/usr/include/c++/7/bits/std_function.h:316)
std::function<void ()>::operator()() const(const std::function<void()> * const this) (/usr/include/c++/7/bits/std_function.h:706)
libtrantor.so.1!trantor::EventLoop::doRunInLoopFuncs(trantor::EventLoop * const this) (/home/tmoore/work/devel/a2b/core-service/repos/drogon/trantor/trantor/net/EventLoop.cc:302)
libtrantor.so.1!trantor::EventLoop::loop(trantor::EventLoop * const this) (/home/tmoore/work/devel/a2b/core-service/repos/drogon/trantor/trantor/net/EventLoop.cc:195)
libdrogon.so.1!drogon::HttpAppFrameworkImpl::run(drogon::HttpAppFrameworkImpl * const this) (/home/tmoore/work/devel/a2b/core-service/repos/drogon/lib/src/HttpAppFrameworkImpl.cc:598)
main(int argc, char ** argv) (/home/tmoore/work/devel/a2b/core-service/main.cc:147)
In this larger app I am cleaning up a connection that was closed. I am notified of the web socket close and later I clean up the web socket instances.
Do I need to ensure I clean up a web socket only on the IO thread?
I think you'd better use "value capture" instead of "reference capture".
Are you referring to the lambda function called from queueInLoop? I can give that a try.
I changed it to drogon::app().getIOLoop(0)->queueInLoop([this]() { and it still crashes.
I was able to move to the latest trantor and my crashes appear to go away.
drogon::app().getLoop()->queueInLoop([wsConn, count]() ...)
I tried that as well but it does not compile (error: capture of non-variable ‘JsonRpcCtrl::wsConn’). I updated to the latest Drogon and am still fighting these crashes. It is the same crash, so I am trying to nail down the reason for the problem.
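(For reference, value-capturing a member seems to require either a local copy or an init-capture; a sketch, assuming wsConn is a std::shared_ptr member:)

// Sketch: a member cannot be captured by name, but these forms compile and
// still capture the connection by value.
auto conn = wsConn;  // local copy first
drogon::app().getIOLoop(0)->queueInLoop([conn]() { /* use conn */ });
// or, with a C++14 init-capture:
drogon::app().getIOLoop(0)->queueInLoop([conn = wsConn]() { /* use conn */ });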
My larger app crashes within a couple test runs at:
Channel::remove / assert(events_ == kNoneEvent);
The test app I provided seems solid under the master Drogon; it is just my larger app I am still fighting.
I may have come up with a solution for our larger app. Our larger app has a test case/use case where we do something like the smaller app I sent, except in the wsConn->disconnected() handling we get rid of our web socket client, etc. references. It seems this cleanup (releasing the shared_ptr references) while inside the IO thread's processing/stack causes problems for the other calls. Through some trial and error I have found that removing my wsClient/wsConn references on the app thread using drogon::app().getLoop()->runInLoop([this]() { ... }) seems to prevent the SO_ERROR for connection-reset and connection-refused errors. If I attempt to clean up on the IO thread, it still crashes. I was now able to run my 54 tests 120 times, whereas previously they crashed within 1 to 5 runs. Since I don't have a great feel for the problem, I fear I have only altered the timing and am preventing the real problem from happening now--likely to come back later. Any thoughts? Or is this solution a fluke? (A sketch of the cleanup itself follows the code below.)
The basic code is:
drogon::app().getIOLoop(0)->queueInLoop([this]() {
    if ( wsConn )
    {
        if ( !wsConn->disconnected() )
        {
            wsConn->send(TEST_DATA);
        }
        else
        {
            wsClient->stop();
            // This is basically the location in our larger app where we stop() then
            // release our wsClient/wsConn references.
        }
    }
});
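And the cleanup itself, moved onto the app thread as described above, is roughly (same member names as in the snippet):

// Cleanup sketch: stop and drop the references on the app thread rather than
// inside the IO thread's callback stack that reported the disconnect.
drogon::app().getLoop()->runInLoop([this]() {
    if (wsClient)
        wsClient->stop();
    wsConn.reset();
    wsClient.reset();
});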