
AddressSanitizer: heap-use-after-free in TcpClient::removeConnection

Open gschmottlach-xse opened this issue 2 years ago • 6 comments

Describe the bug: While running my Drogon application with AddressSanitizer enabled, it occasionally terminates because ASan detects use of an object after it has been freed. This happens when TcpClient::removeConnection() is called. My application uses Drogon to hold a WebSocket connection to a client, over which it sends events. In some scenarios the client terminates abruptly (e.g. forcefully disconnects) while the application is still trying to send these events. The application detects this and attempts to "stop" the WebSocket connection and tear it down. This works 99.8% of the time, but in some racy situations Drogon/Trantor appears to call TcpClient::removeConnection() after the TcpClient has already been destroyed.

I looked at the code, and the only place I see TcpClient::removeConnection() being called is from the connection's close callback, which is set in TcpClient::newConnection(), i.e. conn->setCloseCallback(std::bind(&TcpClient::removeConnection, this, _1));. What concerns me is that conn is a shared pointer while that callback is initialized with a raw this pointer to the TcpClient instance. It seems plausible that the TcpClient instance could already have been destroyed by the time the connection's close callback is invoked, and that appears to be the scenario I am seeing. I do see that the TcpClient::~TcpClient destructor makes some effort to re-set conn->setCloseCallback(), but it is not clear whether this always happens within the scope of the destructor itself or is scheduled to run later on the event loop.

Could you shed some light on the circumstances under which TcpClient::removeConnection() should be called, and on why the raw this pointer passed to conn->setCloseCallback() is not a potential problem? Again, I seem to see the issue more often when the WebSocket client aborts the connection abruptly (i.e. with no clean WebSocket closing handshake). How is the connection prevented from calling TcpClient::removeConnection() after the TcpClient has been destroyed? Any insights or suggestions you could provide would be appreciated.
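For illustration, here is a minimal, self-contained sketch of the lifetime race I am describing, using hypothetical Client/Connection types rather than Trantor's classes; under AddressSanitizer the last line reports a heap-use-after-free when the stale callback fires:

// Minimal, self-contained sketch of the lifetime race, with hypothetical
// Client/Connection types standing in for TcpClient/TcpConnection. The close
// callback captures a raw pointer to its owner, the owner is destroyed first,
// and the callback later touches freed memory.
#include <functional>
#include <iostream>
#include <memory>
#include <string>

struct Connection
{
    std::function<void()> closeCallback_;
    void close()
    {
        if (closeCallback_)
            closeCallback_();
    }
};

struct Client
{
    explicit Client(std::shared_ptr<Connection> conn) : conn_(std::move(conn))
    {
        // Same shape as conn->setCloseCallback(std::bind(&TcpClient::removeConnection, this, _1)):
        // the callback keeps only a raw `this`, with nothing tying it to the Client's lifetime.
        conn_->closeCallback_ = [this] { removeConnection(); };
    }
    void removeConnection()
    {
        std::cout << "removeConnection on " << name_ << "\n";  // reads a member of the freed Client
    }
    std::string name_{"client"};
    std::shared_ptr<Connection> conn_;
};

int main()
{
    auto conn = std::make_shared<Connection>();
    auto client = std::make_unique<Client>(conn);
    client.reset();   // the Client is destroyed, but the Connection outlives it
    conn->close();    // fires the stale callback: heap-use-after-free under AddressSanitizer
    return 0;
}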

To Reproduce: Occurs more frequently when the WebSocket connection is abruptly disconnected by the client application.

Expected behavior: TcpClient::removeConnection() should not be called after the TcpClient has been destroyed.

Screenshots: N/A

Desktop (please complete the following information):

  • OS: Linux

gschmottlach-xse · Jan 12 '22 23:01

I've done some additional investigation into this bug and have verified that there are indeed situations where TcpClient::removeConnection() is called from the connection's close callback on a TcpClient that has already been freed. To address this I propose the following patches:

diff --git a/trantor/net/TcpClient.cc b/trantor/net/TcpClient.cc
index cc409e6..73d8b66 100644
--- a/trantor/net/TcpClient.cc
+++ b/trantor/net/TcpClient.cc
@@ -82,25 +82,25 @@ TcpClient::~TcpClient()
     {
         std::lock_guard<std::mutex> lock(mutex_);
         conn = std::dynamic_pointer_cast<TcpConnectionImpl>(connection_);
-    }
-    if (conn)
-    {
-        assert(loop_ == conn->getLoop());
-        // TODO: not 100% safe, if we are in different thread
-        auto loop = loop_;
-        loop_->runInLoop([conn, loop]() {
-            conn->setCloseCallback([loop](const TcpConnectionPtr &connPtr) {
-                loop->queueInLoop([connPtr]() {
-                    static_cast<TcpConnectionImpl *>(connPtr.get())
-                        ->connectDestroyed();
+        if (conn)
+        {
+            assert(loop_ == conn->getLoop());
+            // TODO: not 100% safe, if we are in different thread
+            auto loop = loop_;
+            loop_->runInLoop([conn, loop]() {
+                conn->setCloseCallback([loop](const TcpConnectionPtr &connPtr) {
+                    loop->queueInLoop([connPtr]() {
+                        static_cast<TcpConnectionImpl *>(connPtr.get())
+                            ->connectDestroyed();
+                    });
                 });
             });
-        });
-        conn->forceClose();
-    }
-    else
-    {
-        connector_->stop();
+            conn->forceClose();
+        }
+        else
+        {
+            connector_->stop();
+        }
     }
 }
 
@@ -166,7 +166,23 @@ void TcpClient::newConnection(int sockfd)
     conn->setConnectionCallback(connectionCallback_);
     conn->setRecvMsgCallback(messageCallback_);
     conn->setWriteCompleteCallback(writeCompleteCallback_);
-    conn->setCloseCallback(std::bind(&TcpClient::removeConnection, this, _1));
+
+    std::weak_ptr<TcpClient> weakSelf(shared_from_this());
+    auto closeCb = std::function<void(const TcpConnectionPtr &)>([weakSelf](const TcpConnectionPtr & c) {
+        if ( auto self = weakSelf.lock() )
+        {
+            self->removeConnection(c);
+        }
+        // Else the TcpClient instance has already been destroyed
+        else
+        {
+            LOG_DEBUG << "TcpClient::removeConnection was skipped because TcpClient instanced already freed";
+            c->getLoop()->queueInLoop(
+                std::bind(&TcpConnectionImpl::connectDestroyed,
+                  std::dynamic_pointer_cast<TcpConnectionImpl>(c)));
+        }
+    });
+    conn->setCloseCallback(closeCb);
     {
         std::lock_guard<std::mutex> lock(mutex_);
         connection_ = conn;
diff --git a/trantor/net/TcpClient.h b/trantor/net/TcpClient.h
index a2fba76..1034be9 100644
--- a/trantor/net/TcpClient.h
+++ b/trantor/net/TcpClient.h
@@ -34,7 +34,7 @@ class SSLContext;
  * @brief This class represents a TCP client.
  *
  */
-class TRANTOR_EXPORT TcpClient : NonCopyable
+class TRANTOR_EXPORT TcpClient : NonCopyable, public std::enable_shared_from_this<TcpClient>
 {
   public:
     /**

The first problem, in TcpClient::~TcpClient(), is that the mutex protecting the connection needs to be held while the connection is being used. Previously it was locked only momentarily while the connection was first accessed via the dynamic_pointer_cast(). I suspect concurrent access to the connection should be prevented for the whole scope in which the TcpClient uses it, as sketched below.
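As a schematic illustration of the shape of that change (illustrative names, not the real TcpClient), the old code locks only around the read of connection_, while the patched code keeps the lock held across the code that goes on to use it; whether the wider scope is strictly necessary depends on what else touches connection_ concurrently:

// Schematic sketch of narrow vs. wide lock scope, with hypothetical names.
#include <memory>
#include <mutex>

struct Conn
{
    void forceClose() {}
};

struct ClientLike
{
    std::mutex mutex_;
    std::shared_ptr<Conn> connection_;

    void teardownNarrowScope()               // old shape: lock released before the connection is used
    {
        std::shared_ptr<Conn> conn;
        {
            std::lock_guard<std::mutex> lock(mutex_);
            conn = connection_;              // read under the lock ...
        }
        if (conn)
            conn->forceClose();              // ... used after the lock has been released
    }

    void teardownWideScope()                 // patched shape: lock held for the whole block
    {
        std::lock_guard<std::mutex> lock(mutex_);
        if (connection_)
            connection_->forceClose();
        // else: stop the connector, still under the same lock
    }
};

int main()
{
    ClientLike client;
    client.connection_ = std::make_shared<Conn>();
    client.teardownWideScope();
    return 0;
}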

The second problem was devising a mechanism to prevent the connection's close callback (set via setCloseCallback()) from calling TcpClient::removeConnection() on an already-deleted TcpClient instance. This is fixed by publicly inheriting TcpClient from std::enable_shared_from_this<TcpClient> and installing a std::function<> that captures a weak pointer to the TcpClient; inside the callback, the weak pointer is locked to recover a shared pointer. If the lock fails, the TcpClient instance has already been destroyed, and the callback performs the same connection clean-up that TcpClient::removeConnection() would have done. If the TcpClient instance is still alive, TcpClient::removeConnection() is called directly.
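Here is the same guard pattern in isolation, again with hypothetical Client/Connection types rather than Trantor's classes:

// The callback holds a weak_ptr and only calls back into the client if it is
// still alive when the connection closes.
#include <functional>
#include <iostream>
#include <memory>

struct Connection
{
    std::function<void()> closeCallback_;
    void close()
    {
        if (closeCallback_)
            closeCallback_();
    }
};

struct Client : public std::enable_shared_from_this<Client>
{
    void attach(const std::shared_ptr<Connection> &conn)
    {
        std::weak_ptr<Client> weakSelf = shared_from_this();
        conn->closeCallback_ = [weakSelf] {
            if (auto self = weakSelf.lock())
                self->removeConnection();   // client still alive: normal path
            else
                std::cout << "client already destroyed; clean up the connection only\n";
        };
    }
    void removeConnection()
    {
        std::cout << "removeConnection\n";
    }
};

int main()
{
    auto conn = std::make_shared<Connection>();
    {
        auto client = std::make_shared<Client>();  // must already be owned by a shared_ptr
        client->attach(conn);
        conn->close();   // client alive: removeConnection() runs
    }
    conn->close();       // client gone: the weak_ptr fails to lock, no use-after-free
    return 0;
}

Note that shared_from_this() is only valid once the object is already owned by a std::shared_ptr, so this approach assumes the TcpClient is created and managed through a shared pointer.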

During this investigation and testing I discovered another bug, in trantor/net/EventLoop.cc. Basically, in the destructor EventLoop::~EventLoop() the assert(!looping_) was being triggered. The problem is that the call to quit() in the destructor returns quickly, but the loop() function may take some time to notice the request, exit its while loop, and set looping_ = false. There is a small window in which the destructor's assert can run before loop() has exited. My patch, although a bit kludgey, works on the assumption that EventLoop::loop() will exit (fairly quickly) once quit() has set quit_: the destructor simply spins waiting for that to happen instead of asserting immediately after the call to quit().

Also, since two or more threads routinely call methods on EventLoop, I made the looping_ and quit_ variables volatile so the compiler cannot optimize away reads of them. Since we do not need a consistent bool value from looping_ and quit_ at any single instant, they are not made std::atomic; we only need to eventually observe the state change, even if a read is temporarily inconsistent while the bool's memory location is being updated. Below is my patch:

diff --git a/trantor/net/EventLoop.cc b/trantor/net/EventLoop.cc
index 8a60a03..94e8a53 100644
--- a/trantor/net/EventLoop.cc
+++ b/trantor/net/EventLoop.cc
@@ -112,7 +112,17 @@ void EventLoop::resetAfterFork()
 EventLoop::~EventLoop()
 {
     quit();
-    assert(!looping_);
+
+    // Spin waiting for the loop to exit because
+    // this can take some time to complete.
+    // We could alternatively sleep (and block) this
+    // thread for a short time if
+    // looping_ == true and then re-check.
+    // If it still hasn't exited then one option
+    // might be to abort the program.
+    // assert(!looping_);
+    while ( looping_ );
+
     t_loopInThisThread = nullptr;
 #ifdef __linux__
     close(wakeupFd_);
diff --git a/trantor/net/EventLoop.h b/trantor/net/EventLoop.h
index d28f6b8..1278726 100644
--- a/trantor/net/EventLoop.h
+++ b/trantor/net/EventLoop.h
@@ -287,9 +287,9 @@ class TRANTOR_EXPORT EventLoop : NonCopyable
     void abortNotInLoopThread();
     void wakeup();
     void wakeupRead();
-    bool looping_;
+    volatile bool looping_;
     std::thread::id threadId_;
-    bool quit_;
+    volatile bool quit_;
     std::unique_ptr<Poller> poller_;
 
     ChannelList activeChannels_;
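
To make the race easier to see outside Trantor, here is a stripped-down sketch of the quit()/assert window and the wait-in-destructor fix, using a hypothetical MiniLoop type rather than the real EventLoop; the flags are std::atomic here only so the standalone example is free of data races, which is a separate question from the volatile choice in the patch above:

// quit() returns immediately, but the loop thread needs time to observe it and
// clear looping_, so asserting !looping_ right after quit() can fire.
#include <atomic>
#include <memory>
#include <thread>

struct MiniLoop
{
    std::atomic<bool> looping_{false};
    std::atomic<bool> quit_{false};

    void loop()
    {
        looping_ = true;
        while (!quit_)
            std::this_thread::yield();  // stand-in for poller wait + event handling
        looping_ = false;
    }

    void quit()
    {
        quit_ = true;
    }

    ~MiniLoop()
    {
        quit();
        // assert(!looping_);       // the original check: can fire while loop() is still winding down
        while (looping_)            // the patch's approach: wait for loop() to finish instead
            std::this_thread::yield();
    }
};

int main()
{
    auto loop = std::make_unique<MiniLoop>();
    MiniLoop *raw = loop.get();
    std::thread t([raw] { raw->loop(); });
    while (!raw->looping_)          // make sure the loop is actually running
        std::this_thread::yield();
    loop.reset();                   // destructor: quit() then wait, rather than assert
    t.join();
    return 0;
}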

I hope you have the opportunity to consider these changes. It is quite possible you will want to implement them differently, or that other (related) changes should be made as a result of these patches. I'm interested in any feedback you can offer, and I hope you'll be able to integrate these changes (in some form) into the master branch soon. Please let me know if you have any questions about my implementation.

Thanks for maintaining and making this framework available to all!

gschmottlach-xse · Jan 14 '22 22:01

@gschmottlach-xse Thanks so much for your investigation. Would you like to make a PR to the trantor repo?

an-tao · Jan 15 '22 03:01

I have made two PRs for the trantor repo. One PR addresses the TcpClient issue, where removeConnection() is called after the TcpClient has already been destroyed. The second PR contains the EventLoop destructor fix, so the destructor waits (indefinitely) for the loop to exit instead of asserting immediately. Please review and make any changes you deem necessary. I look forward to these PRs being merged into Trantor/Drogon master.

Thanks!

gschmottlach-xse · Jan 17 '22 16:01

@gschmottlach-xse Thanks so much for your patch. I'll take some time to review.

an-tao · Jan 18 '22 01:01

Any progress towards reviewing and integrating my Trantor PR and ultimately pulling it into Drogon?

gschmottlach-xse · Jan 26 '22 13:01

Sorry for the delay in merging https://github.com/an-tao/trantor/pull/197. Please resolve the compilation error on Windows in https://github.com/an-tao/trantor/pull/198. Thanks so much.

an-tao · Jan 26 '22 15:01