redis icon indicating copy to clipboard operation
redis copied to clipboard

Dead lock if cancellation is requested while there is a PING request on the fly

Open nejati-deriv opened this issue 2 years ago • 7 comments

I'm not sure but problem seems to be here. If I am not mistaken you used close_on_run_completion flag exactly to prevent this situation but I don't know why it leads to dead lock any way.

You can reproduce it with a loop like:

for (;;)
{
    auto timer = asio::steady_timer{ executor, std::chrono::seconds{ 1 } };
    co_await (connection.async_run(endpoint, asio::use_awaitable) || timer.async_wait(asio::use_awaitable));
}

if you change timer to something shorter than ping time like 800ms it will not lead to dead lock.

nejati-deriv avatar Sep 22 '22 14:09 nejati-deriv

Can you please try what I said here https://github.com/mzimbres/aedis/issues/28#issuecomment-1255214596 and let me know if the problem persists.

mzimbres avatar Sep 22 '22 15:09 mzimbres

I can't reproduce this behaviour. I have used this code https://github.com/mzimbres/aedis/issues/28#issuecomment-1256921527 with a timeout of 1s and 2s and it keeps printing retrying. A deadlock may occur for example when Redis for any reason sends unsolicited messages and you don't consume them, can you please add a coroutine to consume them, for example

net::awaitable<void> push_receiver(std::shared_ptr<connection> db)
{
   for (std::vector<node<std::string>> resp;;) {
      co_await db->async_receive_push(adapt(resp));
      print_push(resp);
      resp.clear();
   }
}

mzimbres avatar Sep 24 '22 09:09 mzimbres

Adding more connections increases the chance:


#include <boost/asio.hpp>
#include <boost/asio/experimental/awaitable_operators.hpp>

#include <aedis.hpp>
#include <chrono>
#include <iostream>

namespace net = boost::asio;
using namespace net::experimental::awaitable_operators;
using aedis::endpoint;
using connection = aedis::connection<>;

net::awaitable< void >
reconnect()
{
    auto db1   = connection{ co_await net::this_coro::executor };
    auto db2   = connection{ co_await net::this_coro::executor };
    auto db3   = connection{ co_await net::this_coro::executor };
    auto db4   = connection{ co_await net::this_coro::executor };
    auto timer = net::steady_timer{ co_await net::this_coro::executor };
    for (int i = 0;; i++)
    {
        timer.expires_after(std::chrono::milliseconds{ 1005 }); // Note extra 5ms
        endpoint ep{ "127.0.0.1", "6379" };
        co_await (db1.async_run(ep, net::use_awaitable) || db2.async_run(ep, net::use_awaitable) ||
                  db3.async_run(ep, net::use_awaitable) || db4.async_run(ep, net::use_awaitable) ||
                  timer.async_wait(net::use_awaitable));
        std::cout << i << std::endl;
    }
}

int
main()
{
    net::io_context ioc;
    net::co_spawn(ioc, reconnect(), net::detached);
    ioc.run();
}

nejati-deriv avatar Sep 25 '22 16:09 nejati-deriv

Does this trigger a sanitizer check on your machine?

mzimbres avatar Sep 25 '22 16:09 mzimbres

Yes alloc-dealloc-mismatch and use-of-uninitialized-value with a gigantic error messages. alloc-dealloc-mismatch.txt use-of-uninitialized-value.txt

nejati-deriv avatar Sep 25 '22 17:09 nejati-deriv

I have made many improvements in cancellation in master. I can't reproduce this problem anymore, so it might have been fixed.

mzimbres avatar Oct 23 '22 20:10 mzimbres

Thanks for you work, I can still reproduce it with the following code: (delay 1 ms)

#include <boost/asio.hpp>
#include <boost/asio/experimental/awaitable_operators.hpp>

#include <aedis.hpp>
#include <chrono>
#include <iostream>

namespace net = boost::asio;
using namespace net::experimental::awaitable_operators;
using aedis::endpoint;
using connection = aedis::connection<>;

net::awaitable< void >
reconnect()
{
    auto db1   = connection{ co_await net::this_coro::executor };
    auto db2   = connection{ co_await net::this_coro::executor };
    auto db3   = connection{ co_await net::this_coro::executor };
    auto db4   = connection{ co_await net::this_coro::executor };
    auto timer = net::steady_timer{ co_await net::this_coro::executor };
    for (int i = 0;; i++)
    {
        timer.expires_after(std::chrono::milliseconds{ 1 });   // Note 1 ms
        endpoint ep{ "172.17.0.1", "6379" };
        co_await (db1.async_run(ep, {}, net::use_awaitable) || db2.async_run(ep, {}, net::use_awaitable) ||
                  db3.async_run(ep, {}, net::use_awaitable) || db4.async_run(ep, {}, net::use_awaitable) ||
                  timer.async_wait(net::use_awaitable));
        std::cout << i << std::endl;
    }
}

int
main()
{
    net::io_context ioc;
    net::co_spawn(ioc, reconnect(), net::detached);
    ioc.run();
}

nejati-deriv avatar Oct 24 '22 09:10 nejati-deriv

Thanks for insisting :) I learned many sutilities in the Asio cancellation mechanism. This commit should fix the problem https://github.com/mzimbres/aedis/commit/bac27c1770a14c47cd802eb556983d14e93df013. Let me know if it works for you.

mzimbres avatar Oct 25 '22 19:10 mzimbres

Seems that fixed the problem, thanks :partying_face:

nejati-deriv avatar Oct 26 '22 07:10 nejati-deriv