boost asio io_context run hangs in a thread
We have a server application (based on Boost 1.82) that hangs randomly after a few hours of running. I have a pool of I/O threads (workers) to handle accept, read, and write asynchronously. At random points, no connections are accepted and no data is read.
I suspected that io_context was running out of work, but even after adding a work_guard the problem persists. When monitoring code attempts to restart the listener, one of the I/O threads is stuck as shown later. The only timer I have is for idle connection close: there is a timer per client connection, and it is reset after each async_read so the connection can be closed correctly if no bytes are received for 5 minutes.
I have enabled BOOST_ASIO_ENABLE_HANDLER_TRACKING and made sure all handlers return -- no thread is in the middle of a handler function. In other words, control returns to asio; however, no epoll events are triggered (after running perfectly fine for a few hours).
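For reference, the work_guard was attached roughly like this (a sketch with illustrative names, not the exact code):
// names here are illustrative, not the real members
boost::asio::io_context ioContext;
auto workGuard = boost::asio::make_work_guard(ioContext);  // prevents run() from returning when the handler queue is momentarily empty
// worker threads each call ioContext.run()
// workGuard.reset();  // done only at shutdown so run() can exit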
All I/O threads are waiting here except for one thread that's shown later:
#0 __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x5622ec3bfe1c) at ./nptl/futex-internal.c:57
#1 __futex_abstimed_wait_common (cancel=true, private=0, abstime=0x0, clockid=0, expected=0, futex_word=0x5622ec3bfe1c) at ./nptl/futex-internal.c:87
#2 __GI___futex_abstimed_wait_cancelable64 (futex_word=futex_word@entry=0x5622ec3bfe1c, expected=expected@entry=0, clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=private@entry=0)
at ./nptl/futex-internal.c:139
#3 0x00007f37fee6aac1 in __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x5622ec3bfdb8, cond=0x5622ec3bfdf0) at ./nptl/pthread_cond_wait.c:503
#4 ___pthread_cond_wait (cond=0x5622ec3bfdf0, mutex=0x5622ec3bfdb8) at ./nptl/pthread_cond_wait.c:627
#5 0x00005622ebf49a11 in boost::asio::detail::posix_event::wait<boost::asio::detail::conditionally_enabled_mutex::scoped_lock> (this=0x5622ec3bfdf0, lock=...)
at <path>/extern/boost/linux/include/boost/asio/detail/posix_event.hpp:119
#6 0x00005622ebf4105e in boost::asio::detail::conditionally_enabled_event::wait (this=0x5622ec3bfde8, lock=...)
at <path>/extern/boost/linux/include/boost/asio/detail/conditionally_enabled_event.hpp:97
#7 0x00005622ebf4486a in boost::asio::detail::scheduler::do_run_one (this=0x5622ec3bfd80, lock=..., this_thread=..., ec=...)
at <path>/extern/boost/linux/include/boost/asio/detail/impl/scheduler.ipp:502
#8 0x00005622ebf441c3 in boost::asio::detail::scheduler::run (this=0x5622ec3bfd80, ec=...)
at <path>/extern/boost/linux/include/boost/asio/detail/impl/scheduler.ipp:210
#9 0x00005622ebf7e1f1 in boost::asio::io_context::run (this=0x7ffd4139c540) at <path>/extern/boost/linux/include/boost/asio/impl/io_context.ipp:64
#10 0x00005622ebf934f3 in boost::_mfi::mf0<unsigned long, boost::asio::io_context>::operator() (this=0x5622ec3d11d8, p=0x7ffd4139c540)
at <path>/extern/boost/linux/include/boost/bind/mem_fn_template.hpp:49
#11 0x00005622ebf93428 in boost::_bi::list1<boost::_bi::value<boost::asio::io_context*> >::operator()<unsigned long, boost::_mfi::mf0<unsigned long, boost::asio::io_context>, boost::_bi::list0> (
this=0x5622ec3d11e8, f=..., a=...) at <path>/extern/boost/linux/include/boost/bind/bind.hpp:229
#12 0x00005622ebf931ea in boost::_bi::bind_t<unsigned long, boost::_mfi::mf0<unsigned long, boost::asio::io_context>, boost::_bi::list1<boost::_bi::value<boost::asio::io_context*> > >::operator() (
this=0x5622ec3d11d8) at <path>/extern/boost/linux/include/boost/bind/bind.hpp:1274
#13 0x00005622ebf92396 in boost::detail::thread_data<boost::_bi::bind_t<unsigned long, boost::_mfi::mf0<unsigned long, boost::asio::io_context>, boost::_bi::list1<boost::_bi::value<boost::asio::io_context*> > > >::run (this=0x5622ec3d10a0) at <path>/extern/boost/linux/include/boost/thread/detail/thread.hpp:120
#14 0x00005622ebf9f0db in thread_proxy ()
#15 0x00007f37fee6bb43 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#16 0x00007f37feefda00 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
And one thread is stuck inside the timer like this:
#0 boost::asio::detail::op_queue<boost::asio::detail::wait_op>::front (this=0x7f3790001210) at <path>/extern/boost/linux/include/boost/asio/detail/op_queue.hpp:86
#1 0x00005622ebf6427f in boost::asio::detail::timer_queue<boost::asio::detail::chrono_time_traits<std::chrono::_V2::steady_clock, boost::asio::wait_traits<std::chrono::_V2::steady_clock> > >::get_ready_timers (this=0x7f37900011f8, ops=...) at <path>/extern/boost/linux/include/boost/asio/detail/timer_queue.hpp:154
#2 0x00005622ebf41a3e in boost::asio::detail::timer_queue_set::get_ready_timers (this=0x5622ec3bff60, ops=...)
at <path>/extern/boost/linux/include/boost/asio/detail/impl/timer_queue_set.ipp:88
#3 0x00005622ebf430d5 in boost::asio::detail::epoll_reactor::run (this=0x5622ec3bfee0, usec=-1, ops=...)
at <path>/extern/boost/linux/include/boost/asio/detail/impl/epoll_reactor.ipp:593
#4 0x00005622ebf44799 in boost::asio::detail::scheduler::do_run_one (this=0x5622ec3bfd80, lock=..., this_thread=..., ec=...)
at <path>/extern/boost/linux/include/boost/asio/detail/impl/scheduler.ipp:477
#5 0x00005622ebf441c3 in boost::asio::detail::scheduler::run (this=0x5622ec3bfd80, ec=...)
at <path>/extern/boost/linux/include/boost/asio/detail/impl/scheduler.ipp:210
#6 0x00005622ebf7e1f1 in boost::asio::io_context::run (this=0x7ffd4139c540) at <path>/extern/boost/linux/include/boost/asio/impl/io_context.ipp:64
#7 0x00005622ebf934f3 in boost::_mfi::mf0<unsigned long, boost::asio::io_context>::operator() (this=0x5622ec3d4e68, p=0x7ffd4139c540)
at <path>/extern/boost/linux/include/boost/bind/mem_fn_template.hpp:49
#8 0x00005622ebf93428 in boost::_bi::list1<boost::_bi::value<boost::asio::io_context*> >::operator()<unsigned long, boost::_mfi::mf0<unsigned long, boost::asio::io_context>, boost::_bi::list0> (
this=0x5622ec3d4e78, f=..., a=...) at <path>/extern/boost/linux/include/boost/bind/bind.hpp:229
#9 0x00005622ebf931ea in boost::_bi::bind_t<unsigned long, boost::_mfi::mf0<unsigned long, boost::asio::io_context>, boost::_bi::list1<boost::_bi::value<boost::asio::io_context*> > >::operator() (
this=0x5622ec3d4e68) at <path>/extern/boost/linux/include/boost/bind/bind.hpp:1274
#10 0x00005622ebf92396 in boost::detail::thread_data<boost::_bi::bind_t<unsigned long, boost::_mfi::mf0<unsigned long, boost::asio::io_context>, boost::_bi::list1<boost::_bi::value<boost::asio::io_context*> > > >::run (this=0x5622ec3d4d30) at <path>/extern/boost/linux/include/boost/thread/detail/thread.hpp:120
#11 0x00005622ebf9f0db in thread_proxy ()
#12 0x00007f37fee6bb43 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#13 0x00007f37feefda00 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
In a nutshell, I have a steady_timer per connection. Upon accepting a new connection, it is set with expires_after to 5 minutes. Then any time async_read_some invokes the on_data_received handler, the timer is restarted. If and when the timer fires, the connection is dropped, which never happens in the problem environment. Without this timer there are absolutely no issues. Also, we are unable to recreate this issue in the DEV environment or in the QA lab; it occurs only in the customer environment, which runs the same OS as DEV and QA. The customer does have a load balancer that establishes TCP connections and disconnects immediately. This was tested in the DEV environment as well -- connecting and disconnecting numerous times just like the load balancer -- but we could not recreate the issue in house. The customer, however, consistently hits this problem after running fine for a few hours.
Are you using the timer from multiple threads without synchronization?
@kiwixz Here are the snippets from the server code:
The I/O thread pool is set up like this:
for (std::size_t i = 0; i < thread_pool_size; ++i)
{
    // io_context::member here stands for the Server's io_context instance
    boost::shared_ptr<boost::thread> thread(boost::make_shared<boost::thread>(
        boost::bind(&boost::asio::io_context::run, &(io_context::member))));
    if (thread) m_ioThreads.push_back(thread);
}
Within the thread pool, new connections are accepted:
acceptor->async_accept(spAnticipatedConnection->GetSocket(),
    boost::bind(&Server::OnNewConnection, this,
        boost::asio::placeholders::error, spAnticipatedConnection, acceptor));
OnNewConnection just creates an IncomingConnection shared pointer, which issues:
m_socket.async_read_some(boost::asio::buffer(m_recvBuffer),
    boost::asio::bind_executor(m_strand,
        boost::bind(&IncomingConnection::OnDataReceived, shared_from_this(),
            boost::asio::placeholders::error,
            boost::asio::placeholders::bytes_transferred)));
Every time some bytes are received, OnDataReceived does the following to the timer:
m_deadlineTimer.expires_after(std::chrono::seconds(m_maxMaxAgentIdleSeconds));
m_deadlineTimer.async_wait(boost::asio::bind_executor(m_strand,
    boost::bind(&IncomingConnection::OnTimer, this, boost::asio::placeholders::error)));
To answer your question, there is a strand per connection, and the timer is queued through that strand as shown above.
If and when OnTimer fires, the idea is to close the connection because no data has been received for some time. That it never fires is not the problem. The problem is that asio stops dispatching to any handler: no new connections are accepted, no data is read on existing connections, naturally no timer fires to close idle connections, and even when a client closes its connection the server is unaware.
Shouldn't the timer async_wait keep a reference to the shared_ptr?
The IncomingConnection destructor and CloseConnection issue m_deadlineTimer.cancel(). Are you saying the steady_timer could continue to hold "this" (the IncomingConnection pointer) even after the IncomingConnection object has been cleaned up? m_deadlineTimer itself is a member of IncomingConnection, and m_strand is a member of IncomingConnection as well, but it uses the same io_context that the Server class created during listen/accept/thread-pool setup.
The timer's cancel() actually queues the callback to be invoked (with operation_aborted), so you can't just destroy the object out from under it.
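That points at a lifetime hazard in the snippet above: it binds the raw `this`, and after cancel() the wait handler is still invoked (with operation_aborted) and may run after the IncomingConnection has been destroyed. A sketch of the safer form, binding shared_from_this() so the queued handler keeps the object alive:
m_deadlineTimer.expires_after(std::chrono::seconds(m_maxMaxAgentIdleSeconds));
m_deadlineTimer.async_wait(boost::asio::bind_executor(m_strand,
    boost::bind(&IncomingConnection::OnTimer, shared_from_this(),   // not `this`
        boost::asio::placeholders::error)));
OnTimer then simply ignores operation_aborted, and the connection object is only released once the last queued handler has run.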
@equippedmandingo were you able to solve it?