asio
SIGSEGV on scheduler::compensating_work_started()
@cotti commented on Sep 26, 2018, 6:03 PM UTC:
We have encountered multiple times what appears to be the same issue as described in https://svn.boost.org/trac10/ticket/13562 on a server application that makes heavy use of asio to send HTTP requests and read their responses.
Our usage is fairly standard: upon getting a connection, we call boost::asio::async_write(), binding the next method, onWrite(); inside it we call boost::asio::async_read_until(), binding onRead(), which performs the next step of reading the response to our request, and so on. Our async HTTP threads are quite short-lived: in our logs their whole life cycle takes less than a second. The affected threads sit without activity for a while before they are hit with a SIGSEGV (anywhere from a few seconds to more than a minute), and the crash seems to happen between onWrite() and onRead().
We were using Boost 1.66.0 up until a couple of weeks ago. After this issue occurred a few times, we tried an upgrade to Boost 1.68.0, but if anything the frequency of the issue has increased, to almost daily.
Received signal: 11 - SIGSEGV. Segmentation violation
Thread: A-Http-4816 - 140326043744000
Stack trace:
/server: SignalHandler::getSignalInformation(int)()+0x191
/server() [0x144a973]
/lib64/libc.so.6: ()+0x35250
/server: boost::asio::detail::scheduler::compensating_work_started()()+0x20
/server: boost::asio::detail::epoll_reactor::perform_io_cleanup_on_block_exit::~perform_io_cleanup_on_block_exit()()+0x63
/server: boost::asio::detail::epoll_reactor::descriptor_state::perform_io(unsigned int)()+0x16d
/server: boost::asio::detail::epoll_reactor::descriptor_state::do_complete(void*, boost::asio::detail::scheduler_operation*, boost::system::error_code const&, unsigned long)()+0x3f
/server: boost::asio::detail::scheduler_operation::complete(void*, boost::system::error_code const&, unsigned long)()+0x32
/server: boost::asio::detail::scheduler::do_run_one(boost::asio::detail::conditionally_enabled_mutex::scoped_lock&, boost::asio::detail::scheduler_thread_info&, boost::system::error_code const&)()+0x1a8
/server: boost::asio::detail::scheduler::run(boost::system::error_code&)()+0x10e
/server: boost::asio::io_context::run()()+0x2f
/server: boost::_mfi::mf0<unsigned long, boost::asio::io_context>::operator()(boost::asio::io_context*) const()+0x65
/server: unsigned long boost::_bi::list1<boost::_bi::value<boost::asio::io_context*> >::operator()<unsigned long, boost::_mfi::mf0<unsigned long, boost::asio::io_context>, boost::_bi::list0>(boost::_bi::type<unsigned long>, boost::_mfi::mf0<unsigned long, boost::asio::io_context>&, boost::_bi::list0&, long)()+0x4b
/server: boost::_bi::bind_t<unsigned long, boost::_mfi::mf0<unsigned long, boost::asio::io_context>, boost::_bi::list1<boost::_bi::value<boost::asio::io_context*> > >::operator()()()+0x39
/server: unsigned long std::_Bind_simple<boost::_bi::bind_t<unsigned long, boost::_mfi::mf0<unsigned long, boost::asio::io_context>, boost::_bi::list1<boost::_bi::value<boost::asio::io_context*> > > ()>::_M_invoke<>(std::_Index_tuple<>)()+0x28
/server: std::_Bind_simple<boost::_bi::bind_t<unsigned long, boost::_mfi::mf0<unsigned long, boost::asio::io_context>, boost::_bi::list1<boost::_bi::value<boost::asio::io_context*> > > ()>::operator()()()+0x1b
/server: std::thread::_Impl<std::_Bind_simple<boost::_bi::bind_t<unsigned long, boost::_mfi::mf0<unsigned long, boost::asio::io_context>, boost::_bi::list1<boost::_bi::value<boost::asio::io_context*> > > ()> >::_M_run()()+0x1c
/lib64/libstdc++.so.6: ()+0xb5230
/lib64/libpthread.so.0: ()+0x7dc5
/lib64/libc.so.6: clone()+0x6d
This issue was moved by chriskohlhoff from boostorg/asio#150.
@markand commented on Oct 31, 2018, 12:27 PM UTC:
Hi, I'm affected by this bug as well.
Building my application with BOOST_ASIO_DISABLE_EPOLL works fine; otherwise, I get random crashes exactly as cotti does.
It happens on Arch Linux, Boost 1.68, Linux 4.18.
@oscarfv commented on Mar 25, 2019, 7:42 PM UTC:
On an internal project I can always reproduce the problem. Defining BOOST_ASIO_DISABLE_EPOLL does not fix the crash here.
In my code, the calls to asio are inside wrapper functions compiled in a shared library. If I move those calls to the main executable, there is no crash.
@joshedwards22 commented on Mar 2, 2020, 5:00 PM UTC:
Any update on this? I have encountered this when sharing an io_context with runtime-loaded shared libraries on Linux. It appears that thread_call_stack::contains(this) depends on a static variable that does not exist across the module boundary. compensating_work_started() seems to be called at random times, and I cannot figure out why.
@djarek commented on Mar 2, 2020, 9:41 PM UTC:
joshedwards22 Exposing ASIO classes across ABI boundaries is a bad idea - ASIO doesn't make any ABI stability guarantees AFAIK.
@lverbe commented on Mar 12, 2020, 10:15 PM UTC:
This patch fixes it for Boost 1.72.0.
--- ./boost/asio/detail/impl/scheduler.ipp.orig 2020-03-12 11:00:06.823085227 -0600
+++ ./boost/asio/detail/impl/scheduler.ipp 2020-03-12 11:01:30.898891690 -0600
@@ -317,8 +317,8 @@ void scheduler::restart()
 void scheduler::compensating_work_started()
 {
-  thread_info_base* this_thread = thread_call_stack::contains(this);
-  ++static_cast<thread_info*>(this_thread)->private_outstanding_work;
+  if (thread_info_base* this_thread = thread_call_stack::contains(this))
+    ++static_cast<thread_info*>(this_thread)->private_outstanding_work;
 }
 void scheduler::post_immediate_completion(
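To make the effect of the guard concrete, here is a minimal sketch using only the standard library. It is a model, not asio's code: `thread_info_model`, `top_model`, `enter_run_model`, `leave_run_model` and `compensating_work_started_model` are hypothetical names invented for illustration. It assumes the lookup can legitimately return a null pointer on a thread that has not registered itself, which is the situation the patch guards against.

```cpp
#include <cassert>

// Simplified stand-in for asio's per-thread bookkeeping; not the real type.
struct thread_info_model {
  long private_outstanding_work = 0;
};

// One slot per thread, standing in for the call stack that
// thread_call_stack::contains() searches.
thread_local thread_info_model* top_model = nullptr;

// Called by a thread when it starts running the (modelled) scheduler.
void enter_run_model(thread_info_model* info) {
  assert(info != nullptr);
  top_model = info;
}

// Called when the thread leaves the (modelled) scheduler's run loop.
void leave_run_model() { top_model = nullptr; }

// Same shape as the patched function: increment only if the lookup succeeded.
// Returns whether the counter was incremented.
bool compensating_work_started_model() {
  if (thread_info_model* this_thread = top_model) {
    ++this_thread->private_outstanding_work;
    return true;
  }
  return false;  // the unpatched shape would dereference a null pointer here
}
```

Note that the guard only avoids the null dereference. When the lookup fails, the work counter is simply not incremented, and this thread does not establish whether that is safe for the scheduler's work accounting.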
@dillaman commented on Nov 9, 2020, 1:35 PM UTC:
I am also encountering this issue when trying to use boost::asio from within shared libraries. In the Ceph project, we are trying to incorporate boost::asio in our client-side libraries librados and librbd, but that results in each shared library's bss section getting its own boost::asio::detail::call_stack<boost::asio::detail::thread_context, boost::asio::detail::thread_info_base>::top_ and therefore we randomly hit this crash.
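The duplicated-variable effect described here can be imitated in a single program. The sketch below is a hypothetical model: the `ModuleTag` template parameter stands in for "which shared library this header-only code was compiled into", so each tag gets its own `top_`, much as each library would get its own copy. All names (`call_stack_model`, `main_executable`, `plugin_library`, and the two helper functions) are invented for illustration. Whether real builds actually duplicate the symbol depends on factors such as symbol visibility and how the library is loaded, which this thread does not pin down.

```cpp
#include <cassert>

// Hypothetical model of a header-only call stack with a per-thread static.
template <typename ModuleTag>
struct call_stack_model {
  static thread_local const void* top_;

  static void push(const void* key) {
    assert(key != nullptr);
    top_ = key;
  }
  static void pop() { top_ = nullptr; }
  static bool contains(const void* key) { return top_ == key; }
};

template <typename ModuleTag>
thread_local const void* call_stack_model<ModuleTag>::top_ = nullptr;

struct main_executable {};  // code compiled into the main program
struct plugin_library {};   // code compiled into a shared library

// run() was entered through the main executable's copy, but the lookup
// happens in code that lives in the plugin and consults the plugin's copy.
bool lookup_across_modules(const void* scheduler) {
  call_stack_model<main_executable>::push(scheduler);
  bool found = call_stack_model<plugin_library>::contains(scheduler);
  call_stack_model<main_executable>::pop();
  return found;
}

// Both the registration and the lookup use the same copy.
bool lookup_within_one_module(const void* scheduler) {
  call_stack_model<main_executable>::push(scheduler);
  bool found = call_stack_model<main_executable>::contains(scheduler);
  call_stack_model<main_executable>::pop();
  return found;
}
```

In this model the cross-module lookup fails even though the scheduler was registered on the same thread, which matches the null `top_` reported in this thread.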
@rpopescu commented on Dec 18, 2020, 1:47 PM UTC:
This patch fixes it for Boost 1.72.0.
--- ./boost/asio/detail/impl/scheduler.ipp.orig 2020-03-12 11:00:06.823085227 -0600
+++ ./boost/asio/detail/impl/scheduler.ipp 2020-03-12 11:01:30.898891690 -0600
@@ -317,8 +317,8 @@ void scheduler::restart()
 void scheduler::compensating_work_started()
 {
-  thread_info_base* this_thread = thread_call_stack::contains(this);
-  ++static_cast<thread_info*>(this_thread)->private_outstanding_work;
+  if (thread_info_base* this_thread = thread_call_stack::contains(this))
+    ++static_cast<thread_info*>(this_thread)->private_outstanding_work;
 }
 void scheduler::post_immediate_completion(
I'm really surprised to see this patch not being applied; is there a reason for this?
@oscarfv commented on Dec 18, 2020, 2:30 PM UTC:
lverbe , rpopescu : maybe chriskohlhoff does not monitor this issue tracker.
@rpopescu commented on Dec 18, 2020, 4:33 PM UTC:
oscarfv do you know what he does monitor? The Trac ticket seems to be 3 years old: https://svn.boost.org/trac10/ticket/13562 Thanks.
@oscarfv commented on Dec 18, 2020, 5:09 PM UTC:
rpopescu : no idea. It seems that he keeps working on https://think-async.com/Asio/ and its Boost incarnation, but I see no way of contacting him on that webpage. Let's wait a bit, maybe he notices the mention in my prior message.
So, for us this happens when io_context is passed across a shared library boundary. The thing is, boost::asio seems to be a header-only library, and thread_call_stack::contains() seems to rely on a static/global variable. So, it seems like this is a case of code in different shared libraries getting different instances of a global variable?
Last week I ran into the same issue, and I've found that it happens when there is a shared library involved and there is incoming data from a connected client without an active async_read.
The issue can be reproduced by running the attached asio_segfault code. asio_segfault.zip
The plugin code wraps the asio TCP async echo server example in a shared library. The echo_server.h code is modified to not call do_read after the data has been returned to the client. The main code creates an io_context and passes it to the server created by the plugin.
When this code is executed, the server waits for a connection and accepts the first line of text. The server echoes the received text and, with the modifications made to echo_server.h, does not wait for new data. When the client sends a new line of text, the application crashes with a segmentation fault. With BOOST_ASIO_DISABLE_EPOLL set, the code keeps on running.
Information on our setup:
- ARM32
- GCC 9.3.0
- Boost 1.72
@pvd : setting BOOST_ASIO_DISABLE_EPOLL makes things worse here (without it, some test cases succeed; with it, all test cases fail). Debian Bookworm with Clang 15, Boost 1.78.
I think @peat-psuwit pinpointed the problem: using static variables in a header-only library.
@oscarfv We have been running with EPOLL disabled this week, with no segfaults so far. After we switched EPOLL off, a bug was found in our code: it was using an async_write with a temporary buffer that got deleted before the write was executed. This has now been fixed by replacing the async_write with a sync write.
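For readers hitting the same buffer-lifetime bug, a common alternative to switching to a synchronous write is to let the completion handler own the buffer. The sketch below illustrates the idea with the standard library only. `deferred_writer_model` and `send_request` are hypothetical stand-ins invented for this example: the writer records a non-owning view of the data and performs the "write" later, as an asynchronous operation would. With Boost.Asio the analogous step would be capturing a `std::shared_ptr` to the buffer in the handler passed to `async_write`, since the caller must keep the buffer valid until the handler is invoked.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for an asynchronous writer: start_write() only stores
// a non-owning pointer to the data, and run() performs the write later.
class deferred_writer_model {
 public:
  void start_write(const char* data, std::size_t size,
                   std::function<void()> on_done) {
    assert(data != nullptr || size == 0);
    pending_.push_back({data, size, std::move(on_done)});
  }

  // Performs every pending write, then invokes and destroys its handler.
  void run() {
    std::vector<operation> ready;
    ready.swap(pending_);
    for (auto& current : ready) {
      written_.append(current.data, current.size);
      current.on_done();
    }
  }

  const std::string& written() const { return written_; }

 private:
  struct operation {
    const char* data;
    std::size_t size;
    std::function<void()> on_done;
  };
  std::vector<operation> pending_;
  std::string written_;
};

// The buffer is owned by a shared_ptr that the handler captures, so it stays
// alive until the deferred write has run, even though this function returns
// immediately.
void send_request(deferred_writer_model& writer, const std::string& text) {
  auto buffer = std::make_shared<std::string>(text);
  writer.start_write(buffer->data(), buffer->size(),
                     [buffer] { /* buffer is released with the handler */ });
}
```

Had `send_request` used a local `std::string` instead, the stored pointer would dangle by the time `run()` executes, which is the same class of bug described in the comment above.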
I correct my previous claim about BOOST_ASIO_DISABLE_EPOLL not fixing the crash. Indeed, the crash goes away. Thanks @pvd and sorry for the noise.
We also ran into this issue (using asio 1.20.0), any news on this? @chriskohlhoff
Facing similar issues, also on ARM. Anything we can provide to help debugging?
This still seems to be a problem on the latest version of Asio, even with DISABLE_EPOLL. I'm experiencing a segmentation fault due to a null asio::call_stack<...>::top_ when using Asio across shared libraries. #780 appears to be very similar or the same issue.