asio SIGSEGV on scheduler::compensating_work_started()

@cotti commented on Sep 26, 2018, 6:03 PM UTC:

We have encountered multiple times what appears to be the same issue as described in https://svn.boost.org/trac10/ticket/13562 on a server application that makes heavy use of asio to send HTTP requests and read their responses.

Our usage is fairly standard: upon getting a connection, we call boost::asio::async_write(), binding the next method, onWrite(). Inside onWrite() we call boost::asio::async_read_until(), binding onRead(), which performs the next step of reading the response to our request, and so on. Our Async HTTP threads are quite short-lived: in our logs their whole life cycle takes less than a second. The threads that crash sit without activity for some time before they are hit with a SIGSEGV, anywhere from a few seconds to more than a minute, and the crash seems to happen between onWrite() and onRead().

We were using boost 1.66.0 up until a couple of weeks ago. After this issue occurred a few times, we tried an upgrade to boost 1.68.0, but if anything the frequency of the crash has increased, to almost daily.

        Received signal: 11 - SIGSEGV. Segmentation violation
        Thread: A-Http-4816 - 140326043744000
        Stack trace:
          /server: SignalHandler::getSignalInformation(int)()+0x191
          /server() [0x144a973]
          /lib64/libc.so.6: ()+0x35250
          /server: boost::asio::detail::scheduler::compensating_work_started()()+0x20
          /server: boost::asio::detail::epoll_reactor::perform_io_cleanup_on_block_exit::~perform_io_cleanup_on_block_exit()()+0x63
          /server: boost::asio::detail::epoll_reactor::descriptor_state::perform_io(unsigned int)()+0x16d
          /server: boost::asio::detail::epoll_reactor::descriptor_state::do_complete(void*, boost::asio::detail::scheduler_operation*, boost::system::error_code const&, unsigned long)()+0x3f
          /server: boost::asio::detail::scheduler_operation::complete(void*, boost::system::error_code const&, unsigned long)()+0x32
          /server: boost::asio::detail::scheduler::do_run_one(boost::asio::detail::conditionally_enabled_mutex::scoped_lock&, boost::asio::detail::scheduler_thread_info&, boost::system::error_code const&)()+0x1a8
          /server: boost::asio::detail::scheduler::run(boost::system::error_code&)()+0x10e
          /server: boost::asio::io_context::run()()+0x2f
          /server: boost::_mfi::mf0<unsigned long, boost::asio::io_context>::operator()(boost::asio::io_context*) const()+0x65
          /server: unsigned long boost::_bi::list1<boost::_bi::value<boost::asio::io_context*> >::operator()<unsigned long, boost::_mfi::mf0<unsigned long, boost::asio::io_context>, boost::_bi::list0>(boost::_bi::type<unsigned long>, boost::_mfi::mf0<unsigned long, boost::asio::io_context>&, boost::_bi::list0&, long)()+0x4b
          /server: boost::_bi::bind_t<unsigned long, boost::_mfi::mf0<unsigned long, boost::asio::io_context>, boost::_bi::list1<boost::_bi::value<boost::asio::io_context*> > >::operator()()()+0x39
          /server: unsigned long std::_Bind_simple<boost::_bi::bind_t<unsigned long, boost::_mfi::mf0<unsigned long, boost::asio::io_context>, boost::_bi::list1<boost::_bi::value<boost::asio::io_context*> > > ()>::_M_invoke<>(std::_Index_tuple<>)()+0x28
          /server: std::_Bind_simple<boost::_bi::bind_t<unsigned long, boost::_mfi::mf0<unsigned long, boost::asio::io_context>, boost::_bi::list1<boost::_bi::value<boost::asio::io_context*> > > ()>::operator()()()+0x1b
          /server: std::thread::_Impl<std::_Bind_simple<boost::_bi::bind_t<unsigned long, boost::_mfi::mf0<unsigned long, boost::asio::io_context>, boost::_bi::list1<boost::_bi::value<boost::asio::io_context*> > > ()> >::_M_run()()+0x1c
          /lib64/libstdc++.so.6: ()+0xb5230
          /lib64/libpthread.so.0: ()+0x7dc5
          /lib64/libc.so.6: clone()+0x6d

This issue was moved by chriskohlhoff from boostorg/asio#150.

Dec 30 '20 00:12 ghost

@markand commented on Oct 31, 2018, 12:27 PM UTC:

Hi, I'm affected by this bug as well.

Building my application with BOOST_ASIO_DISABLE_EPOLL works fine; otherwise, I get random crashes exactly like cotti's.

It happens on Arch Linux, boost 1.68, linux 4.18.

Dec 30 '20 00:12 ghost

@oscarfv commented on Mar 25, 2019, 7:42 PM UTC:

On an internal project I can reproduce the problem always. Defining BOOST_ASIO_DISABLE_EPOLL does not fix the crash here.

In my code, the calls to asio are inside wrapper functions compiled in a shared library. If I move those calls to the main executable, there is no crash.

Dec 30 '20 00:12 ghost

@joshedwards22 commented on Mar 2, 2020, 5:00 PM UTC:

Any update on this? I have encountered this when sharing an io_context with runtime-loaded shared libraries on Linux. It appears that "thread_call_stack::contains(this)" depends on a static variable that is not shared across the module boundary. "compensating_work_started" seems to be called at random times, and I cannot figure out why.

Dec 30 '20 00:12 ghost

@djarek commented on Mar 2, 2020, 9:41 PM UTC:

joshedwards22 Exposing ASIO classes across ABI boundaries is a bad idea - ASIO doesn't make any ABI stability guarantees AFAIK.

Dec 30 '20 00:12 ghost

@lverbe commented on Mar 12, 2020, 10:15 PM UTC:

This patch fixes it for Boost 1.72.0.

--- ./boost/asio/detail/impl/scheduler.ipp.orig	2020-03-12 11:00:06.823085227 -0600
+++ ./boost/asio/detail/impl/scheduler.ipp	2020-03-12 11:01:30.898891690 -0600
@@ -317,8 +317,8 @@ void scheduler::restart()

 void scheduler::compensating_work_started()
 {
-  thread_info_base* this_thread = thread_call_stack::contains(this);
-  ++static_cast<thread_info*>(this_thread)->private_outstanding_work;
+  if (thread_info_base* this_thread = thread_call_stack::contains(this))
+    ++static_cast<thread_info*>(this_thread)->private_outstanding_work;
 }

 void scheduler::post_immediate_completion(
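
What the guard changes can be illustrated with a toy model. This is a sketch only, not Asio's code: `thread_info_model`, `compensating_work_started_model` and `demo` are hypothetical names. When the lookup returns null, the guarded version skips the increment instead of dereferencing a null pointer. Note that in that case the work is simply not counted, so the guard avoids the crash without explaining why the lookup failed.

```cpp
#include <cassert>

// Hypothetical stand-in for the per-thread bookkeeping structure.
struct thread_info_model {
  long private_outstanding_work = 0;
};

// Mirrors the patched logic: increment only when the lookup found an entry.
// Returns true if the work was counted, false if the lookup came back null.
bool compensating_work_started_model(thread_info_model* this_thread) {
  if (this_thread) {
    ++this_thread->private_outstanding_work;
    return true;
  }
  return false;
}

// Exercises both paths and returns the final counter value.
long demo() {
  thread_info_model info;
  bool counted = compensating_work_started_model(&info);
  assert(counted);   // normal path: entry found, counter incremented
  bool skipped = !compensating_work_started_model(nullptr);
  assert(skipped);   // null lookup: no dereference, nothing counted
  return info.private_outstanding_work;
}
```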

Dec 30 '20 01:12 ghost

@dillaman commented on Nov 9, 2020, 1:35 PM UTC:

I am also encountering this issue when trying to use boost::asio from within shared libraries. In the Ceph project, we are trying to incorporate boost::asio in our client-side libraries librados and librbd, but that results in each shared library's bss section getting its own boost::asio::detail::call_stack<boost::asio::detail::thread_context, boost::asio::detail::thread_info_base>::top_ and therefore we randomly hit this crash.
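
A minimal sketch of that effect, under stated assumptions: this is a simplified model, not Asio's implementation, and the `ModuleTag` parameter, `main_executable`, `plugin_library` and `plugin_sees_scheduler` are hypothetical names. The tag simulates each shared library ending up with its own copy of the static `top_`, so an entry pushed through one copy is invisible through the other and the lookup returns null.

```cpp
#include <cassert>

// Toy per-thread call-stack marker. ModuleTag is hypothetical: it simulates
// each shared library getting its own instance of the static top_ pointer.
template <typename ModuleTag>
class call_stack_model {
public:
  struct frame {
    const void* key;
    int* value;
    frame* next;
  };

  // RAII helper: pushes a frame on construction, pops it on destruction.
  class context {
  public:
    context(const void* key, int& value)
      : frame_{key, &value, top_} { top_ = &frame_; }
    ~context() { top_ = frame_.next; }
    context(const context&) = delete;
    context& operator=(const context&) = delete;
  private:
    frame frame_;
  };

  // Returns the value registered for key on this thread, or nullptr.
  static int* contains(const void* key) {
    for (frame* f = top_; f != nullptr; f = f->next)
      if (f->key == key) return f->value;
    return nullptr;
  }

private:
  static thread_local frame* top_;
};

template <typename ModuleTag>
thread_local typename call_stack_model<ModuleTag>::frame*
    call_stack_model<ModuleTag>::top_ = nullptr;

struct main_executable {};  // hypothetical tag: code in the main program
struct plugin_library {};   // hypothetical tag: code in a shared library

// Registers a scheduler through the main executable's copy, then looks it
// up through the plugin's copy. Returns true if the plugin can see it.
bool plugin_sees_scheduler() {
  int scheduler = 0;    // stands in for the scheduler object
  int outstanding = 0;  // stands in for the per-thread bookkeeping
  call_stack_model<main_executable>::context ctx(&scheduler, outstanding);
  assert(call_stack_model<main_executable>::contains(&scheduler) == &outstanding);
  return call_stack_model<plugin_library>::contains(&scheduler) != nullptr;
}
```

In this model `plugin_sees_scheduler()` returns false, which corresponds to the null pointer that an unguarded caller would then dereference.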

Dec 30 '20 01:12 ghost

@rpopescu commented on Dec 18, 2020, 1:47 PM UTC:

This patch fixes it for Boost 1.72.0.

--- ./boost/asio/detail/impl/scheduler.ipp.orig	2020-03-12 11:00:06.823085227 -0600
+++ ./boost/asio/detail/impl/scheduler.ipp	2020-03-12 11:01:30.898891690 -0600
@@ -317,8 +317,8 @@ void scheduler::restart()
 
 void scheduler::compensating_work_started()
 {
-  thread_info_base* this_thread = thread_call_stack::contains(this);
-  ++static_cast<thread_info*>(this_thread)->private_outstanding_work;
+  if (thread_info_base* this_thread = thread_call_stack::contains(this))
+    ++static_cast<thread_info*>(this_thread)->private_outstanding_work;
 }
 
 void scheduler::post_immediate_completion(

I'm really surprised to see this patch not being applied; is there a reason for this?

Dec 30 '20 01:12 ghost

@oscarfv commented on Dec 18, 2020, 2:30 PM UTC:

lverbe , rpopescu : maybe chriskohlhoff does not monitor this issue tracker.

Dec 30 '20 01:12 ghost

@rpopescu commented on Dec 18, 2020, 4:33 PM UTC:

oscarfv do you know what he does monitor? the trac ticket is 3 years old it seems: https://svn.boost.org/trac10/ticket/13562 thanks.

Dec 30 '20 01:12 ghost

@oscarfv commented on Dec 18, 2020, 5:09 PM UTC:

rpopescu : no idea. It seems that he keeps working on https://think-async.com/Asio/ and its Boost incarnation, but I see no way of contacting him on that webpage. Let's wait a bit; maybe he notices the mention in my prior message.

Dec 30 '20 01:12 ghost

So, for us this happens when an io_context is passed across a shared library boundary. The thing is, boost::asio seems to be header-only, and thread_call_stack::contains() seems to rely on a static/global variable. So it looks like code in different shared libraries gets different instances of that global variable?

Dec 29 '21 16:12 peat-psuwit

Last week I ran into the same issue, and I've found that it happens when a shared library is involved and data arrives from a connected client while no async_read is active.

The issue can be reproduced by running the attached asio_segfault code. asio_segfault.zip

The plugin code wraps the asio tcp async echo server example in a shared library. The echo_server.h code is modified to not call do_read after the data has been returned to the client. The main code creates an io_context and passes it to the server created by the plugin.

When this code is executed, the server waits for a connection and accepts the first line of text. The server echoes the received text and, with the modifications made to echo_server.h, does not wait for new data. When the client sends a new line of text, the application crashes with a segmentation fault. With BOOST_ASIO_DISABLE_EPOLL set, the code keeps running.

Information on our setup:

ARM32
GCC 9.3.0
Boost 1.72

Sep 05 '22 09:09 pvd

@pvd : setting BOOST_ASIO_DISABLE_EPOLL makes things worse here (without it, some test cases succeed; with it, all test cases fail). This is on Debian Bookworm with Clang 15 and Boost 1.78.

I think @peat-psuwit pinpointed the problem: using static variables in a header-only library.

Sep 05 '22 14:09 oscarfv

@oscarfv We have been running with EPOLL disabled for this week, with no segfaults so far. After we switched EPOLL off, a bug in our code was found: the code was calling async_write with a temporary buffer that got deleted before the write was executed. This has now been fixed by replacing the async_write with a synchronous write.
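
For reference, a common alternative to switching to a synchronous write is to keep the buffer alive until the operation runs, for example by capturing a shared_ptr in the handler. The sketch below models this without Asio, so everything in it is an assumption for illustration: `deferred_queue_model`, `async_write_model` and `demo_deferred_write` are hypothetical names, and the queue merely stands in for an event loop that executes operations later.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-in for an event loop: operations are queued now and
// executed later, the way asynchronous writes are.
class deferred_queue_model {
public:
  void post(std::function<void()> op) { ops_.push_back(std::move(op)); }
  void run() {
    while (!ops_.empty()) {
      auto op = std::move(ops_.front());
      ops_.pop_front();
      op();
    }
  }
private:
  std::deque<std::function<void()>> ops_;
};

// Queues a "write" of payload; its bytes are appended to sink when the
// queue runs. The shared_ptr captured by the lambda owns the buffer, so it
// outlives the caller's temporary. sink must outlive the queue's run().
void async_write_model(deferred_queue_model& queue, std::string payload,
                       std::string& sink) {
  auto buffer = std::make_shared<std::string>(std::move(payload));
  queue.post([buffer, &sink] { sink += *buffer; });
}

// The caller's temporary is destroyed before the queue runs, yet the
// deferred write still sees valid data.
std::string demo_deferred_write() {
  deferred_queue_model queue;
  std::string sink;
  {
    std::string temp = "GET / HTTP/1.1\r\n";
    async_write_model(queue, temp, sink);
  }  // temp is gone here; the queued operation holds its own copy
  assert(sink.empty());  // nothing has been written yet
  queue.run();
  return sink;
}
```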

Sep 09 '22 09:09 pvd

I correct my previous claim about BOOST_ASIO_DISABLE_EPOLL not fixing the crash. Indeed, the crash goes away. Thanks @pvd and sorry for the noise.

Sep 11 '22 00:09 oscarfv

We also ran into this issue (using asio 1.20.0), any news on this? @chriskohlhoff

Nov 30 '22 09:11 KerstinKeller

Facing similar issues, also on ARM. Anything we can provide to help debugging?

Nov 11 '23 00:11 zahmad-procentec

This still seems to be a problem on the latest version of Asio, even with DISABLE_EPOLL. I'm experiencing a segmentation fault due to a null asio::call_stack<...>::top_ when using Asio across shared libraries. #780 appears to be very similar or the same issue.

Mar 27 '24 00:03 cbrl
