
test crash at async_auto_reset_event_tests.cpp:105/TEST_CASE("multi-threaded")

Open · yfinkelstein opened this issue 5 years ago · 7 comments

On macOS, using Clang 7:

$ clang -v
clang version 7.0.0 (tags/RELEASE_700/final)
Target: x86_64-apple-darwin18.2.0
Thread model: posix
InstalledDir: /usr/local/opt/llvm/bin

The test named in the title crashes in what appears to be infinite recursion:

...
    frame #2449: 0x000000010028c134 cppcoro_tests`cppcoro::async_auto_reset_event::resume_waiters(this=0x0000000101803640, initialState=81604378625) const at async_auto_reset_event.cpp:220
    frame #2450: 0x000000010028bb14 cppcoro_tests`cppcoro::async_auto_reset_event::set(this=0x0000000101803640) at async_auto_reset_event.cpp:98
    frame #2451: 0x0000000100016ec5 cppcoro_tests`_DOCTEST_ANON_FUNC_8()::$_4::operator()() const::'lambda'()::operator()() const at async_auto_reset_event_tests.cpp:105
    frame #2452: 0x000000010028c13f cppcoro_tests`cppcoro::async_auto_reset_event::resume_waiters(unsigned long long) const [inlined] std::experimental::coroutines_v1::coroutine_handle<void>::resume(this=0x000000010129dd08) at coroutine:123
    frame #2453: 0x000000010028c134 cppcoro_tests`cppcoro::async_auto_reset_event::resume_waiters(this=0x0000000101803640, initialState=73014444033) const at async_auto_reset_event.cpp:220
    frame #2454: 0x000000010028bb14 cppcoro_tests`cppcoro::async_auto_reset_event::set(this=0x0000000101803640) at async_auto_reset_event.cpp:98
    frame #2455: 0x0000000100016ec5 cppcoro_tests`_DOCTEST_ANON_FUNC_8()::$_4::operator()() const::'lambda'()::operator()() const at async_auto_reset_event_tests.cpp:105
    frame #2456: 0x000000010028c13f cppcoro_tests`cppcoro::async_auto_reset_event::resume_waiters(unsigned long long) const [inlined] std::experimental::coroutines_v1::coroutine_handle<void>::resume(this=0x000000010129dbe8) at coroutine:123
    frame #2457: 0x000000010028c134 cppcoro_tests`cppcoro::async_auto_reset_event::resume_waiters(this=0x0000000101803640, initialState=4294967297) const at async_auto_reset_event.cpp:220
    frame #2458: 0x000000010028bb14 cppcoro_tests`cppcoro::async_auto_reset_event::set(this=0x0000000101803640) at async_auto_reset_event.cpp:98
    frame #2459: 0x0000000100016ec5 cppcoro_tests`_DOCTEST_ANON_FUNC_8()::$_4::operator()() const::'lambda'()::operator()() const at async_auto_reset_event_tests.cpp:105
    frame #2460: 0x000000010028c13f cppcoro_tests`cppcoro::async_auto_reset_event::resume_waiters(unsigned long long) const [inlined] std::experimental::coroutines_v1::coroutine_handle<void>::resume(this=0x000000010129dac8) at coroutine:123
    frame #2461: 0x000000010028c134 cppcoro_tests`cppcoro::async_auto_reset_event::resume_waiters(this=0x0000000101803640, initialState=4294967297) const at async_auto_reset_event.cpp:220
    frame #2462: 0x000000010028bb14 cppcoro_tests`cppcoro::async_auto_reset_event::set(this=0x0000000101803640) at async_auto_reset_event.cpp:98
    frame #2463: 0x0000000100016ec5 cppcoro_tests`_DOCTEST_ANON_FUNC_8()::$_4::operator()() const::'lambda'()::operator()() const at async_auto_reset_event_tests.cpp:105
    frame #2464: 0x000000010028c13f cppcoro_tests`cppcoro::async_auto_reset_event::resume_waiters(unsigned long long) const [inlined] std::experimental::coroutines_v1::coroutine_handle<void>::resume(this=0x000000010129d9a8) at coroutine:123
    frame #2465: 0x000000010028c134 cppcoro_tests`cppcoro::async_auto_reset_event::resume_waiters(this=0x0000000101803640, initialState=4294967297) const at async_auto_reset_event.cpp:220
    frame #2466: 0x000000010028bb14 cppcoro_tests`cppcoro::async_auto_reset_event::set(this=0x0000000101803640) at async_auto_reset_event.cpp:98
    frame #2467: 0x0000000100016ec5 cppcoro_tests`_DOCTEST_ANON_FUNC_8()::$_4::operator()() const::'lambda'()::operator()() const at async_auto_reset_event_tests.cpp:105
    frame #2468: 0x000000010029dbe0 cppcoro_tests`cppcoro::static_thread_pool::run_worker_thread(unsigned int) [inlined] std::experimental::coroutines_v1::coroutine_handle<void>::resume(this=0x000000010129d738) at coroutine:123
    frame #2469: 0x000000010029dbd5 cppcoro_tests`cppcoro::static_thread_pool::run_worker_thread(this=0x00007ffeefbfda50, threadIndex=2) at static_thread_pool.cpp:410
    frame #2470: 0x00000001002a45ae cppcoro_tests`cppcoro::static_thread_pool::static_thread_pool(this=0x00000001012041e8)::$_0::operator()() const at static_thread_pool.cpp:346
    frame #2471: 0x00000001002a4351 cppcoro_tests`void* std::__1::__thread_proxy<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, cppcoro::static_thread_pool::static_thread_pool(unsigned int)::$_0> >(void*) [inlined] decltype(__f=0x00000001012041e8)::$_0>(fp)(std::__1::forward<>(fp0))) std::__1::__invoke<cppcoro::static_thread_pool::static_thread_pool(unsigned int)::$_0>(cppcoro::static_thread_pool::static_thread_pool(unsigned int)::$_0&&) at type_traits:4345
    frame #2472: 0x00000001002a4340 cppcoro_tests`void* std::__1::__thread_proxy<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, cppcoro::static_thread_pool::static_thread_pool(unsigned int)::$_0> >(void*) [inlined] void std::__1::__thread_execute<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, cppcoro::static_thread_pool::static_thread_pool(unsigned int)::$_0>(__t=size=2)::$_0>&, std::__1::__tuple_indices<>) at thread:342
    frame #2473: 0x00000001002a431c cppcoro_tests`void* std::__1::__thread_proxy<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, cppcoro::static_thread_pool::static_thread_pool(unsigned int)::$_0> >(__vp=0x00000001012041e0) at thread:352
    frame #2474: 0x00007fff72a21305 libsystem_pthread.dylib`_pthread_body + 126
    frame #2475: 0x00007fff72a2426f libsystem_pthread.dylib`_pthread_start + 70
    frame #2476: 0x00007fff72a20415 libsystem_pthread.dylib`thread_start + 13

yfinkelstein · Dec 23 '18

In lldb:

Process 98425 stopped
* thread #2, stop reason = EXC_BAD_ACCESS (code=2, address=0x700001fa6fd0)
    frame #0: 0x000000010028c708 cppcoro_tests`(anonymous namespace)::local::get_resumable_waiter_count(state=<unavailable>) at async_auto_reset_event.cpp:33
   30  			}
   31
   32  			constexpr std::uint32_t get_resumable_waiter_count(std::uint64_t state)
-> 33  			{
   34  				return std::min(get_set_count(state), get_waiter_count(state));
   35  			}
   36  		}
Target 0: (cppcoro_tests) stopped.

yfinkelstein · Dec 23 '18

Given the recursion depth, it's probably not important exactly where it crashes, though.

yfinkelstein · Dec 23 '18

The crash happens in both debug and release builds.

yfinkelstein · Dec 23 '18

If the thread pool size at the top of the test is changed from 3 to 1:

	cppcoro::static_thread_pool tp{ 1 };

the crash no longer happens.

yfinkelstein · Dec 23 '18

Yes, looking at the test, it is indeed a recursion problem: the call to event.set() resumes the awaiting coroutine inline, inside the call to set().

This is both a flaw in the test and a flaw in the design of the async_auto_reset_event class.
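
To make the failure mode concrete, the shape of the problem is roughly this (an illustrative sketch with made-up names, not the actual test code): each waiter, once resumed, does some work and then calls event.set() again, and because set() resumes the next waiter inline, every iteration adds another set()/resume() pair to the stack.

#include <cppcoro/async_auto_reset_event.hpp>
#include <cppcoro/task.hpp>

using namespace cppcoro;

// Illustrative sketch (made-up names, not the actual test code).
// set() resumes a queued waiter inline; if that waiter immediately calls
// set() again, the stack grows by one set()/resume() pair per iteration,
// matching the repeating frames in the backtrace above.
task<void> waiter(async_auto_reset_event& event, int& value)
{
  co_await event;  // may be resumed inline, from inside another coroutine's set()
  ++value;         // "do work"
  event.set();     // resumes the next queued waiter inline -> deeper stack
}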

One solution is to require the waiting coroutine to provide a scheduler, which is used to reschedule the resumption if the event was not yet signalled when it was awaited. E.g.:

co_await event.wait(someScheduler);
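
A sketch of how a consumer might use that (event.wait(scheduler) is the proposed API; it does not exist in cppcoro today):

#include <cppcoro/async_auto_reset_event.hpp>
#include <cppcoro/static_thread_pool.hpp>
#include <cppcoro/task.hpp>

using namespace cppcoro;

// Hypothetical usage of the proposed API; event.wait(scheduler) is not
// part of cppcoro today. If the event was already set, the coroutine
// would continue inline; otherwise its resumption would be posted to the
// scheduler, so set() never runs waiter bodies inline and the stack
// depth stays bounded.
task<void> consumer(async_auto_reset_event& event, static_thread_pool& tp)
{
  co_await event.wait(tp);  // resumed via tp.schedule() if we actually suspended
  // ... do work ...
  event.set();              // now only enqueues the next waiter's resumption
}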

Another solution would be to return a value that indicates whether the co_await completed synchronously, and let the awaiting coroutine decide what to do if it completed asynchronously.

lewissbaker · Dec 26 '18

"provide a scheduler"

I have long wondered how this library could exist without a notion of a scheduler like the one needed for stackful coroutines, e.g. https://www.boost.org/doc/libs/1_69_0/libs/fiber/doc/html/fiber/scheduling.html. When a coroutine is suspended after issuing a non-blocking read, for instance, and then woken up in an I/O event polling loop, there needs to be some sort of I/O scheduler sitting in between: on one hand it maps the FD that received I/O to the coroutine handle, but it also schedules that coroutine to run, in the I/O-scheduler sense of the word. That is, it may have some notion of coroutine priority, rate limits on an FD, ASIO-style strands on an FD, or whatever. It's clear that these concerns need to be factored out, and a coroutine scheduler is a good idea.

yfinkelstein · Dec 27 '18

There is already a notion of a scheduler in this library (see the Scheduler and TimedScheduler concepts and the io_service and static_thread_pool implementations). I can see a need for something like a prioritised or deadline scheduler, however.
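
For reference, a minimal sketch of those two concepts as they exist today (assumes ioSvc.process_events() is being run on some other thread; the function name is illustrative):

#include <cppcoro/io_service.hpp>
#include <cppcoro/static_thread_pool.hpp>
#include <cppcoro/task.hpp>
#include <chrono>

using namespace cppcoro;
using namespace std::chrono_literals;

// static_thread_pool models Scheduler via schedule(); io_service also
// models TimedScheduler via schedule_after().
task<void> hop_around(static_thread_pool& tp, io_service& ioSvc)
{
  co_await tp.schedule();                // Scheduler: hop onto the pool
  co_await ioSvc.schedule_after(100ms);  // TimedScheduler: resume after a delay
}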

By default, a coroutine is resumed inline on the thread that dequeues the I/O completion event. If you want to avoid doing processing inline on the I/O thread, you can reschedule onto another scheduler (possibly a prioritised one) after the coroutine is resumed.

E.g.:

task<void> do_something(socket s, static_thread_pool& tp)
{
  // Wait for data to be received.
  char buffer[100];
  co_await s.recv(buffer, sizeof(buffer));

  // Resumed on I/O thread here

  // Schedule processing of data onto thread pool.
  co_await tp.schedule();

  // Now on thread pool - process data
  process(buffer);
}

Something similar could be done with the async_auto_reset_event tests, e.g. by manually rescheduling onto the thread pool if the operation completes asynchronously (if it completes synchronously, the coroutine resumes inline and there is no recursion to worry about).

task<void> example(async_auto_reset_event& event, static_thread_pool& tp)
{
  completion_type completion = co_await event;
  if (completion == completion_type::asynchronous)
  {
    // We were resumed from some other coroutine calling event.set()
    // Reschedule ourselves onto thread pool so we don't recursively call event.set()
    co_await tp.schedule();
  }

  // do work

  event.set();
}

This approach means that async_auto_reset_event doesn't need to take a dependency on schedulers, but it still gives the consumer enough information to decide whether or not it needs to reschedule to avoid recursion.

However, this design also puts more of the complexity onto the consumer, whereas making a scheduler a mandatory parameter of a .wait() method leads the caller into better patterns, at the cost of some flexibility.

lewissbaker · Dec 31 '18