cppcoro icon indicating copy to clipboard operation
cppcoro copied to clipboard

Random test failure of schedule_on() operator tests under msvc x86 debug

Open lewissbaker opened this issue 6 years ago • 5 comments

There have been some random test-failures on the CI servers.

eg. This test run failed with:

c:\projects\cppcoro\test\scheduling_operator_tests.cpp(21)
TEST SUITE: schedule/resume_on
TEST CASE:  schedule_on task<> function

c:\projects\cppcoro\test\scheduling_operator_tests.cpp(40) ERROR!
  CHECK( std::this_thread::get_id() == ioThreadId )
with expansion:
  CHECK( 3936 == 3800 )

There are a couple of working theories here:

  • caused by a compiler bug - msvc x86 coroutines support is still a little unreliable
  • caused by a lack of thread-safety in doctest - long shot as only one thread at a time should be executing CHECK statements
  • caused by a bug in schedule_on() operator itself - the implementation of this is a fairly thin wrapper over co_await scheduler.schedule()

lewissbaker avatar Apr 09 '18 05:04 lewissbaker

I had also in x64 (debug and release) from time to time such an error. The expected thread was not the one on which the coroutine resumed. I could identify the source of the trouble to be this change: https://github.com/lewissbaker/cppcoro/commit/316ce6c7dd40c1e78f65f60d886db1eb8cc570dd#diff-54dfdc33313e0a90b0e5d173a2d22e46 The schedule_on coroutine is returning a task. The task::awaitable_base::await_suspend may return false if the target thread is fast enough and sets the promise continuation before it. Thus it is resuming directly on the calling thread, instead to be suspended and resumed on the target thread. To fix this, you could add a policy to the task or make schedule_on return a different type (ex task_resume_deferred) that implements the awaitable_base::await_suspend as:

void await_suspend(std::experimental::coroutine_handle<> awaiter) noexcept
{
  // NOTE: This is the difference to regular task, it may not resume directly from here.
  m_coroutine.promise().try_set_continuation(detail::continuation{ awaiter });
  m_coroutine.resume();
}

tavi-cacina avatar Jun 10 '18 19:06 tavi-cacina

I have a commit where I added a resume_policy to the task, what do you think? I have tested also with a separate task_resume_deferred type returned from the schedule_on, but it does not compose so well.

tavi-cacina avatar Jul 17 '18 13:07 tavi-cacina

@tavi-cacina Thanks for the investigation and for the proposed patch!

The schedule_on operator only specifies that the coroutine will start execution on the specified scheduler and not that it will complete and resume the awaiting on the specified scheduler.

When this test was first implemented the task<T> implementation actually guaranteed to resume the awaiting coroutine on the same thread that the task's coroutine completed on. With the current implementation, as you correctly point out, it's possible for the task's coroutine to be scheduled onto another thread and run to completion before the caller is able to register the continuation and so the awaiting coroutine is resumed immediately on the thread executing the co_await rather than on the thread the task completed on.

This test is currently relying on the old behaviour of task to resume the awaiting coroutine on the same thread that it completed on.

For now we could just remove the second CHECK in this test as it's not really testing the schedule_on operator but rather that the task resumes the awaiting coroutine on the thread that it completed on (which is currently not the case). https://github.com/lewissbaker/cppcoro/blob/2492c0782911cdf7fe803642040efff3faa3c28a/test/scheduling_operator_tests.cpp#L33-L40

The correct solution here is to take advantage of the new variant of await_suspend() that allows returning a coroutine_handle to resume with a guaranteed tail-call (aka symmetric transfer) that can avoid the unbounded stack-growth for mutually recursive coroutine resumption.

There is an experimental implementation of this version of task in the tailcall branch. I did not merge this change into master at the time as MSVC did not yet support the ofawait_suspend overload. However I believe that MSVC now supports this feature so perhaps it is time to merge this change in?

lewissbaker avatar Jul 17 '18 16:07 lewissbaker

You are right, I was confused because I played with a custom Scheduler and expected to be after co_await context.schedule(someTask) in the other context. The signature was task schedule(task) and that caused this behavior. I saw then this issue here, thought that is the same. I am glad that there is no need to change the task for this :-)

About the tailcall I do not have the insight into it. Do everything you think is an improvement!

tavi-cacina avatar Jul 17 '18 20:07 tavi-cacina

Update: MSVC still does not support symmetric transfer (at least for 2017 Update 9). I am still yet to test against 2019 builds but I have not heard that thes

Note that I have since merged some of the changes to task from the tailcall branch that allows it to use symmetric-transfer to avoid these problems under clang builds. See PR #91.

lewissbaker avatar Mar 04 '19 21:03 lewissbaker