Current thread_id changing after suspending in direct execution
Background
@hkaiser I came across this issue while investigating the current hangs in cancelable_action test.
In that test, the current thread_id is firstly obtained using hpx::this_thread::get_id() (through a reset_id helper object).
Then, we have a workloop that occasionally suspends, waiting to be interrupted (hpx::thread::interrupt(id_)).
https://github.com/STEllAR-GROUP/hpx/blob/bc993dd15b9252908e0b30b9ed5efd830cf99909/examples/cancelable_action/cancelable_action/server/cancelable_action.hpp#L81-L94
Issue
I found that calling hpx::this_thread::get_id() before and after hpx::this_thread::suspend() would often return a different id.
I believe this is due to the way yielding is handled when utilizing direct execution:
https://github.com/STEllAR-GROUP/hpx/blob/bc993dd15b9252908e0b30b9ed5efd830cf99909/libs/core/coroutines/include/hpx/coroutines/detail/coroutine_stackful_self_direct.hpp#L34-L39
coroutine_stackful_self_direct falls back to next_self_ (its calling context) for yielding, which will fail to correctly restore the state after yielding:
https://github.com/STEllAR-GROUP/hpx/blob/bc993dd15b9252908e0b30b9ed5efd830cf99909/libs/core/coroutines/include/hpx/coroutines/detail/coroutine_stackful_self.hpp#L32-L42
If I understand correctly, reset_self_on_exit will cause the static variable local_self_ to refer to the "simple" (coroutine_stackful_self) coroutine after yielding (and not to the direct coroutine where it used to before the yield, see the use of reset_self_on_exit on coroutine_impl::invoke_directly()).
The solution is probably simple, but I just barely got to understanding how these coroutines work, so I just hope I got things right fttb.
@Pansysk75 Yep, that's caused by the direct execution patch. cancelable_action should call get_outer_id instead (here: https://github.com/STEllAR-GROUP/hpx/blob/bc993dd15b9252908e0b30b9ed5efd830cf99909/examples/cancelable_action/cancelable_action/server/cancelable_action.hpp#L61). Excellent catch!
@Pansysk75 Yep, that's caused by the direct execution patch. cancelable_action should call
get_outer_idinstead (here:https://github.com/STEllAR-GROUP/hpx/blob/bc993dd15b9252908e0b30b9ed5efd830cf99909/examples/cancelable_action/cancelable_action/server/cancelable_action.hpp#L61
). Excellent catch!
I just realized that we don't have that - what an oversight! It should be added here: https://github.com/STEllAR-GROUP/hpx/blob/bc993dd15b9252908e0b30b9ed5efd830cf99909/libs/core/threading/include/hpx/threading/thread.hpp#L284 and here: https://github.com/STEllAR-GROUP/hpx/blob/bc993dd15b9252908e0b30b9ed5efd830cf99909/libs/core/threading/src/thread.cpp#L386-L389
The new code should be:
thread::id get_outer_id() noexcept
{
return thread::id(threads::get_outer_self_id());
}
That should work, but as a user I would really wonder what "outer" refers to and why I should use that (if I didn't know the implementation details). I was thinking of perhaps modifying the yield operation so that the state before and after the yield would be the same, if that is at all possible. That should make 'hpx::this_thread::get_id()' return some constant id (the one of the thread being "directly executed" i suppose) Is there a good reason why a different id should be reported right after suspension? Is it a bug or is it something that we expect?
Alternatively we could make the current implementation of get_id call the get_outer_self_id instead. Not sure what this would break, though...