hpx icon indicating copy to clipboard operation
hpx copied to clipboard

Current thread_id changing after suspending in direct execution

Open Pansysk75 opened this issue 2 years ago • 4 comments

Background

@hkaiser I came across this issue while investigating the current hangs in cancelable_action test.

In that test, the current thread_id is firstly obtained using hpx::this_thread::get_id() (through a reset_id helper object). Then, we have a workloop that occasionally suspends, waiting to be interrupted (hpx::thread::interrupt(id_)).

https://github.com/STEllAR-GROUP/hpx/blob/bc993dd15b9252908e0b30b9ed5efd830cf99909/examples/cancelable_action/cancelable_action/server/cancelable_action.hpp#L81-L94

Issue

I found that calling hpx::this_thread::get_id() before and after hpx::this_thread::suspend() would often return a different id. I believe this is due to the way yielding is handled when utilizing direct execution:

https://github.com/STEllAR-GROUP/hpx/blob/bc993dd15b9252908e0b30b9ed5efd830cf99909/libs/core/coroutines/include/hpx/coroutines/detail/coroutine_stackful_self_direct.hpp#L34-L39

coroutine_stackful_self_direct falls back to next_self_ (its calling context) for yielding, which will fail to correctly restore the state after yielding:

https://github.com/STEllAR-GROUP/hpx/blob/bc993dd15b9252908e0b30b9ed5efd830cf99909/libs/core/coroutines/include/hpx/coroutines/detail/coroutine_stackful_self.hpp#L32-L42

If I understand correctly, reset_self_on_exit will cause the static variable local_self_ to refer to the "simple" (coroutine_stackful_self) coroutine after yielding (and not to the direct coroutine where it used to before the yield, see the use of reset_self_on_exit on coroutine_impl::invoke_directly()).

The solution is probably simple, but I just barely got to understanding how these coroutines work, so I just hope I got things right fttb.

Pansysk75 avatar Aug 25 '23 22:08 Pansysk75

@Pansysk75 Yep, that's caused by the direct execution patch. cancelable_action should call get_outer_id instead (here: https://github.com/STEllAR-GROUP/hpx/blob/bc993dd15b9252908e0b30b9ed5efd830cf99909/examples/cancelable_action/cancelable_action/server/cancelable_action.hpp#L61). Excellent catch!

hkaiser avatar Aug 25 '23 22:08 hkaiser

@Pansysk75 Yep, that's caused by the direct execution patch. cancelable_action should call get_outer_id instead (here:

https://github.com/STEllAR-GROUP/hpx/blob/bc993dd15b9252908e0b30b9ed5efd830cf99909/examples/cancelable_action/cancelable_action/server/cancelable_action.hpp#L61

). Excellent catch!

I just realized that we don't have that - what an oversight! It should be added here: https://github.com/STEllAR-GROUP/hpx/blob/bc993dd15b9252908e0b30b9ed5efd830cf99909/libs/core/threading/include/hpx/threading/thread.hpp#L284 and here: https://github.com/STEllAR-GROUP/hpx/blob/bc993dd15b9252908e0b30b9ed5efd830cf99909/libs/core/threading/src/thread.cpp#L386-L389

The new code should be:

        thread::id get_outer_id() noexcept
        {
            return thread::id(threads::get_outer_self_id());
        }

hkaiser avatar Aug 25 '23 22:08 hkaiser

That should work, but as a user I would really wonder what "outer" refers to and why I should use that (if I didn't know the implementation details). I was thinking of perhaps modifying the yield operation so that the state before and after the yield would be the same, if that is at all possible. That should make 'hpx::this_thread::get_id()' return some constant id (the one of the thread being "directly executed" i suppose) Is there a good reason why a different id should be reported right after suspension? Is it a bug or is it something that we expect?

Pansysk75 avatar Aug 25 '23 23:08 Pansysk75

Alternatively we could make the current implementation of get_id call the get_outer_self_id instead. Not sure what this would break, though...

hkaiser avatar Aug 25 '23 23:08 hkaiser