
barrier lockup

Open biddisco opened this issue 3 years ago • 3 comments

I have an unusual setup: an HPX thread is working on some message sending. To improve latency, I call parcelport->progress(), which jumps into the parcelport code and in effect triggers the background work. This functionality is exposed in my libfabric branch, and I was hoping to move it into the main branch at some point.

The task on rank 0 looks like this:

loop
    send some messages
    if messages_in_flight > threshold
        parcelport->progress()  (and reduce the number in_flight)
    endif
endloop
hpx::distributed::barrier::synchronize();

The task on other ranks looks the same.
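
For concreteness, a minimal C++ sketch of that per-rank task; the header, the Parcelport type and the bookkeeping helpers are placeholders, and progress() is the entry point from my libfabric branch, not HPX master:

#include <hpx/hpx.hpp>   // assumed catch-all header

#include <atomic>
#include <cstddef>

std::atomic<std::size_t> messages_in_flight{0};   // placeholder bookkeeping
constexpr std::size_t threshold = 64;
bool more_to_send();                               // defined elsewhere in the test
void send_next_message();                          // increments messages_in_flight

template <typename Parcelport>
void sender_task(Parcelport* pp)
{
    while (more_to_send())
    {
        send_next_message();
        if (messages_in_flight > threshold)
        {
            // drive the parcelport background work inline; completions
            // handled in here reduce messages_in_flight again
            pp->progress();
        }
    }
    hpx::distributed::barrier::synchronize();
}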

Unfortunately, I get deadlocks: during the call to progress() on rank 0, rank 1 might enter the barrier and send a barrier-related message to rank 0, which has not yet entered the barrier. Rank 0 responds by suspending the thread from inside barrier_node::set_event(), which then stops the task on rank 0 from ever reaching the barrier on rank 0, resulting in a deadlock.

<DEB> 0020168093 -------------- 0x7ffff138d640 cpu --- oryx2(-1)     PPORT   Rank                 00 Suspended threads Stack trace 0 : 0x555555fdf6c0 : 
19 frames:
0x7ffff673b04c  : hpx::threads::execution_agent::do_yield(char const*, hpx::threads::thread_schedule_state) [0x296] in /home/biddisco/build/hpx-rma/lib/libhpx_cored.so
0x7ffff673ac66  : hpx::threads::execution_agent::suspend(char const*) [0x2c] in /home/biddisco/build/hpx-rma/lib/libhpx_cored.so
0x7ffff652b930  : hpx::execution_base::agent_ref::suspend(char const*) [0xca] in /home/biddisco/build/hpx-rma/lib/libhpx_cored.so
0x7ffff664c786  : hpx::lcos::local::detail::condition_variable::wait(std::unique_lock<hpx::lcos::local::spinlock>&, char const*, hpx::error_code&) [0x12a] in /home/biddisco/build/hpx-rma/lib/libhpx_cored.so
0x7ffff7574bbc  : /home/biddisco/build/hpx-rma/lib/libhpxd.so.1(+0x8aebbc) [0x7ffff7574bbc] in /home/biddisco/build/hpx-rma/lib/libhpxd.so.1
0x7ffff75629fd  : hpx::distributed::detail::barrier_node::set_event() [0x135] in /home/biddisco/build/hpx-rma/lib/libhpxd.so.1
0x7ffff741a2bd  : hpx::lcos::base_lco::set_event_nonvirt() [0x23] in /home/biddisco/build/hpx-rma/lib/libhpxd.so.1
0x7ffff74250a0  : /home/biddisco/build/hpx-rma/lib/libhpxd.so.1(+0x75f0a0) [0x7ffff74250a0] in /home/biddisco/build/hpx-rma/lib/libhpxd.so.1
0x7ffff7424b6a  : /home/biddisco/build/hpx-rma/lib/libhpxd.so.1(+0x75eb6a) [0x7ffff7424b6a] in /home/biddisco/build/hpx-rma/lib/libhpxd.so.1
0x7ffff742371c  : /home/biddisco/build/hpx-rma/lib/libhpxd.so.1(+0x75d71c) [0x7ffff742371c] in /home/biddisco/build/hpx-rma/lib/libhpxd.so.1
0x7ffff741ce0d  : void hpx::actions::transfer_continuation_action<hpx::lcos::base_lco::set_event_action>::schedule_thread<>(hpx::util::pack_c<unsigned long>, hpx::naming::gid_type const&, void*, int, unsigned long) [0x343] in /home/biddisco/build/hpx-rma/lib/libhpxd.so.1
0x7ffff741aee1  : hpx::actions::transfer_continuation_action<hpx::lcos::base_lco::set_event_action>::schedule_thread(hpx::naming::gid_type const&, void*, int, unsigned long) [0x3d] in /home/biddisco/build/hpx-rma/lib/libhpxd.so.1
0x7ffff741b007  : hpx::actions::transfer_continuation_action<hpx::lcos::base_lco::set_event_action>::load_schedule(hpx::serialization::input_archive&, hpx::naming::gid_type&&, void*, int, unsigned long, bool&) [0x79] in /home/biddisco/build/hpx-rma/lib/libhpxd.so.1
0x7ffff7713362  : hpx::parcelset::detail::parcel::load_schedule(hpx::serialization::input_archive&, unsigned long, bool&) [0x1c2] in /home/biddisco/build/hpx-rma/lib/libhpxd.so.1
0x7ffff772bf17  : hpx::parcelset::parcel::load_schedule(hpx::serialization::input_archive&, unsigned long, bool&) [0x81] in /home/biddisco/build/hpx-rma/lib/libhpxd.so.1
0x555555611825  : /home/biddisco/build/hpx-rma/bin/network_storage(+0xbd825) [0x555555611825] in /home/biddisco/build/hpx-rma/bin/network_storage
0x55555560eafe  : /home/biddisco/build/hpx-rma/bin/network_storage(+0xbaafe) [0x55555560eafe] in /home/biddisco/build/hpx-rma/bin/network_storage
0x55555560e779  : /home/biddisco/build/hpx-rma/bin/network_storage(+0xba779) [0x55555560e779] in /home/biddisco/build/hpx-rma/bin/network_storage
0x55555560b50e  : /home/biddisco/build/hpx-rma/bin/network_storage(+0xb750e) [0x55555560b50e] in /home/biddisco/build/hpx-rma/bin/network_storage
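
To illustrate the pattern I believe the trace shows, here is an HPX-agnostic sketch using plain standard-library primitives (all names are made up, and the sketch blocks where HPX would suspend the task):

#include <condition_variable>
#include <mutex>

std::mutex m;
std::condition_variable cv;
bool local_rank_arrived = false;

// handler for the remote rank's "I entered the barrier" message
void on_remote_arrival()
{
    std::unique_lock<std::mutex> lk(m);
    cv.wait(lk, [] { return local_rank_arrived; });   // blocks here
}

void progress()
{
    on_remote_arrival();   // the message is dispatched inline on the calling thread
}

int main()
{
    progress();   // never returns: the handler waits for the local arrival ...

    // ... which can only happen after progress() returns, so this is never reached
    {
        std::lock_guard<std::mutex> lk(m);
        local_rank_arrived = true;
    }
    cv.notify_all();
}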

This behaviour is obviously not standard practice for HPX, since I am progressing the network on an HPX task. However, I'd like to know whether this is actually a bug in the barrier code: the barrier has not yet been reached on rank 0 (this is not necessarily the first use, and after N uses it might not be reset correctly; hpx::distributed::barrier::synchronize() uses a global barrier), so it should not be yielding before the local rank has entered the barrier.

Can this be fixed, or is it a show-stopper that prevents my use case?

biddisco avatar Mar 31 '22 14:03 biddisco

John,

the barrier (local and distributed) API by design requires all parties to join before any party can join a second time (barrier epochs shouldn't overlap). The user code is required to ensure this.

Since HPX uses a global barrier internally only during startup and shutdown (i.e. before hpx_main is executed and after it has exited), it should be possible for you to enable your own progress functionality only if hpx::threads::threadmanager_is(hpx::state::running) returns true.
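
Something along these lines (the header name and the Parcelport progress hook are placeholders, this is only a sketch of the guard):

#include <hpx/hpx.hpp>   // assumed catch-all header

template <typename Parcelport>
void maybe_progress(Parcelport* pp)
{
    // drive user-side progress only once the runtime is fully up, so the
    // startup/shutdown uses of the global barrier never interleave with it
    if (hpx::threads::threadmanager_is(hpx::state::running))
    {
        pp->progress();   // placeholder for the user-provided progress hook
    }
}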

hkaiser avatar Apr 01 '22 14:04 hkaiser

These conditions are satisfied. hpx::distributed::barrier::synchronize() is used during startup, so the requirement that all parties join before any party joins a second time always holds: the barrier has already been entered and exited on all ranks during startup, so by the time my user code runs the barrier has been initialized, and threadmanager_is(hpx::state::running) must be true since the runtime is always up and running when the tests are run.

In addition, the barrier lockups do not happen on the first iteration, but usually later on, after the barrier has been used several times. When running on 3 ranks, I have noticed that a lockup only ever happens when ranks 1 and 2 enter the barrier and rank 0 receives messages from the other ranks before it has entered the barrier itself. In other words, it happens when rank 0 is the last to enter the barrier on a non-first execution of the barrier.

Is it possible the barrier reset/wait is the victim of a condition-variable-style wait that is notified before the thread that wants to wait has actually started waiting?
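
For clarity, the generic pattern I have in mind (plain standard-library primitives, nothing HPX-specific):

#include <condition_variable>
#include <mutex>

std::mutex m;
std::condition_variable cv;
bool signalled = false;

void notifier()
{
    {
        std::lock_guard<std::mutex> lk(m);
        signalled = true;              // state change recorded under the lock
    }
    cv.notify_all();                   // may fire before anyone is waiting
}

void waiter_with_predicate()
{
    std::unique_lock<std::mutex> lk(m);
    cv.wait(lk, [] { return signalled; });   // an early notify is not lost
}

void waiter_without_predicate()
{
    std::unique_lock<std::mutex> lk(m);
    cv.wait(lk);                       // an early notify is missed: waits forever
}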

biddisco avatar Apr 02 '22 20:04 biddisco

The global barrier by default simply uses a local barrier on locality 0 for the synchronization (see https://github.com/STEllAR-GROUP/hpx/blob/master/libs/full/collectives/src/detail/barrier_node.cpp#L308). The local barrier's arrive_and_wait() is here: https://github.com/STEllAR-GROUP/hpx/blob/master/libs/core/synchronization/include/hpx/synchronization/barrier.hpp#L214-L223.

Essentially, everything happens under a lock, so I don't see a way for it to misbehave. However, as always with concurrent code, a second pair of eyes might see more...
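
For reference, the usual shape of such a lock-protected, reusable barrier is a generation counter plus a predicate-checked wait under a single lock; this is only an illustration of the pattern, not the actual HPX implementation:

#include <condition_variable>
#include <cstddef>
#include <mutex>

class simple_barrier
{
    std::mutex m;
    std::condition_variable cv;
    std::size_t const expected;
    std::size_t arrived = 0;
    std::size_t generation = 0;

public:
    explicit simple_barrier(std::size_t n) : expected(n) {}

    void arrive_and_wait()
    {
        std::unique_lock<std::mutex> lk(m);
        std::size_t const gen = generation;
        if (++arrived == expected)
        {
            arrived = 0;
            ++generation;      // open the next epoch and release the waiters
            cv.notify_all();
            return;
        }
        // waiting on the generation counter under the same lock means a
        // notification that arrives "early" cannot be lost
        cv.wait(lk, [&] { return gen != generation; });
    }
};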

hkaiser avatar Apr 02 '22 22:04 hkaiser

This was fixed on master. Please reopen if needed.

hkaiser avatar Aug 21 '23 16:08 hkaiser