
Parcelport fails to initialize when multiple jobs run on the same cluster

Open antoniupop opened this issue 3 years ago • 4 comments

Expected Behavior

Expected is that multiple independent jobs (e.g. SLURM job array) can run concurrently on the same cluster (on disjoint sets of nodes, not co-scheduled).

Actual Behavior

Only one of the jobs (or none) is able to run, while all the others crash at initialization with the following errors:

the bootstrap parcelport (tcp) has failed to initialize on locality 0:
<unknown>: HPX(network_error),
bailing out
terminate called without an active exception
srun: error: queue1-dy-m5a2xlarge-1: task 0: Exited with exit code 255
the bootstrap parcelport (tcp) has failed to initialize on locality 4294967295:
<unknown>: HPX(network_error),
bailing out
terminate called without an active exception
the bootstrap parcelport (tcp) has failed to initialize on locality 4294967295:
<unknown>: HPX(network_error),
bailing out

Steps to Reproduce the Problem

Schedule multiple jobs on a SLURM cluster without dependencies, each using only a subset of the nodes (so the SLURM scheduler is allowed to start multiple instances concurrently on separate sets of nodes).
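A minimal sketch of such a submission as a SLURM job array; the node counts are placeholders and `./hpx_app` stands in for any HPX application (not a binary from this report):

```shell
#!/bin/bash
# Hypothetical job array: four independent jobs, each allocated
# two nodes, which SLURM may run concurrently on disjoint node sets.
#SBATCH --job-name=hpx-repro
#SBATCH --array=0-3
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1

# Each array task launches its own independent HPX application.
srun ./hpx_app
```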

I also tried using the MPI parcelport with TCP disabled, to no avail (the error message changes, but initialization still fails).

Specifications

  • HPX Version: 1.7.1 and 1.8.1 tried
  • Platform (compiler, OS): Ubuntu / GCC

antoniupop avatar Dec 07 '22 16:12 antoniupop

Disabling the TCP parcelport should help. How did you disable it?

hkaiser avatar Dec 07 '22 17:12 hkaiser

> Disabling the TCP parcelport should help. How did you disable it?

I've used -DHPX_WITH_PARCELPORT_MPI=ON -DHPX_WITH_PARCELPORT_TCP=OFF when building HPX.

antoniupop avatar Dec 07 '22 18:12 antoniupop
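As an alternative to disabling TCP at build time, the TCP parcelport's listening port can in principle be overridden per job at runtime via HPX's `--hpx:ini` option. A hedged sketch of that workaround idea, assuming the documented `hpx.parcel.port` setting (default 7910) and a placeholder `./hpx_app` binary; this is not a confirmed fix for this issue:

```shell
# Give each SLURM array task a distinct bootstrap port so that
# concurrently running jobs cannot collide on the default one.
# SLURM_ARRAY_TASK_ID is set by SLURM inside a job array.
PORT=$((7910 + SLURM_ARRAY_TASK_ID))
srun ./hpx_app --hpx:ini=hpx.parcel.port=$PORT
```

With disjoint node sets a port collision would not normally be expected, so this mainly helps rule the port out as the cause.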

> Disabling the TCP parcelport should help. How did you disable it?
>
> I've used -DHPX_WITH_PARCELPORT_MPI=ON -DHPX_WITH_PARCELPORT_TCP=OFF when building HPX.

Could you give us the error message you see in this case, please?

hkaiser avatar Dec 07 '22 19:12 hkaiser

> I've used -DHPX_WITH_PARCELPORT_MPI=ON -DHPX_WITH_PARCELPORT_TCP=OFF when building HPX.
>
> Could you give us the error message you see in this case, please?

I used to get an error message along the lines of "failed to initialize parcelport" before, but now it crashes with the following backtrace:

0x7f5be6dfc3c0  : /lib/x86_64-linux-gnu/libpthread.so.0(+0x153c0) [0x7f5be6dfc3c0] in /lib/x86_64-linux-gnu/libpthread.so.0
0x7f5be6437513  : /shared/concrete-compiler-internal/compiler/hpx-1.7.1/build/lib/libhpx.so.1(+0x6ec513) [0x7f5be6437513] in /shared/concrete-compiler-internal/compiler/hpx-1.7.1/build/lib/libhpx.so.1
0x7f5be5f8e5f7  : hpx::parcelset::detail::parcel_await_apply(hpx::parcelset::parcel&&, hpx::util::function<void (std::error_code const&, hpx::parcelset::parcel const&), false>&&, unsigned int, hpx::util::unique_function<void (hpx::parcelset::parcel&&, hpx::util::function<void (std::error_code const&, hpx::parcelset::parcel const&), false>&&), false>) [0xc7] in /shared/concrete-compiler-internal/compiler/hpx-1.7.1/build/lib/libhpx.so.1
0x7f5be643cbc2  : void hpx::agas::big_boot_barrier::apply<hpx::actions::direct_action<void (*)(hpx::agas::registration_header const&), &hpx::agas::register_worker, hpx::actions::detail::this_type>, hpx::agas::registration_header>(unsigned int, unsigned int, hpx::parcelset::locality, hpx::actions::direct_action<void (*)(hpx::agas::registration_header const&), &hpx::agas::register_worker, hpx::actions::detail::this_type>, hpx::agas::registration_header&&) [0x1a2] in /shared/concrete-compiler-internal/compiler/hpx-1.7.1/build/lib/libhpx.so.1
0x7f5be64362cc  : hpx::agas::big_boot_barrier::wait_hosted(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned long, unsigned long) [0x4fc] in /shared/concrete-compiler-internal/compiler/hpx-1.7.1/build/lib/libhpx.so.1
0x7f5be644c563  : hpx::runtime_distributed::initialize_agas() [0x283] in /shared/concrete-compiler-internal/compiler/hpx-1.7.1/build/lib/libhpx.so.1
0x7f5be644fd47  : hpx::runtime_distributed::runtime_distributed(hpx::util::runtime_configuration&, int (*)(hpx::runtime_mode)) [0xf17] in /shared/concrete-compiler-internal/compiler/hpx-1.7.1/build/lib/libhpx.so.1
0x7f5be62e176e  : hpx::detail::run_or_start(hpx::util::function<int (hpx::program_options::variables_map&), false> const&, int, char**, hpx::init_params const&, bool) [0xd8e] in /shared/concrete-compiler-internal/compiler/hpx-1.7.1/build/lib/libhpx.so.1

I'm not quite sure what changed for this to now crash instead; I'm still trying to reproduce the previous behaviour. However, there is no difference between the code now running with the MPI parcelport and the initial code using TCP.

antoniupop avatar Dec 08 '22 08:12 antoniupop