The new LCW (or MPIx) parcelport

Open JiakunYan opened this issue 1 month ago • 3 comments

This adds an advanced MPI parcelport that can use the MPICH VCI and Continuation extensions.

The exact MPI implementation is wrapped in a thin layer, the Lightweight Communication Wrapper (LCW). This layer provides active message, send/recv, completion queue, and device abstractions on top of MPI (with or without extensions), GASNet-EX, and LCI.

The GASNet-EX backend of LCW cannot be used in HPX for now, as it lacks send/recv support.

Important CMake Variables:

  • HPX_WITH_PARCELPORT_LCW (default: OFF): enable the LCW parcelport.
  • HPX_WITH_FETCH_LCW (default: OFF): enable LCW autofetch.

Important Runtime Variables:

  • --hpx:ini=hpx.parcel.lcw.ndevices=<n> (default: 2): the number of LCW devices to use. Each LCW device is mapped to an MPI communicator or an LCI device.

TODO:

  • [x] Add CI.
  • [x] Add documentation.

JiakunYan, Nov 28 '25 18:11

@JiakunYan Thank you for working on this!

hkaiser, Nov 28 '25 21:11

@hkaiser Do you have any suggestions on how I might resolve the GCC 15 errors? They seem to be related to the C++20 modules.

Also, the new parcelport can occasionally fail the tests.unit.modules.runtime_components.distributed.lcw.migrate_polymorphic_component test. Do you have thoughts on what might happen (e.g., does this test involve large metadata transfers)?

JiakunYan, Nov 30 '25 16:11

The migration error is unrelated; we see it occasionally on other platforms as well. For the GCC 15 errors (https://cdash.rostam.cct.lsu.edu/viewBuildError.php?buildid=33246), you are simply missing an #include <cstdint>. You may also need to qualify the types (e.g. std::uint32_t).
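
A minimal sketch of that kind of fix, for illustration only: the function name pack_header and its body are hypothetical and not taken from the LCW parcelport sources. It shows just the two ingredients mentioned above, including <cstdint> explicitly and spelling the fixed-width types with the std:: prefix.

```cpp
// Hypothetical illustration; pack_header is not a function from HPX or LCW.
#include <cstdint>    // declares std::uint32_t, std::uint64_t, ...

// Qualify the fixed-width types with std:: instead of relying on the
// unqualified global names, which <cstdint> is not required to provide.
std::uint64_t pack_header(std::uint32_t tag, std::uint32_t rank)
{
    // Place the tag in the upper 32 bits and the rank in the lower 32 bits.
    return (static_cast<std::uint64_t>(tag) << 32) | rank;
}
```

Code that compiled before without the include was probably picking up these names transitively through other standard headers, which is not guaranteed to keep working across compiler versions or when headers are consumed differently (for example with C++20 modules).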

Please ignore the strange CI problems on rostam (where no results are reported at all); we're working on fixing this.

hkaiser, Dec 04 '25 15:12