ch4/part: improved partitioned communications [WIP]
Pull Request Description
This PR updates the partitioned communications implementation. It provides the following improvements:
> early bird: start the communication of a partition as soon as it's ready.
This brings several major code changes to the existing partitioned communication capability:
- introduce a tag-based communication path based on `MPID_isend`/`MPID_irecv` to exploit multiple VCIs and the full available bandwidth. To fit all the information inside the tag, there is a limit on the number of partitioned communications per source rank (to avoid collisions in the messages). The current limit is `1`. If the limit is reached, we fall back to the AM code path.
- the AM code path has been refactored to allow early-bird send/recv, but it is limited to a single VCI.
- in every case, the send and receive sides agree on the number of messages to actually be sent, to avoid sending fractions of a datatype. By default we try to use the number of partitions on the send side. If that number is not a divisor of the total number of datatypes on the recv side, we use the `gcd` of the number of partitions on the send and the recv sides.
Todolist:
- [ ] expand the possibility to have multiple partitioned communications per rank
Author Checklist
- [x] Provide Description: particularly focus on why, not what. Reference background, issues, test failures, xfail entries, etc.
- [x] Commits Follow Good Practice: commits are self-contained and do not do two things at once; the commit message is of the form `module: short description` and explains what's in the commit.
- [ ] Passes All Tests: whitespace checker, warnings test, additional tests via comments.
- [x] Contribution Agreement: for non-Argonne authors, check the contribution agreement. If necessary, request an explicit comment from your company's PR approval manager.
test:mpich/custom netmod:ch4:ofi testlist:part
Hi, I was interested in testing and experimenting with this partitioned communication feature on the Perlmutter machine. I built MPICH from the `thomasgillis:part-early-bird` branch and tried to run hello world, but it goes into an infinite loop inside `MPI_Init()` for `-ntasks >= 2`.
I also get a segfault if `CUDA_VISIBLE_DEVICES` isn't set, even when the main code doesn't need CUDA. For now, I was able to bypass the segfault by setting it appropriately and running the example on a GPU node.
I do get these warnings during the build and was wondering if this could be the reason:
libtool: warning: relinking 'lib/libmpifort.la'
libtool: install: (cd /global/u2/m/mhaseeb/repos/part-mpi/mpich.IHudebo/mpich; /bin/sh "/global/u2/m/mhaseeb/repos/part-mpi/mpich.IHudebo/mpich/libtool" --silent --tag FC --mode=relink gfortran -Isrc/binding/fortran/use_mpi -Isrc/binding/fortran/use_mpi_f08 -fallow-argument-mismatch -fPIC -O2 -version-info 0:0:0 -L/opt/cray/pe/pmi/default/lib -L/opt/cray/libfabric/1.15.2.0/lib64 -L/opt/nvidia/hpc_sdk/Linux_x86_64/22.7/cuda/11.7/lib64 -L/opt/cray/xpmem/default/lib -L/opt/cray/xpmem/default/lib64 -o lib/libmpifort.la -rpath /global/homes/m/mhaseeb/repos/part-mpi/./mpich-install-gnu/gnu/lib src/binding/fortran/mpif_h/lib_libmpifort_la-attr_proxy.lo src/binding/fortran/mpif_h/lib_libmpifort_la-fortran_binding.lo src/binding/fortran/mpif_h/lib_libmpifort_la-fdebug.lo src/binding/fortran/mpif_h/lib_libmpifort_la-setbot.lo src/binding/fortran/mpif_h/setbotf.lo src/binding/fortran/use_mpi/mpi.lo src/binding/fortran/use_mpi/mpi_constants.lo src/binding/fortran/use_mpi/mpi_sizeofs.lo src/binding/fortran/use_mpi/mpi_base.lo src/binding/fortran/use_mpi_f08/pmpi_f08.lo src/binding/fortran/use_mpi_f08/mpi_f08.lo src/binding/fortran/use_mpi_f08/mpi_f08_callbacks.lo src/binding/fortran/use_mpi_f08/mpi_f08_compile_constants.lo src/binding/fortran/use_mpi_f08/mpi_f08_link_constants.lo src/binding/fortran/use_mpi_f08/mpi_f08_types.lo src/binding/fortran/use_mpi_f08/mpi_c_interface.lo src/binding/fortran/use_mpi_f08/mpi_c_interface_cdesc.lo src/binding/fortran/use_mpi_f08/mpi_c_interface_glue.lo src/binding/fortran/use_mpi_f08/mpi_c_interface_nobuf.lo src/binding/fortran/use_mpi_f08/mpi_c_interface_types.lo src/binding/fortran/use_mpi_f08/wrappers_f/f08ts.lo src/binding/fortran/use_mpi_f08/wrappers_f/pf08ts.lo src/binding/fortran/use_mpi_f08/wrappers_c/lib_libmpifort_la-f08_cdesc.lo src/binding/fortran/use_mpi_f08/wrappers_c/lib_libmpifort_la-cdesc.lo src/binding/fortran/use_mpi_f08/wrappers_c/lib_libmpifort_la-comm_spawn_c.lo 
src/binding/fortran/use_mpi_f08/wrappers_c/lib_libmpifort_la-comm_spawn_multiple_c.lo src/binding/fortran/use_mpi_f08/wrappers_c/lib_libmpifort_la-utils.lo lib/libmpi.la -lpmi -lpmi2 -lfabric -Wl,--as-needed,-lcudart,--no-as-needed -lcuda )
libtool: warning: '/opt/cray/pe/gcc/11.2.0/snos/lib64/libatomic.la' seems to be moved
libtool: warning: '/opt/cray/pe/gcc/11.2.0/snos/lib64/libstdc++.la' seems to be moved
libtool: install: /usr/bin/install -c lib/.libs/libmpifort.so.0.0.0T /global/homes/m/mhaseeb/repos/part-mpi/./mpich-install-gnu/gnu/lib/libmpifort.so.0.0.0
libtool: install: (cd /global/homes/m/mhaseeb/repos/part-mpi/./mpich-install-gnu/gnu/lib && { ln -s -f libmpifort.so.0.0.0 libmpifort.so.0 || { rm -f libmpifort.so.0 && ln -s libmpifort.so.0.0.0 libmpifort.so.0; }; })
libtool: install: (cd /global/homes/m/mhaseeb/repos/part-mpi/./mpich-install-gnu/gnu/lib && { ln -s -f libmpifort.so.0.0.0 libmpifort.so || { rm -f libmpifort.so && ln -s libmpifort.so.0.0.0 libmpifort.so; }; })
libtool: install: /usr/bin/install -c lib/.libs/libmpifort.lai /global/homes/m/mhaseeb/repos/part-mpi/./mpich-install-gnu/gnu/lib/libmpifort.la
libtool: warning: relinking 'lib/libmpicxx.la'
libtool: install: (cd /global/u2/m/mhaseeb/repos/part-mpi/mpich.IHudebo/mpich; /bin/sh "/global/u2/m/mhaseeb/repos/part-mpi/mpich.IHudebo/mpich/libtool" --silent --tag CXX --mode=relink g++ -fPIC -O2 -version-info 0:0:0 -L/opt/cray/pe/pmi/default/lib -L/opt/cray/libfabric/1.15.2.0/lib64 -L/opt/nvidia/hpc_sdk/Linux_x86_64/22.7/cuda/11.7/lib64 -L/opt/cray/xpmem/default/lib -L/opt/cray/xpmem/default/lib64 -o lib/libmpicxx.la -rpath /global/homes/m/mhaseeb/repos/part-mpi/./mpich-install-gnu/gnu/lib src/binding/cxx/initcxx.lo lib/libmpi.la -lpmi -lpmi2 -lfabric -Wl,--as-needed,-lcudart,--no-as-needed -lcuda )
libtool: warning: '/opt/cray/pe/gcc/11.2.0/snos/lib64/libatomic.la' seems to be moved
libtool: warning: '/opt/cray/pe/gcc/11.2.0/snos/lib64/libstdc++.la' seems to be moved
Any help with the infinite loop would be appreciated. Thanks.
@mhaseeb123 thanks for reaching out.
FWIW the branch is disconnected from `main`, and it's on my summer todo list to merge it back :-)
Have you tried the `main` branch, and does it work there?
Could you also detail your config options (especially the `vci` ones)?
Also, out of curiosity, what is your use case for this?
@thomasgillis thank you for your response. Yes, I am aware the branch is disconnected from the `main` branch. I did build the `pmodels:mpich:main` branch using the same setup and it works perfectly (except for the segfault, which is bypassed if `CUDA_VISIBLE_DEVICES` is set). Maybe I should also build `thomasgillis:main` and see if it works too.
I was thinking of merging the code from `thomasgillis:part-early-bird` into `pmodels:mpich:main` to see if that works. I wanted to study the advantages and applications of the partitioned communication feature in a couple of scientific codes.
Yes, here is my config command. Assume all vars are appropriately set.
./configure ${opts} \
CPPFLAGS=${CPPFLAGS} \
CC=${CC} \
CFLAGS=${CFLAGS} \
CXX=${CXX} \
FC=${FC} \
FCFLAGS=${FCFLAGS} \
F77=${FC} \
FFLAGS=${FCFLAGS} \
LIBS="${LIBS}" \
LDFLAGS="${LDFLAGS}" \
MPICHLIB_CFLAGS="-fPIC" \
MPICHLIB_CXXFLAGS="-fPIC" \
MPICHLIB_FFLAGS="-fPIC" \
MPICHLIB_FCFLAGS="-fPIC"
Here opts = --with-cuda --with-device=ch4:ofi --with-libfabric=${FABRIC_PREFIX} --with-libfabric-include=${FABRIC_INCLUDE} --with-libfabric-lib=${FABRIC_LIB_DIR} --enable-fast=O2 --with-pm=no --with-pmi=cray --with-xpmem=${XPMEM} --with-wrapper-dl-type=rpath --enable-threads=multiple --enable-shared=yes --enable-static=no --with-namepublisher=file
Ok, do you mind sharing your test code? It will be easier to debug.
Also, is the issue you observe in `MPI_Init` or in `MPI_Psend_init`/`MPI_Precv_init`?
Finally, could you report the CUDA visibility issue separately?
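For context, a minimal partitioned exchange looks roughly like the sketch below. This is a generic illustration of the MPI 4.0 partitioned point-to-point API, not code from this PR; error checking is omitted, and it assumes exactly two ranks (run with `mpiexec -n 2`):

```c
#include <mpi.h>
#include <stdio.h>

/* Hypothetical minimal partitioned point-to-point example (MPI 4.0):
 * rank 0 sends 4 partitions of 256 doubles each to rank 1. */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int nparts = 4, count = 256;  /* elements per partition */
    double buf[4 * 256];
    MPI_Request req;

    if (rank == 0) {
        MPI_Psend_init(buf, nparts, count, MPI_DOUBLE, 1, 0,
                       MPI_COMM_WORLD, MPI_INFO_NULL, &req);
        MPI_Start(&req);
        for (int p = 0; p < nparts; ++p) {
            /* ... fill partition p here, then mark it ready to send */
            MPI_Pready(p, req);
        }
        MPI_Wait(&req, MPI_STATUS_IGNORE);
        MPI_Request_free(&req);
    } else if (rank == 1) {
        MPI_Precv_init(buf, nparts, count, MPI_DOUBLE, 0, 0,
                       MPI_COMM_WORLD, MPI_INFO_NULL, &req);
        MPI_Start(&req);
        MPI_Wait(&req, MPI_STATUS_IGNORE);
        MPI_Request_free(&req);
    }

    MPI_Finalize();
    return 0;
}
```

The early-bird improvement in this PR targets the send side of exactly this pattern: each `MPI_Pready(p, ...)` can start the transfer of partition `p` immediately rather than waiting for all partitions.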
test:mpich/custom netmod:ch4:ofi testlist:part
Right now I am literally only running MPI Hello World, which is as follows:
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);

    // see if MPI came out of init
    //printf("Hello!");
    //fflush(stdout);

    int comm_sz;
    MPI_Comm_size(MPI_COMM_WORLD, &comm_sz);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char hostname[MPI_MAX_PROCESSOR_NAME];
    int name_sz;
    MPI_Get_processor_name(hostname, &name_sz);

    printf("Hello from %s, rank %d (of %d).\n", hostname, rank, comm_sz);

    MPI_Finalize();
    return 0;
}
The code never prints anything after `MPI_Init`, so I am assuming it's stuck there. Yes, I can report that issue separately; it's not really causing any blocking issues. Here is what I see when I run it with only 2 ranks.
Ok, it doesn't sound like it's related to this PR then :-) I will try to rebase ASAP to incorporate the recent fixes, which will hopefully resolve the issue.
Yayy! I am hoping that as well, hence my thought of locally merging this PR into the current `pmodels:mpich:main`. Thank you for your help @thomasgillis, really appreciate it!
Update: I built from the `thomasgillis:main` branch and got the same hang inside `MPI_Init`, so I think rebasing onto the current `mpich:main` would definitely solve it. I am also working on locally patching `mpich:main` with the changes from this PR to see if that helps! Thanks again, cheers!
Hi @thomasgillis, sorry for bothering you again. I tried to merge your `part-early-bird` branch into my local fork of `mpich:main`, but due to my unfamiliarity with the implementation, all my rebasing efforts led to buggy merged code that couldn't compile. I would really appreciate it if you could rebase your branch at your earliest convenience. Thank you, and looking forward to trying out the new partitioned communication feature in MPICH!
test:mpich/custom netmod:ch4:ofi testlist:part
@mhaseeb123 I have just rebased it, going to work on the merge soon.
FYI, there are a few works exploring partitioned communication usage and when it's valuable, including our recent publication: https://arxiv.org/abs/2308.03930
@thomasgillis thank you for the update. I will check out the preprint today. I have been able to merge your branch into my local fork without issues, so that's a good sign so far. I will build from it shortly and run some examples/tests. Thank you again for doing this! Really appreciate it. I will let you know if I run into any issues!