
ch4/part: improved partitioned communications [WIP]

Open thomasgillis opened this issue 2 years ago • 21 comments

Pull Request Description

This PR updates the partitioned communications implementation. It provides the following improvements:

> early bird: start the communication of a partition as soon as it is ready.

This brings several major code changes to the existing partitioned communication capability:

  • introduce tag-based communication built on MPID_isend/MPID_irecv to exploit multiple VCIs and use the available bandwidth to its full potential
  • to fit all the information inside the tag, there is a limit on the number of partitioned communications per source rank (to avoid collisions between messages); the current limit is 1, and if the limit is reached we fall back to the AM code path
  • the AM code path has been refactored to allow early-bird send/recv, but it is limited to a single VCI
  • in every case, the send and receive sides agree on the number of messages actually sent, so that no message carries a fraction of a datatype. By default we use the number of partitions on the send side; if that number is not a divisor of the total number of datatypes on the receive side, we use the gcd of the send-side and receive-side partition counts.

Todolist:

  • [ ] expand the possibility to have multiple partitioned communications per rank
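For context, the early-bird behavior targets the standard MPI-4 partitioned communication API. A minimal usage sketch (assuming an MPI-4 capable build and at least 2 ranks; buffer sizes and tag are arbitrary illustration values):

```c
#include <mpi.h>
#include <stdio.h>

/* Partitioned send/recv: with the early-bird path, each MPI_Pready
 * allows that partition to be transferred as soon as it is filled,
 * instead of waiting for all partitions to be marked ready. */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int nparts = 4, count = 1024; /* elements per partition */
    static double buf[4 * 1024];
    MPI_Request req;

    if (size >= 2 && rank == 0) {
        MPI_Psend_init(buf, nparts, count, MPI_DOUBLE, 1, 0,
                       MPI_COMM_WORLD, MPI_INFO_NULL, &req);
        MPI_Start(&req);
        for (int p = 0; p < nparts; ++p) {
            for (int i = 0; i < count; ++i)
                buf[p * count + i] = (double) p; /* fill partition p */
            MPI_Pready(p, req); /* partition p may be sent now */
        }
        MPI_Wait(&req, MPI_STATUS_IGNORE);
        MPI_Request_free(&req); /* persistent request */
    } else if (size >= 2 && rank == 1) {
        MPI_Precv_init(buf, nparts, count, MPI_DOUBLE, 0, 0,
                       MPI_COMM_WORLD, MPI_INFO_NULL, &req);
        MPI_Start(&req);
        MPI_Wait(&req, MPI_STATUS_IGNORE);
        MPI_Request_free(&req);
    }
    MPI_Finalize();
    return 0;
}
```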

Author Checklist

  • [x] Provide Description Particularly focus on why, not what. Reference background, issues, test failures, xfail entries, etc.
  • [x] Commits Follow Good Practice Commits are self-contained and do not do two things at once. Commit message is of the form: module: short description Commit message explains what's in the commit.
  • [ ] Passes All Tests Whitespace checker. Warnings test. Additional tests via comments.
  • [x] Contribution Agreement For non-Argonne authors, check contribution agreement. If necessary, request an explicit comment from your company's PR approval manager.

thomasgillis avatar Dec 02 '22 00:12 thomasgillis

test:mpich/custom netmod:ch4:ofi testlist:part

raffenet avatar Jan 04 '23 02:01 raffenet

test:mpich/custom netmod:ch4:ofi testlist:part

thomasgillis avatar Jan 04 '23 19:01 thomasgillis

test:mpich/custom netmod:ch4:ofi testlist:part

thomasgillis avatar Jan 04 '23 20:01 thomasgillis

test:mpich/custom netmod:ch4:ofi testlist:part

thomasgillis avatar Jan 26 '23 23:01 thomasgillis

test:mpich/custom netmod:ch4:ofi testlist:part

thomasgillis avatar Jan 27 '23 21:01 thomasgillis

test:mpich/custom netmod:ch4:ofi testlist:part

thomasgillis avatar Feb 01 '23 23:02 thomasgillis

test:mpich/custom netmod:ch4:ofi testlist:part

thomasgillis avatar Feb 08 '23 20:02 thomasgillis

test:mpich/custom netmod:ch4:ofi testlist:part

thomasgillis avatar Feb 08 '23 23:02 thomasgillis

test:mpich/custom netmod:ch4:ofi testlist:part

thomasgillis avatar Feb 09 '23 17:02 thomasgillis

Hi, I was interested in testing and experimenting with this partitioned communication feature on the Perlmutter machine. I built MPICH from the thomasgillis:part-early-bird branch and tried to run a hello-world program, but it hangs in an infinite loop inside MPI_Init() for -ntasks >= 2.

For ntasks >= 2 I also get a segfault if CUDA_VISIBLE_DEVICES isn't set, even though the main code doesn't need CUDA. I was able to bypass the segfault for now by setting it appropriately and running the example on a GPU node.

I do get these warnings during the build and was wondering if they could be the reason:

libtool: warning: relinking 'lib/libmpifort.la'
libtool: install: (cd /global/u2/m/mhaseeb/repos/part-mpi/mpich.IHudebo/mpich; /bin/sh "/global/u2/m/mhaseeb/repos/part-mpi/mpich.IHudebo/mpich/libtool"  --silent --tag FC --mode=relink gfortran -Isrc/binding/fortran/use_mpi -Isrc/binding/fortran/use_mpi_f08 -fallow-argument-mismatch -fPIC -O2 -version-info 0:0:0 -L/opt/cray/pe/pmi/default/lib -L/opt/cray/libfabric/1.15.2.0/lib64 -L/opt/nvidia/hpc_sdk/Linux_x86_64/22.7/cuda/11.7/lib64 -L/opt/cray/xpmem/default/lib -L/opt/cray/xpmem/default/lib64 -o lib/libmpifort.la -rpath /global/homes/m/mhaseeb/repos/part-mpi/./mpich-install-gnu/gnu/lib src/binding/fortran/mpif_h/lib_libmpifort_la-attr_proxy.lo src/binding/fortran/mpif_h/lib_libmpifort_la-fortran_binding.lo src/binding/fortran/mpif_h/lib_libmpifort_la-fdebug.lo src/binding/fortran/mpif_h/lib_libmpifort_la-setbot.lo src/binding/fortran/mpif_h/setbotf.lo src/binding/fortran/use_mpi/mpi.lo src/binding/fortran/use_mpi/mpi_constants.lo src/binding/fortran/use_mpi/mpi_sizeofs.lo src/binding/fortran/use_mpi/mpi_base.lo src/binding/fortran/use_mpi_f08/pmpi_f08.lo src/binding/fortran/use_mpi_f08/mpi_f08.lo src/binding/fortran/use_mpi_f08/mpi_f08_callbacks.lo src/binding/fortran/use_mpi_f08/mpi_f08_compile_constants.lo src/binding/fortran/use_mpi_f08/mpi_f08_link_constants.lo src/binding/fortran/use_mpi_f08/mpi_f08_types.lo src/binding/fortran/use_mpi_f08/mpi_c_interface.lo src/binding/fortran/use_mpi_f08/mpi_c_interface_cdesc.lo src/binding/fortran/use_mpi_f08/mpi_c_interface_glue.lo src/binding/fortran/use_mpi_f08/mpi_c_interface_nobuf.lo src/binding/fortran/use_mpi_f08/mpi_c_interface_types.lo src/binding/fortran/use_mpi_f08/wrappers_f/f08ts.lo src/binding/fortran/use_mpi_f08/wrappers_f/pf08ts.lo src/binding/fortran/use_mpi_f08/wrappers_c/lib_libmpifort_la-f08_cdesc.lo src/binding/fortran/use_mpi_f08/wrappers_c/lib_libmpifort_la-cdesc.lo src/binding/fortran/use_mpi_f08/wrappers_c/lib_libmpifort_la-comm_spawn_c.lo 
src/binding/fortran/use_mpi_f08/wrappers_c/lib_libmpifort_la-comm_spawn_multiple_c.lo src/binding/fortran/use_mpi_f08/wrappers_c/lib_libmpifort_la-utils.lo lib/libmpi.la -lpmi -lpmi2 -lfabric -Wl,--as-needed,-lcudart,--no-as-needed -lcuda )
libtool: warning: '/opt/cray/pe/gcc/11.2.0/snos/lib64/libatomic.la' seems to be moved
libtool: warning: '/opt/cray/pe/gcc/11.2.0/snos/lib64/libstdc++.la' seems to be moved
libtool: install: /usr/bin/install -c lib/.libs/libmpifort.so.0.0.0T /global/homes/m/mhaseeb/repos/part-mpi/./mpich-install-gnu/gnu/lib/libmpifort.so.0.0.0
libtool: install: (cd /global/homes/m/mhaseeb/repos/part-mpi/./mpich-install-gnu/gnu/lib && { ln -s -f libmpifort.so.0.0.0 libmpifort.so.0 || { rm -f libmpifort.so.0 && ln -s libmpifort.so.0.0.0 libmpifort.so.0; }; })
libtool: install: (cd /global/homes/m/mhaseeb/repos/part-mpi/./mpich-install-gnu/gnu/lib && { ln -s -f libmpifort.so.0.0.0 libmpifort.so || { rm -f libmpifort.so && ln -s libmpifort.so.0.0.0 libmpifort.so; }; })
libtool: install: /usr/bin/install -c lib/.libs/libmpifort.lai /global/homes/m/mhaseeb/repos/part-mpi/./mpich-install-gnu/gnu/lib/libmpifort.la
libtool: warning: relinking 'lib/libmpicxx.la'
libtool: install: (cd /global/u2/m/mhaseeb/repos/part-mpi/mpich.IHudebo/mpich; /bin/sh "/global/u2/m/mhaseeb/repos/part-mpi/mpich.IHudebo/mpich/libtool"  --silent --tag CXX --mode=relink g++ -fPIC -O2 -version-info 0:0:0 -L/opt/cray/pe/pmi/default/lib -L/opt/cray/libfabric/1.15.2.0/lib64 -L/opt/nvidia/hpc_sdk/Linux_x86_64/22.7/cuda/11.7/lib64 -L/opt/cray/xpmem/default/lib -L/opt/cray/xpmem/default/lib64 -o lib/libmpicxx.la -rpath /global/homes/m/mhaseeb/repos/part-mpi/./mpich-install-gnu/gnu/lib src/binding/cxx/initcxx.lo lib/libmpi.la -lpmi -lpmi2 -lfabric -Wl,--as-needed,-lcudart,--no-as-needed -lcuda )
libtool: warning: '/opt/cray/pe/gcc/11.2.0/snos/lib64/libatomic.la' seems to be moved
libtool: warning: '/opt/cray/pe/gcc/11.2.0/snos/lib64/libstdc++.la' seems to be moved

Any help with the infinite loop would be appreciated. Thanks.

mhaseeb123 avatar Jun 22 '23 23:06 mhaseeb123

@mhaseeb123 thanks for reaching out. FWIW the branch is disconnected from main, and merging it back is on my summer todo list :-)

Have you tried the main branch, and does it work there? Could you also detail your configure options (especially the VCI-related ones)? Also, out of curiosity, what is your use case for this?

thomasgillis avatar Jun 23 '23 16:06 thomasgillis

@thomasgillis thank you for your response. Yes, I am aware the branch is disconnected from the main branch. I did build pmodels:mpich:main with the same setup and it works perfectly (except for the segfault, which is bypassed when CUDA_VISIBLE_DEVICES is set). Maybe I should also build thomasgillis:main and see whether it works too.

I was thinking of merging the code from thomasgillis:part-early-bird into pmodels:mpich:main to see if it works. I wanted to study the benefits and applications of the partitioned communication feature in a couple of scientific codes.

Yes, here is my config command. Assume all vars are appropriately set.

./configure ${opts} \
            CPPFLAGS=${CPPFLAGS} \
            CC=${CC} \
            CFLAGS=${CFLAGS} \
            CXX=${CXX} \
            FC=${FC} \
            FCFLAGS=${FCFLAGS} \
            F77=${FC} \
            FFLAGS=${FCFLAGS} \
            LIBS="${LIBS}" \
            LDFLAGS="${LDFLAGS}" \
            MPICHLIB_CFLAGS="-fPIC" \
            MPICHLIB_CXXFLAGS="-fPIC" \
            MPICHLIB_FFLAGS="-fPIC" \
            MPICHLIB_FCFLAGS="-fPIC"

Here opts = --with-cuda --with-device=ch4:ofi --with-libfabric=${FABRIC_PREFIX} --with-libfabric-include=${FABRIC_INCLUDE} --with-libfabric-lib=${FABRIC_LIB_DIR} --enable-fast=O2 --with-pm=no --with-pmi=cray --with-xpmem=${XPMEM} --with-wrapper-dl-type=rpath --enable-threads=multiple --enable-shared=yes --enable-static=no --with-namepublisher=file

mhaseeb123 avatar Jun 23 '23 19:06 mhaseeb123

Ok, do you mind sharing your test code? It will be easier to debug. Also, is the issue you observe in MPI_Init or in MPI_Psend_init/MPI_Precv_init?

Finally, could you report the cuda visible issue separately?

thomasgillis avatar Jun 23 '23 20:06 thomasgillis

test:mpich/custom netmod:ch4:ofi testlist:part

thomasgillis avatar Jun 23 '23 20:06 thomasgillis

Right now I am literally only running an MPI Hello World, which is as follows:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);

    // see if MPI came out of init
    //printf("Hello!");
    //fflush(stdout);

    int comm_sz;
    MPI_Comm_size(MPI_COMM_WORLD, &comm_sz);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char hostname[MPI_MAX_PROCESSOR_NAME];
    int name_sz;
    MPI_Get_processor_name(hostname, &name_sz);

    printf("Hello from %s, rank %d (of %d).\n", hostname, rank, comm_sz);

    MPI_Finalize();
    return 0;
}

The code never prints anything after MPI_Init, so I am assuming it's stuck there. Yes, I can report that issue separately; it's not really causing any blocking issues. Here is what I see when I run it with only 2 ranks:

[screenshot: output of the hanging run with 2 ranks]

mhaseeb123 avatar Jun 23 '23 21:06 mhaseeb123

ok, it doesn't sound like it's related to this PR then :-) I will try to rebase ASAP to incorporate the recent fixes; hopefully that resolves the issue.

thomasgillis avatar Jun 23 '23 21:06 thomasgillis

Yayy! I am hoping that as well, hence my idea of locally merging this PR into the current pmodels:mpich:main. Thank you for your help @thomasgillis, really appreciate it!

Update: I built from the thomasgillis:main branch and got the same hang inside MPI_Init, so I think rebasing on the current mpich:main would definitely solve it. I am also working on locally patching mpich:main with the changes from this PR to see if that helps! Thanks again, cheers!

mhaseeb123 avatar Jun 23 '23 21:06 mhaseeb123

Hi @thomasgillis, sorry to bother you again. I tried to merge your part-early-bird branch into my local fork of mpich:main, but due to my unfamiliarity with the implementation, all my rebasing attempts produced buggy merged code that wouldn't compile. I would really appreciate it if you could rebase your branch at your earliest convenience. Thank you, and looking forward to trying out the new partitioned communication feature in MPICH!

mhaseeb123 avatar Jun 30 '23 17:06 mhaseeb123

test:mpich/custom netmod:ch4:ofi testlist:part

thomasgillis avatar Aug 17 '23 18:08 thomasgillis

@mhaseeb123 I have just rebased it; going to work on the merge soon.

FYI, there are a few works exploring partitioned communication usage and when it is valuable, including our recent publication: https://arxiv.org/abs/2308.03930

thomasgillis avatar Aug 17 '23 18:08 thomasgillis

@thomasgillis thank you for the update. I will check out the preprint today. I have been able to merge your branch into my local fork without issues, so that's a good sign so far. I will build from it in a bit and run some examples/tests. Thank you again for doing this! Really appreciate it. I will let you know in the unlikely event that I run into any issues!

mhaseeb123 avatar Aug 17 '23 21:08 mhaseeb123