
MPI_Allreduce throws an assertion on Aurora

abagusetty opened this issue 6 months ago • 8 comments

Calling MPI_Allreduce triggers the assertion below. This showed up with the E3SM app on Aurora, and only when the mpich-config/collective-tuning/1024 module is loaded (module load mpich-config/collective-tuning/1024).

Assertion failed in file src/mpid/common/shm/mpidu_shm_alloc.c at line 692: shm_seg != NULL

backtrace:

#34 0x0000145aa15e1a92 in MPID_Abort (comm=0x0, mpi_errno=0, exit_code=1, error_msg=0x145aa18954c0 "Internal error") at src/mpid/ch4/src/ch4_globals.c:126
#35 0x0000145aa15701b3 in MPIR_Assert_fail (cond=0x145aa18a3dfc "shm_seg != NULL", file_name=0x145aa18a3d23 "src/mpid/common/shm/mpidu_shm_alloc.c", line_num=692) at src/util/mpir_assert.c:28
#36 0x0000145aa175bb45 in MPIDU_shm_free (ptr=0x0) at src/mpid/common/shm/mpidu_shm_alloc.c:692
#37 0x0000145aa1714ff6 in MPIDI_POSIX_mpi_release_gather_comm_free (comm_ptr=0x145a69450c50) at src/mpid/ch4/shm/posix/release_gather/release_gather.c:507
#38 0x0000145aa1714e1c in MPIDI_POSIX_mpi_release_gather_comm_init (comm_ptr=0x145a69450c50, operation=MPIDI_POSIX_RELEASE_GATHER_OPCODE_ALLREDUCE) at src/mpid/ch4/shm/posix/release_gather/release_gather.c:480
#39 0x0000145aa13ba1d7 in MPIDI_POSIX_mpi_allreduce_release_gather (sendbuf=0xc013a9a0, recvbuf=0xc0177a60, count=10800, datatype=1283655711, op=1476395011, comm_ptr=0x145a69450c50, errflag=MPIR_ERR_NONE) at ./src/mpid/ch4/shm/src/../posix/posix_coll_release_gather.h:306
#40 0x0000145aa13b9cdf in MPIDI_POSIX_mpi_allreduce (sendbuf=0xc013a9a0, recvbuf=0xc0177a60, count=10800, datatype=1283655711, op=1476395011, comm=0x145a69450c50, errflag=MPIR_ERR_NONE) at ./src/mpid/ch4/shm/src/../posix/posix_coll.h:332
#41 0x0000145aa13b99ba in MPIDI_SHM_mpi_allreduce (sendbuf=0xc013a9a0, recvbuf=0xc0177a60, count=10800, datatype=1283655711, op=1476395011, comm=0x145a69450c50, errflag=MPIR_ERR_NONE) at ./src/mpid/ch4/shm/src/shm_coll.h:47
#42 0x0000145aa13b8352 in MPIDI_Allreduce_intra_composition_gamma (sendbuf=0xc013a9a0, recvbuf=0xc0177a60, count=10800, datatype=1283655711, op=1476395011, comm=0x145a69450c50, errflag=MPIR_ERR_NONE) at ./src/mpid/ch4/src/ch4_coll_impl.h:709
#43 0x0000145aa13b9381 in MPIDI_Allreduce_allcomm_composition_json (sendbuf=0xc013a9a0, recvbuf=0xc0177a60, count=10800, datatype=1283655711, op=1476395011, comm=0x145a69450c50, errflag=MPIR_ERR_NONE) at ./src/mpid/ch4/src/ch4_coll.h:385
#44 0x0000145aa137e7ca in MPID_Allreduce (sendbuf=0xc013a9a0, recvbuf=0xc0177a60, count=10800, datatype=1283655711, op=1476395011, comm=0x145a69450c50, errflag=MPIR_ERR_NONE) at ./src/mpid/ch4/src/ch4_coll.h:493
#45 0x0000145aa137e145 in MPIR_Allreduce (sendbuf=0xc013a9a0, recvbuf=0xc0177a60, count=10800, datatype=1283655711, op=1476395011, comm_ptr=0x145a69450c50, errflag=MPIR_ERR_NONE) at src/mpi/coll/mpir_coll.c:4862
#46 0x0000145aa0cd0534 in internal_Allreduce (sendbuf=0xc013a9a0, recvbuf=0xc0177a60, count=10800, datatype=1283655711, op=1476395011, comm=-1006632860) at src/binding/c/coll/allreduce.c:126
#47 0x0000145aa0ccf179 in PMPI_Allreduce (sendbuf=0xc013a9a0, recvbuf=0xc0177a60, count=10800, datatype=1275070495, op=1476395011, comm=-1006632860) at src/binding/c/coll/allreduce.c:179
#48 0x0000145aa1f82598 in pmpi_allreduce_ (sendbuf=0xc013a9a0, recvbuf=0xc0177a60, count=0x7ffdf725554c, datatype=0x46403ec <__unnamed_4>, op=0x46404c0 <__unnamed_9>, comm=0x7ffdf7254ed8, ierr=0x7ffdf725492c) at src/binding/fortran/mpif_h/fortran_binding.c:435

#include <algorithm> // for std::fill
#include <iostream>
#include <vector>
#include <mpi.h>

int main(int argc, char* argv[]) {
    MPI_Init(&argc, &argv);

    int world_rank, world_size;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    const int nElements = 10800;   // same count as in the original E3SM Fortran code
    const int num_iterations = 10; // run multiple iterations to increase the chance of reproduction

    std::vector<double> inArray(nElements);
    std::vector<double> outArray(nElements);

    for (int i = 0; i < nElements; ++i) 
        inArray[i] = static_cast<double>(world_rank * nElements + i);

    std::cout << "Rank " << world_rank << ": Starting MPI_Allreduce with " << nElements << " elements." << std::endl;

    for (int iter = 0; iter < num_iterations; ++iter) {
        std::fill(outArray.begin(), outArray.end(), 0.0);

        int mpi_ierr = MPI_Allreduce(inArray.data(), outArray.data(), nElements, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

        if (mpi_ierr != MPI_SUCCESS) {
            char error_string[MPI_MAX_ERROR_STRING];
            int length_of_error_string;
            MPI_Error_string(mpi_ierr, error_string, &length_of_error_string);
            std::cerr << "Rank " << world_rank << ": MPI_Allreduce failed with error: " << error_string << std::endl;
        }
    }

    std::cout << "Rank " << world_rank << ": Finished MPI_Allreduce iterations." << std::endl;
    MPI_Finalize();
    return 0;
}

Compiled as:

mpicxx -g -O0 -I${MPI_ROOT}/include -L${MPI_ROOT}/lib mpich_all_reduce_e3sm.cpp -o mpich_allreduce_e3sm.out -lmpi

Ran with these modules and env vars:

module load mpich/dbg/develop-git.6037a7a
module load mpich-config/collective-tuning/1024
module list

export FI_MR_CACHE_MONITOR=disabled
export ZES_ENABLE_SYSMAN=1
export MPIR_CVAR_ENABLE_GPU=1

mpiexec -n 48 --ppn 12 --cpu-bind list:1-8:9-16:17-24:25-32:33-40:41-48:53-60:61-68:69-76:77-84:85-92:93-100 --gpu-bind list:0.0:0.1:1.0:1.1:2.0:2.1:3.0:3.1:4.0:4.1:5.0:5.1 --mem-bind list:0:0:0:0:0:0:1:1:1:1:1:1 -d 1 ./mpich_allreduce_e3sm.out

abagusetty avatar Jun 13 '25 22:06 abagusetty

Shared memory allocation must have failed during release gather collectives initialization. This should not be a fatal error. We'll have to scrub the error paths to make sure they just do their best at cleanup and allow regular collectives to proceed.

raffenet avatar Jun 16 '25 14:06 raffenet
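
As a minimal, hypothetical sketch (made-up names, not the actual MPICH source) of the cleanup policy described above: a failed shared-memory setup frees only what was actually allocated and reports a soft error, so the caller can fall back to a non-shm collective instead of asserting the way MPIDU_shm_free(ptr=0x0) did in the backtrace.

// Hypothetical sketch only: illustrates "best-effort cleanup, then fall back"
// rather than the real MPIDI_POSIX_mpi_release_gather_comm_init code path.
#include <cstdio>
#include <cstdlib>

struct ShmSegment {
    void  *base = nullptr;  // mapped region; stays nullptr if allocation never happened
    size_t size = 0;
};

// Stand-in allocator that simulates the failure seen on Aurora (always fails here).
static bool shm_alloc(ShmSegment &seg, size_t size) {
    (void)seg; (void)size;
    return false;
}

enum InitResult { INIT_OK, INIT_FALLBACK };

static InitResult release_gather_init_sketch(ShmSegment &seg, size_t size) {
    if (!shm_alloc(seg, size)) {
        // Best-effort cleanup: only release what exists; never "free" a null segment.
        if (seg.base != nullptr) {
            std::free(seg.base);   // placeholder for the real unmap/free routine
            seg.base = nullptr;
        }
        return INIT_FALLBACK;      // caller should pick a regular (non-shm) algorithm
    }
    return INIT_OK;
}

int main() {
    ShmSegment seg;
    if (release_gather_init_sketch(seg, 1 << 20) == INIT_FALLBACK)
        std::puts("shm init failed; falling back to regular collectives");
    return 0;
}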

Is this 100% reproducible?

hzhou avatar Jun 16 '25 14:06 hzhou

Is this 100% reproducible?

Yes, it is. It happens only when the mpich-config/collective-tuning/1024 module is loaded.

abagusetty avatar Jun 16 '25 15:06 abagusetty

Can you try the fix in https://github.com/pmodels/mpich/pull/7457? Also if you can attach your reproducer and how to run to the issue for reference that would be great.

raffenet avatar Jun 16 '25 15:06 raffenet

Can you try the fix in #7457? Also if you can attach your reproducer and how to run to the issue for reference that would be great.

@raffenet Will give it a try on my end. Did you not run into the issue when using the above reproducer?

abagusetty avatar Jun 16 '25 15:06 abagusetty

Can you try the fix in #7457? Also if you can attach your reproducer and how to run to the issue for reference that would be great.

@raffenet Will give it a try on my end. Did you not run into the issue when using the above reproducer?

Oh, sorry. I skimmed right past that when I read the issue. I will try it myself.

raffenet avatar Jun 16 '25 15:06 raffenet

I can confirm the patch in #7457 allows the reproducer to complete. I'm going to try and get to the bottom of what failed in the release gather init, because this looks simple enough that I'm surprised by the failure.

raffenet avatar Jun 16 '25 19:06 raffenet

I can confirm the patch in #7457 allows the reproducer to complete. I'm going to try and get to the bottom of what failed in the release gather init, because this looks simple enough that I'm surprised by the failure.

Something is going wrong at the shm filename exchange stage. Processes in the shm communicator are unable to open the file created by the root process because it does not exist. Needs more investigation still.

raffenet avatar Jun 16 '25 21:06 raffenet
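
For context, a self-contained sketch of the general filename-exchange pattern described above, assuming the usual POSIX shm approach (the object name and structure here are made up and are not the MPICH implementation): the root rank creates a named shared-memory object, broadcasts its name, and the other ranks open it. Since /dev/shm is node-local, the open can only succeed for ranks on the root's node, which is why a communicator that spans nodes sees a file that "does not exist".

// Sketch of the pattern, not MPICH code: root creates and names a POSIX shm
// object, the name is broadcast, non-root ranks try to open it.
#include <mpi.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstdio>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const size_t seg_size = 4096;
    char name[64] = {0};

    int fd = -1;
    if (rank == 0) {
        std::snprintf(name, sizeof(name), "/shm_exchange_demo_%d", (int)getpid());
        fd = shm_open(name, O_CREAT | O_RDWR, 0600);   // root creates the segment
        if (fd >= 0 && ftruncate(fd, (off_t)seg_size) != 0)
            std::perror("ftruncate");
    }

    // Exchange the filename; non-root ranks then try to open the same object.
    MPI_Bcast(name, sizeof(name), MPI_CHAR, 0, MPI_COMM_WORLD);
    if (rank != 0)
        fd = shm_open(name, O_RDWR, 0600);             // fails if not on root's node

    if (fd < 0) {
        std::fprintf(stderr, "rank %d: could not open %s\n", rank, name);
    } else {
        void *base = mmap(nullptr, seg_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (base != MAP_FAILED) munmap(base, seg_size);
        close(fd);
    }

    MPI_Barrier(MPI_COMM_WORLD);
    if (rank == 0) shm_unlink(name);
    MPI_Finalize();
    return 0;
}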

I can confirm the patch in #7457 allows the reproducer to complete. I'm going to try and get to the bottom of what failed in the release gather init, because this looks simple enough that I'm surprised by the failure.

Something is going wrong at the shm filename exchange stage. Processes in the shm communicator are unable to open the file created by the root process because it does not exist. Needs more investigation still.

MPICH is trying to use MPIDI_Allreduce_intra_composition_gamma for this collective operation, but that algorithm is supposed to be restricted to communicators spanning only a single node. This appears to be a bug in the tuning file @zhenggb72 @abrooks98.

raffenet avatar Jun 17 '25 15:06 raffenet
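
The single-node restriction mentioned above can be illustrated from user code with MPI_Comm_split_type(MPI_COMM_TYPE_SHARED); this is only a sketch of the constraint, not how MPICH's tuning layer actually selects compositions.

// Illustration of the constraint: split by shared-memory domain and compare
// sizes. If the node-local subcommunicator is smaller than the original
// communicator, the communicator spans multiple nodes and a node-local
// (shm-based) composition should not be selected for it.
#include <mpi.h>
#include <cstdio>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    MPI_Comm node_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);

    int world_rank, world_size, node_size;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);
    MPI_Comm_size(node_comm, &node_size);

    if (world_rank == 0) {
        if (node_size == world_size)
            std::printf("communicator is node-local (%d ranks)\n", world_size);
        else
            std::printf("communicator spans multiple nodes (%d of %d ranks are node-local)\n",
                        node_size, world_size);
    }

    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}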

Hi @raffenet. Thank you for identifying the problem. We will fix the JSON file and get it updated on Aurora.

rithwiktom avatar Jun 18 '25 16:06 rithwiktom

The issue was caused by loading Intel's custom JSON configuration file, which is at fault. Intel is aware of the issue and is working on a fix. By default, Aurora does not set Intel's JSON file, so the issue is not general. We will close this issue for now; reopen if necessary.

hzhou avatar Jul 02 '25 18:07 hzhou