MPI_Allreduce throws an assertion on Aurora
Calling MPI_Allreduce triggers the assertion below. This showed up with the E3SM app on Aurora, and only when loading the tuning module via module load mpich-config/collective-tuning/1024.
Assertion failed in file src/mpid/common/shm/mpidu_shm_alloc.c at line 692: shm_seg != NULL
backtrace:
#34 0x0000145aa15e1a92 in MPID_Abort (comm=0x0, mpi_errno=0, exit_code=1, error_msg=0x145aa18954c0 "Internal error") at src/mpid/ch4/src/ch4_globals.c:126
#35 0x0000145aa15701b3 in MPIR_Assert_fail (cond=0x145aa18a3dfc "shm_seg != NULL", file_name=0x145aa18a3d23 "src/mpid/common/shm/mpidu_shm_alloc.c", line_num=692) at src/util/mpir_assert.c:28
#36 0x0000145aa175bb45 in MPIDU_shm_free (ptr=0x0) at src/mpid/common/shm/mpidu_shm_alloc.c:692
#37 0x0000145aa1714ff6 in MPIDI_POSIX_mpi_release_gather_comm_free (comm_ptr=0x145a69450c50) at src/mpid/ch4/shm/posix/release_gather/release_gather.c:507
#38 0x0000145aa1714e1c in MPIDI_POSIX_mpi_release_gather_comm_init (comm_ptr=0x145a69450c50, operation=MPIDI_POSIX_RELEASE_GATHER_OPCODE_ALLREDUCE) at src/mpid/ch4/shm/posix/release_gather/release_gather.c:480
#39 0x0000145aa13ba1d7 in MPIDI_POSIX_mpi_allreduce_release_gather (sendbuf=0xc013a9a0, recvbuf=0xc0177a60, count=10800, datatype=1283655711, op=1476395011, comm_ptr=0x145a69450c50, errflag=MPIR_ERR_NONE) at ./src/mpid/ch4/shm/src/../posix/posix_coll_release_gather.h:306
#40 0x0000145aa13b9cdf in MPIDI_POSIX_mpi_allreduce (sendbuf=0xc013a9a0, recvbuf=0xc0177a60, count=10800, datatype=1283655711, op=1476395011, comm=0x145a69450c50, errflag=MPIR_ERR_NONE) at ./src/mpid/ch4/shm/src/../posix/posix_coll.h:332
#41 0x0000145aa13b99ba in MPIDI_SHM_mpi_allreduce (sendbuf=0xc013a9a0, recvbuf=0xc0177a60, count=10800, datatype=1283655711, op=1476395011, comm=0x145a69450c50, errflag=MPIR_ERR_NONE) at ./src/mpid/ch4/shm/src/shm_coll.h:47
#42 0x0000145aa13b8352 in MPIDI_Allreduce_intra_composition_gamma (sendbuf=0xc013a9a0, recvbuf=0xc0177a60, count=10800, datatype=1283655711, op=1476395011, comm=0x145a69450c50, errflag=MPIR_ERR_NONE) at ./src/mpid/ch4/src/ch4_coll_impl.h:709
#43 0x0000145aa13b9381 in MPIDI_Allreduce_allcomm_composition_json (sendbuf=0xc013a9a0, recvbuf=0xc0177a60, count=10800, datatype=1283655711, op=1476395011, comm=0x145a69450c50, errflag=MPIR_ERR_NONE) at ./src/mpid/ch4/src/ch4_coll.h:385
#44 0x0000145aa137e7ca in MPID_Allreduce (sendbuf=0xc013a9a0, recvbuf=0xc0177a60, count=10800, datatype=1283655711, op=1476395011, comm=0x145a69450c50, errflag=MPIR_ERR_NONE) at ./src/mpid/ch4/src/ch4_coll.h:493
#45 0x0000145aa137e145 in MPIR_Allreduce (sendbuf=0xc013a9a0, recvbuf=0xc0177a60, count=10800, datatype=1283655711, op=1476395011, comm_ptr=0x145a69450c50, errflag=MPIR_ERR_NONE) at src/mpi/coll/mpir_coll.c:4862
#46 0x0000145aa0cd0534 in internal_Allreduce (sendbuf=0xc013a9a0, recvbuf=0xc0177a60, count=10800, datatype=1283655711, op=1476395011, comm=-1006632860) at src/binding/c/coll/allreduce.c:126
#47 0x0000145aa0ccf179 in PMPI_Allreduce (sendbuf=0xc013a9a0, recvbuf=0xc0177a60, count=10800, datatype=1275070495, op=1476395011, comm=-1006632860) at src/binding/c/coll/allreduce.c:179
#48 0x0000145aa1f82598 in pmpi_allreduce_ (sendbuf=0xc013a9a0, recvbuf=0xc0177a60, count=0x7ffdf725554c, datatype=0x46403ec <__unnamed_4>, op=0x46404c0 <__unnamed_9>, comm=0x7ffdf7254ed8, ierr=0x7ffdf725492c) at src/binding/fortran/mpif_h/fortran_binding.c:435
Reproducer:
#include <algorithm>   // std::fill
#include <iostream>
#include <vector>
#include <mpi.h>

int main(int argc, char* argv[]) {
    MPI_Init(&argc, &argv);

    int world_rank, world_size;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    const int nElements = 10800;    // Same element count as the original Fortran call in E3SM
    const int num_iterations = 10;  // Run multiple iterations to increase the chance of reproduction

    std::vector<double> inArray(nElements);
    std::vector<double> outArray(nElements);
    for (int i = 0; i < nElements; ++i)
        inArray[i] = static_cast<double>(world_rank * nElements + i);

    std::cout << "Rank " << world_rank << ": Starting MPI_Allreduce with "
              << nElements << " elements." << std::endl;

    for (int iter = 0; iter < num_iterations; ++iter) {
        std::fill(outArray.begin(), outArray.end(), 0.0);
        int mpi_ierr = MPI_Allreduce(inArray.data(), outArray.data(), nElements,
                                     MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
        // Only reached if the communicator's error handler returns errors;
        // with the default MPI_ERRORS_ARE_FATAL the assertion aborts first.
        if (mpi_ierr != MPI_SUCCESS) {
            char error_string[MPI_MAX_ERROR_STRING];
            int length_of_error_string;
            MPI_Error_string(mpi_ierr, error_string, &length_of_error_string);
            std::cerr << "Rank " << world_rank << ": MPI_Allreduce failed with error: "
                      << error_string << std::endl;
        }
    }

    std::cout << "Rank " << world_rank << ": Finished MPI_Allreduce iterations." << std::endl;

    MPI_Finalize();
    return 0;
}
Compiled as:
mpicxx -g -O0 -I${MPI_ROOT}/include -L${MPI_ROOT}/lib -lmpi mpich_all_reduce_e3sm.cpp -o mpich_allreduce_e3sm.out
Ran with these modules and env vars:
module load mpich/dbg/develop-git.6037a7a
module load mpich-config/collective-tuning/1024
module list
export FI_MR_CACHE_MONITOR=disabled
export ZES_ENABLE_SYSMAN=1
export MPIR_CVAR_ENABLE_GPU=1
mpiexec -n 48 --ppn 12 --cpu-bind list:1-8:9-16:17-24:25-32:33-40:41-48:53-60:61-68:69-76:77-84:85-92:93-100 --gpu-bind list:0.0:0.1:1.0:1.1:2.0:2.1:3.0:3.1:4.0:4.1:5.0:5.1 --mem-bind list:0:0:0:0:0:0:1:1:1:1:1:1 -d 1 ./mpich_allreduce_e3sm.out
Shared memory allocation must have failed during release gather collectives initialization. This should not be a fatal error. We'll have to scrub the error paths to make sure they just do their best at cleanup and allow regular collectives to proceed.
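For illustration, a minimal sketch of that fallback idea at the caller level. This is not MPICH's actual internals; release_gather_allreduce and release_gather_cleanup are hypothetical stand-ins for the shm-based fast path and its cleanup.

#include <mpi.h>

// Hypothetical shm-based fast path and its cleanup (illustration only).
extern int release_gather_allreduce(const void *, void *, int,
                                    MPI_Datatype, MPI_Op, MPI_Comm);
extern void release_gather_cleanup(MPI_Comm);

int allreduce_with_fallback(const void *sendbuf, void *recvbuf, int count,
                            MPI_Datatype datatype, MPI_Op op, MPI_Comm comm)
{
    int rc = release_gather_allreduce(sendbuf, recvbuf, count, datatype, op, comm);
    if (rc != MPI_SUCCESS) {
        // Shared-memory setup failed: free any partial state, then let a
        // regular (non-shm) allreduce handle the operation instead of aborting.
        release_gather_cleanup(comm);
        rc = MPI_Allreduce(sendbuf, recvbuf, count, datatype, op, comm);
    }
    return rc;
}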
Is this 100% reproducible?
Yes, it is, but only when loading module load mpich-config/collective-tuning/1024.
Can you try the fix in https://github.com/pmodels/mpich/pull/7457? Also if you can attach your reproducer and how to run to the issue for reference that would be great.
@raffenet Will give it a try on my end. Did you not run into the issue when using the above reproducer?
Oh, sorry. I skimmed right past that when I read the issue. I will try it myself.
I can confirm the patch in #7457 allows the reproducer to complete. I'm going to try and get to the bottom of what failed in the release gather init, because this looks simple enough that I'm surprised by the failure.
Something is going wrong at the shm filename exchange stage. Processes in the shm communicator are unable to open the file created by the root process because it does not exist. Needs more investigation still.
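For reference, a minimal standalone sketch of the general create/broadcast-name/open pattern being described (this is an illustration of the mechanism, not MPICH's implementation; the segment name and size are arbitrary). It can also serve as a quick node-level sanity check of POSIX shared memory behavior.

#include <mpi.h>
#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    // Node-local (shared-memory) communicator.
    MPI_Comm node_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);
    int node_rank;
    MPI_Comm_rank(node_comm, &node_rank);

    char shm_name[64] = {0};
    const off_t seg_size = 4096;
    if (node_rank == 0) {
        // Root creates the segment and picks a name to share with its peers.
        std::snprintf(shm_name, sizeof(shm_name), "/repro_shm_%d", (int)getpid());
        int fd = shm_open(shm_name, O_CREAT | O_RDWR, 0600);
        if (fd >= 0) { ftruncate(fd, seg_size); close(fd); }
    }

    // Filename exchange: peers learn the segment name from the root.
    MPI_Bcast(shm_name, sizeof(shm_name), MPI_CHAR, 0, node_comm);

    if (node_rank != 0) {
        // This is the step that reportedly fails on Aurora: the file the
        // root created cannot be opened by the other node-local ranks.
        int fd = shm_open(shm_name, O_RDWR, 0600);
        if (fd < 0)
            std::perror("shm_open on non-root rank");
        else
            close(fd);
    }

    MPI_Barrier(node_comm);
    if (node_rank == 0) shm_unlink(shm_name);
    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}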
MPICH is trying to use MPIDI_Allreduce_intra_composition_gamma for this collective operation, but that algorithm is supposed to be restricted to communicators spanning only a single node. This appears to be a bug in the tuning file. @zhenggb72 @abrooks98
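For context, a hedged sketch of how one can test whether a communicator is confined to a single node, which is the precondition the intra-node composition assumes. This is illustrative only, not MPICH's selection logic.

#include <mpi.h>

// Returns true if every rank of comm lives on the same shared-memory node.
bool comm_is_single_node(MPI_Comm comm) {
    MPI_Comm node_comm;
    MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL, &node_comm);

    int comm_size, node_size;
    MPI_Comm_size(comm, &comm_size);
    MPI_Comm_size(node_comm, &node_size);
    MPI_Comm_free(&node_comm);

    // If comm spans more than one node, the node-local subcommunicator is
    // strictly smaller than comm on every rank, so a local check suffices.
    return node_size == comm_size;
}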
Hi @raffenet. Thank you for identifying the problem. We will fix the JSON file and get it updated on Aurora.
The issue was caused by loading Intel's custom JSON configuration file, which is at fault. Intel is aware of the issue and is working on a fix. By default, Aurora does not set Intel's JSON file, so the issue is not general. We will close this issue for now; reopen if necessary.