Broken Perlmutter GPU build possibly due to STK changes in Trilinos
Albany is failing on my Perlmutter CUDA build (A100 GPUs) with the following error:
MPICH ERROR [Rank 0] [job id ] [Thu Jul 7 13:36:09 2022] [nid003709] - Abort(940120067) (rank 0 in comm 0): Fatal error in PMPI_Allreduce: Invalid datatype, error stack:
PMPI_Allreduce(472): MPI_Allreduce(sbuf=0x7fffffff304b, rbuf=0x7fffffff3050, count=1, datatype=MPI_DATATYPE_NULL, op=MPI_LOR, comm=MPI_COMM_WORLD) failed
PMPI_Allreduce(415): Datatype for argument datatype is a null datatype
I can't pinpoint the exact Trilinos/Albany commit at which this started appearing, but it was after I pulled the changes that fixed issue https://github.com/sandialabs/Albany/issues/814. After some debugging, it appears to be an issue related to STK.
After taking a look in the debugger, it seems that the failing Allreduce is using MPI_CXX_BOOL as the datatype, which for some reason is undefined and is being treated like MPI_DATATYPE_NULL. I've attached the full debug output with the backtrace and the list of modules used.
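For reference, here is a minimal standalone sketch that approximates the failing call (this is not the actual STK code, and the file name and build/run lines are just assumptions for illustration); on a healthy MPI it should run cleanly, while on the broken install it should abort with the same "Invalid datatype" error:

```cpp
// repro_cxx_bool.cpp -- hypothetical minimal reproducer, not the actual STK call.
// Build/run (assumption): mpicxx repro_cxx_bool.cpp -o repro_cxx_bool && srun -n 2 ./repro_cxx_bool
#include <mpi.h>
#include <iostream>

int main(int argc, char** argv)
{
  MPI_Init(&argc, &argv);

  bool local  = true;
  bool global = false;

  // Same pattern as the error stack: a logical-OR reduction on a C++ bool.
  // If MPI_CXX_BOOL is (incorrectly) MPI_DATATYPE_NULL, this aborts with
  // "Invalid datatype".
  MPI_Allreduce(&local, &global, 1, MPI_CXX_BOOL, MPI_LOR, MPI_COMM_WORLD);

  int rank = 0;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  if (rank == 0) {
    std::cout << "Allreduce with MPI_CXX_BOOL succeeded, result = "
              << std::boolalpha << global << std::endl;
  }

  MPI_Finalize();
  return 0;
}
```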
My guess was that it had something to do with cray-mpich, since this issue isn't showing up in the nightly tests on Weaver (V100 GPUs), but I could not reproduce it on Cori, so now I think it might be something related to CUDA-aware MPI.
@alanw0 Any ideas on what might be causing this?
We did recently start using the MPI_CXX_BOOL type in a couple of cases in stk. I'll refer this to our MPI guru and see what he thinks. I wonder if there's a macro that indicates whether MPI_CXX_BOOL is or is not available...
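There doesn't appear to be a dedicated feature macro for MPI_CXX_BOOL itself; the usual compile-time signal is MPI_VERSION, and a runtime comparison against MPI_DATATYPE_NULL catches installs (like this one) that advertise MPI 3.x but ship a null handle. A minimal guard sketch, with the helper name made up for illustration:

```cpp
#include <mpi.h>

// Hypothetical helper: returns true only if MPI_CXX_BOOL can actually be used.
bool mpi_cxx_bool_usable()
{
#if defined(MPI_VERSION) && MPI_VERSION >= 3
  // MPI_CXX_BOOL exists as a name from MPI 3.0 on, but a broken install can
  // still define it as the null datatype, so check at run time as well.
  return MPI_CXX_BOOL != MPI_DATATYPE_NULL;
#else
  // Pre-3.0 MPI: the type is not defined at all.
  return false;
#endif
}
```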
Yes, that stacktrace points directly to the change I made. MPI_CXX_BOOL is required by the standard to exist as long as the MPI library is compiled with C++ support.
A few things to look at:
- find mpi.h. It can usually be found by cd-ing into the directory returned by which mpicxx, then doing cd ../include
- in mpi.h, search for MPI_VERSION and MPI_SUBVERSION
- in mpi.h, search for MPI_DATATYPE_NULL and MPI_CXX_BOOL (alternatively you could write a short program to print out these constants; see the sketch after this list).
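As a concrete starting point for that last item, a short diagnostic along these lines (a sketch; the file name and build line are assumptions that depend on the local toolchain) prints the advertised MPI version and whether MPI_CXX_BOOL collapses to the null datatype:

```cpp
// check_mpi_types.cpp -- small diagnostic sketch for the constants mentioned above.
// Build/run (assumption): mpicxx check_mpi_types.cpp -o check_mpi_types && ./check_mpi_types
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv)
{
  MPI_Init(&argc, &argv);

  // Compile-time version the headers claim to implement.
  std::printf("MPI_VERSION=%d MPI_SUBVERSION=%d\n", MPI_VERSION, MPI_SUBVERSION);

  // Version the library reports at run time (can differ from the headers).
  int ver = 0, subver = 0;
  MPI_Get_version(&ver, &subver);
  std::printf("MPI_Get_version: %d.%d\n", ver, subver);

  // The actual question: is MPI_CXX_BOOL a real datatype or the null handle?
  std::printf("MPI_CXX_BOOL == MPI_DATATYPE_NULL ? %s\n",
              MPI_CXX_BOOL == MPI_DATATYPE_NULL ? "yes (broken)" : "no (ok)");

  MPI_Finalize();
  return 0;
}
```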
MPI_CXX_BOOL was introduced in MPI version 3.0, so if the version is older than that, it would be a problem. The other question is whether MPI_CXX_BOOL is really equal to MPI_DATATYPE_NULL, or if there is some problem with how MPI_Allreduce is handling MPI_CXX_BOOL.
Unfortunately, I don't have access to any Cray machines, so I can't look into this myself.
The version of MPI is 3.1, but when I searched for MPI_CXX_BOOL I found:
/* MPI-3 C++ types */
#define MPI_CXX_BOOL ((MPI_Datatype)0x0c000000)
#define MPI_CXX_FLOAT_COMPLEX ((MPI_Datatype)0x0c000000)
#define MPI_CXX_DOUBLE_COMPLEX ((MPI_Datatype)0x0c000000)
#define MPI_CXX_LONG_DOUBLE_COMPLEX ((MPI_Datatype)0x0c000000)
All of the C++ types are equivalent to MPI_DATATYPE_NULL. Looks like I'll need to reach out to Perlmutter support on this one. Thanks!
I checked with NERSC support and this is apparently a known issue on Perlmutter. They opened a ticket with HPE earlier this year, but there isn't really an expected timeline for a fix. I've got a workaround in place for the short term, and I will update/close this ticket once it is resolved.
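The workaround itself isn't described in the thread; one common stopgap for this kind of breakage (purely illustrative, not necessarily what was done here) is to avoid MPI_CXX_BOOL altogether and reduce the flag as an int:

```cpp
#include <mpi.h>

// Illustrative only: reduce a bool flag without touching MPI_CXX_BOOL by
// promoting it to int and using MPI_INT with MPI_LOR.
bool logical_or_all(bool local_flag, MPI_Comm comm)
{
  int local  = local_flag ? 1 : 0;
  int global = 0;
  MPI_Allreduce(&local, &global, 1, MPI_INT, MPI_LOR, comm);
  return global != 0;
}
```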
@mperego This is the issue mentioned in the Albany meeting
Sorry for not following up on this. This issue was fixed by this stk update: https://github.com/trilinos/Trilinos/pull/10914
Awesome, thanks! That's great to hear. When I have a moment I'll update my Perlmutter build and close this issue.
Confirmed this fixed our Perlmutter issue, thanks again!
thanks @alanw0 and @mcarlson801!