Broken Perlmutter GPU build possibly due to STK changes in Trilinos

mcarlson801 opened this issue on Jul 08, 2022

Albany is failing on my Perlmutter CUDA build (A100 GPUs) with the following error:

MPICH ERROR [Rank 0] [job id ] [Thu Jul  7 13:36:09 2022] [nid003709] - Abort(940120067) (rank 0 in comm 0): Fatal error in PMPI_Allreduce: Invalid datatype, error stack:
PMPI_Allreduce(472): MPI_Allreduce(sbuf=0x7fffffff304b, rbuf=0x7fffffff3050, count=1, datatype=MPI_DATATYPE_NULL, op=MPI_LOR, comm=MPI_COMM_WORLD) failed
PMPI_Allreduce(415): Datatype for argument datatype is a null datatype

I can't pinpoint the exact Trilinos/Albany commit at which this started appearing, but it was after I pulled the changes that fixed issue https://github.com/sandialabs/Albany/issues/814. After some debugging, it appears to be an issue related to STK.

After taking a look in the debugger, it seems that the failing Allreduce is using MPI_CXX_BOOL as its datatype, which for some reason is undefined and is being treated as MPI_DATATYPE_NULL. I've attached the full debug output with a backtrace and the list of modules used:

prlm_debug1.txt
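
For reference, the failing pattern boils down to a logical-OR Allreduce of a single bool. A standalone sketch along these lines (hypothetical, not the actual STK call site) should hit the same abort on any MPI installation where MPI_CXX_BOOL is broken:

// Hypothetical minimal reproducer (not the actual STK call site): a
// logical-OR Allreduce of a single bool using MPI_CXX_BOOL. On an MPI
// installation where MPI_CXX_BOOL is aliased to MPI_DATATYPE_NULL, this
// aborts with the same "Invalid datatype" error as above.
#include <mpi.h>
#include <iostream>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);

  bool localFlag = true;
  bool globalFlag = false;
  MPI_Allreduce(&localFlag, &globalFlag, 1, MPI_CXX_BOOL, MPI_LOR, MPI_COMM_WORLD);

  int rank = 0;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  if (rank == 0) {
    std::cout << "globalFlag = " << std::boolalpha << globalFlag << std::endl;
  }

  MPI_Finalize();
  return 0;
}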

My guess was that it had something to do with cray-mpich, since this issue isn't showing up in the nightly tests on Weaver (V100 GPUs), but I could not reproduce it on Cori, so now I think it might be related to CUDA-aware MPI.

@alanw0 Any ideas on what might be causing this?

mcarlson801 avatar Jul 08 '22 16:07 mcarlson801

We did recently start using the MPI_CXX_BOOL type in a couple of cases in stk. I'll refer this to our MPI guru and see what he thinks. I wonder if there's a macro that indicates whether MPI_CXX_BOOL is or is not available...
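
One possibility (just a sketch, an assumption rather than anything stk currently does): the standard doesn't provide a dedicated feature-test macro for MPI_CXX_BOOL, but MPI_VERSION is required to be a preprocessor macro, so a compile-time version check could be combined with a runtime sanity check against MPI_DATATYPE_NULL:

#include <mpi.h>

// Sketch of a possible availability check (not what stk actually does):
// MPI_CXX_BOOL was added in MPI 3.0, so the compile-time check tells us
// whether the header should define it at all, and the runtime comparison
// catches headers that alias it to MPI_DATATYPE_NULL.
inline bool mpi_cxx_bool_usable()
{
#if MPI_VERSION >= 3
  return MPI_CXX_BOOL != MPI_DATATYPE_NULL;
#else
  return false;  // MPI_CXX_BOOL is not part of the standard before MPI 3.0
#endif
}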

alanw0 avatar Jul 08 '22 17:07 alanw0

Yes, that stacktrace points directly to the change I made. MPI_CXX_BOOL is required to exist by the standard as long as the MPI library is compiled with C++ support.

A few things to look at:

  • Find mpi.h. It can usually be located by cd-ing into the directory returned by which mpicxx, then doing cd ../include.
  • In mpi.h, search for MPI_VERSION and MPI_SUBVERSION.
  • In mpi.h, search for MPI_DATATYPE_NULL and MPI_CXX_BOOL.

(alternatively you could write a short program to print out these constants).
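
A minimal version of such a program (just a sketch) might look like this:

// Sketch of the "short program" mentioned above (not part of Albany or stk):
// prints the compile-time and runtime MPI version and checks whether
// MPI_CXX_BOOL is distinct from MPI_DATATYPE_NULL.
#include <mpi.h>
#include <iostream>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);

  int runtimeVersion = 0, runtimeSubversion = 0;
  MPI_Get_version(&runtimeVersion, &runtimeSubversion);

  int rank = 0;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  if (rank == 0) {
    std::cout << "mpi.h version:   " << MPI_VERSION << "." << MPI_SUBVERSION << "\n";
    std::cout << "runtime version: " << runtimeVersion << "." << runtimeSubversion << "\n";
#if MPI_VERSION >= 3
    std::cout << "MPI_CXX_BOOL == MPI_DATATYPE_NULL: " << std::boolalpha
              << (MPI_CXX_BOOL == MPI_DATATYPE_NULL) << "\n";
#else
    std::cout << "MPI_CXX_BOOL: not provided (MPI older than 3.0)\n";
#endif
  }

  MPI_Finalize();
  return 0;
}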

MPI_CXX_BOOL was introduced in MPI version 3.0, so if the version is older than that, it would be a problem. The other question is whether MPI_CXX_BOOL is really equal to MPI_DATATYPE_NULL, or if there is some problem with how MPI_Allreduce is handling MPI_CXX_BOOL.

Unfortunately, I don't have access to any Cray machines, so I can't look into this myself.

JaredCrean2 avatar Jul 08 '22 18:07 JaredCrean2

The version of MPI is 3.1, but when I searched for MPI_CXX_BOOL I found:

/* MPI-3 C++ types */
#define MPI_CXX_BOOL                ((MPI_Datatype)0x0c000000)
#define MPI_CXX_FLOAT_COMPLEX       ((MPI_Datatype)0x0c000000)
#define MPI_CXX_DOUBLE_COMPLEX      ((MPI_Datatype)0x0c000000)
#define MPI_CXX_LONG_DOUBLE_COMPLEX ((MPI_Datatype)0x0c000000)

All of the C++ types are defined to the same value as MPI_DATATYPE_NULL. Looks like I'll need to reach out to Perlmutter support on this one. Thanks!

mcarlson801 avatar Jul 08 '22 18:07 mcarlson801

I checked with NERSC support and this is apparently a known issue on Perlmutter. They opened a ticket with HPE earlier this year, but there isn't really an expected timeline for a fix. I've got a workaround in place for the short term and will update/close this ticket once it is resolved.
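
For anyone else hitting the same cray-mpich bug, one illustrative way to sidestep the broken datatype (not necessarily the workaround used here) is to route a bool reduction through an int:

#include <mpi.h>

// Purely illustrative fallback, not the workaround actually applied on
// Perlmutter: a logical-OR reduction of a bool routed through an int, which
// only needs MPI_INT and therefore avoids MPI_CXX_BOOL entirely.
inline bool allreduce_logical_or(bool localFlag, MPI_Comm comm)
{
  int localInt = localFlag ? 1 : 0;
  int globalInt = 0;
  MPI_Allreduce(&localInt, &globalInt, 1, MPI_INT, MPI_LOR, comm);
  return globalInt != 0;
}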

mcarlson801 avatar Jul 11 '22 16:07 mcarlson801

@mperego This is the issue mentioned in the Albany meeting

mcarlson801 avatar Sep 27 '22 16:09 mcarlson801

Sorry for not following up on this. This issue was fixed by this stk update: https://github.com/trilinos/Trilinos/pull/10914

alanw0 avatar Sep 27 '22 17:09 alanw0

Awesome, thanks! That's great to hear. When I have a moment I'll update my Perlmutter build and close this issue.

mcarlson801 avatar Sep 27 '22 17:09 mcarlson801

Confirmed this fixed our Perlmutter issue, thanks again!

mcarlson801 avatar Sep 28 '22 16:09 mcarlson801

thanks @alanw0 and @mcarlson801!

mperego avatar Sep 28 '22 16:09 mperego