E3SM MMF with P3/SHOC negative/NaN temperature (LOW PRIORITY)

I am testing E3SM-MMF with the new P3/SHOC microphysics, which is built on the Kokkos backend, on Crusher. I cannot use all of the GPU GCDs on one node: the case runs with at most 4 GCDs, and it crashes as shown below when I run with 8 GCDs. When I turn off MPI support in the code, however, I can run with up to 8 GCDs. This suggests to me that the problem is related to MPICH GPU binding. Could you help me figure out the cause? The script I use to build the code is as follows:

#!/bin/bash

source $MODULESHOME/init/bash
module purge
module load PrgEnv-cray/8.3.3 craype-accel-amd-gfx90a rocm/4.5.0
module load craype-network-ofi cray-mpich/8.1.14
module load cray-hdf5 cray-netcdf cmake

unset ARCH
unset YAKL_ARCH
unset NCRMS
unset MACH
unset CC
unset CXX
unset FC

export MPIR_CVAR_GPU_EAGER_DEVICE_MEM=0
export MPICH_GPU_SUPPORT_ENABLED=1
export MPICH_SMP_SINGLE_COPY_MODE=CMA

export YAKL_DEBUG=true
export MACH="crusher"
export NCHOME=${NETCDF_DIR}
export NFHOME=${NETCDF_DIR}
export MPIHOME=${CRAY_MPICH_DIR}
export NCRMS=168
export CC=hipcc
export CXX=hipcc
export FC=ftn
export FFLAGS=" -O3 -h noomp -h noacc -I${ROCM_PATH}/include "
export CXXFLAGS=" -O3 -I${ROCM_PATH}/include "
export ARCH="HIP"
export YAKL_ARCH="HIP"
export YAKL_HIP_FLAGS="-O3 -D__HIP_ROCclr__ -D__HIP_ARCH_GFX90A__=1 --rocm-path=${ROCM_PATH} --offload-arch=gfx90a -x hip "
export YAKL_HOME="pwd/../../../../../../../../externals/YAKL"

The output from the crashed run is:

:0: : Device-side assertion ' failed.
:0: : Device-side assertion ' failed.
:0:rocdevice.cpp :2589: 500268655608 us: Device::callbackQueue aborting with error : HSA_STATUS_ERROR_EXCEPTION: An HSAIL operation resulted in a hardware exception. code: 0x1016
Error! Temperature has <= 0
srun: error: crusher012: tasks 1-5,7: Aborted
srun: launch/slurm: _step_signal: Terminating StepId=102297.52
slurmstepd: error: *** STEP 102297.52 ON crusher012 CANCELLED AT 2022-04-22T13:31:34 ***
srun: error: crusher012: tasks 0,6: Terminated
srun: Force Terminated StepId=102297.52
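One way to test the GPU-binding suspicion is to have each rank print the devices it can see before running the real executable. The following is only a sketch, not part of the original reproducer: `report_binding` is a hypothetical helper name, and it assumes that Slurm exports `SLURM_PROCID` and `SLURM_LOCALID` for each task and that `ROCR_VISIBLE_DEVICES` is set when GPU binding is requested.

```shell
#!/bin/bash
# Hypothetical diagnostic: report which GPU(s) each MPI rank can see.
# Assumption: SLURM_PROCID / SLURM_LOCALID are exported per task by Slurm,
# and ROCR_VISIBLE_DEVICES is set by the launcher when GPU binding is used.
report_binding() {
  local rank="${SLURM_PROCID:-unset}"
  local local_rank="${SLURM_LOCALID:-unset}"
  local gpus="${ROCR_VISIBLE_DEVICES:-unset}"
  # One line per rank so the output can be sorted and compared.
  echo "host=$(uname -n) rank=${rank} local_rank=${local_rank} ROCR_VISIBLE_DEVICES=${gpus}"
}

report_binding
```

Saved under a name such as binding_check.sh (hypothetical) and launched with the same srun flags as the failing case, this prints one line per rank. Two ranks reporting the same device, or any rank reporting "unset", would point toward a binding problem rather than a problem in the physics code.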

May 17 '22 19:05 xyuan

@xyuan since this configuration is all in an experimental phase and the machine isn't available to most, you're going to need more details about the reproducer, including the branch.

May 18 '22 14:05 whannah1

@xyuan since this configuration is all in an experimental phase and the machine isn't available to most, you're going to need more details about the reproducer, including the branch.

Yes. This issue is required for HPE and AMD, since this is the place where they can access and interact with E3SM developers.

May 18 '22 15:05 xyuan

Please use this branch to reproduce the error: https://github.com/xyuan/e3sm_p3_shoc/tree/e3sm_p3_shoc_hip

The steps to run the standalone version of the CRM code are:
1) git checkout the code
2) go to components/eam/src/physics/crm/samxx/test/build
3) source crusher_gpu.sh
4) run cmakescript_hip.sh crm2d_32x1x8_nrad4x1x1_1024.nc crm3d_8x8x8_nrad4x1x1_1024.nc
5) go to cpp2d and type make to build the 2D test case
6) run with srun -N1 -n4 -c1 --gpus-per-task=1 --gpu-bind=closest ./cpp2d
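To compare the 4-GCD and 8-GCD cases with otherwise identical flags, the run command in step 6 can be generated by a small helper. This is a sketch only: `build_srun_cmd` is a hypothetical name, it reuses only the srun flags quoted in step 6, and it assumes the `cpp2d` executable is in the current directory.

```shell
#!/bin/bash
# Hypothetical helper: print the srun command line for a given rank count.
# All flags are copied from the reproduction step; only -n changes.
build_srun_cmd() {
  local nranks="$1"
  # Reject empty, non-numeric, or zero input.
  case "$nranks" in
    ''|*[!0-9]*|0) echo "usage: build_srun_cmd <nranks>" >&2; return 1 ;;
  esac
  echo "srun -N1 -n${nranks} -c1 --gpus-per-task=1 --gpu-bind=closest ./cpp2d"
}

# Print the command for the 4-GCD and 8-GCD cases described in this thread.
build_srun_cmd 4
build_srun_cmd 8
```

The helper only prints the commands, so they can be inspected or pasted into a batch script before anything is launched.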

May 23 '22 18:05 xyuan

Status: From our meeting with HPE this week, we have decided to table this issue as there are known workarounds. Will revisit as needed.

Jun 29 '22 21:06 sarats