CTSM
Test SMS_D_P1.ne3pg3_ne3pg3_mt232.IHistClm60Sp.izumi_gnu FAILS at runtime
Brief summary of bug
This was originally detected using a CAM test, SMS_D_Ln9_P1.ne3pg3_ne3pg3_mt232.FHISTC_LTso.izumi_gnu.cam-outfrq9s. Brian Eaton observed that it passes if you make it a QPC6 test (removing CTSM), so I created the CTSM test and found that it fails in the same way. It runs if you use P2 instead of P1. We need P1 here rather than Mmpi-serial because cam7 physics is no longer supported by mpi-serial; we decided to run MPI on a single task rather than extend mpi-serial, in the hope that we can eventually deprecate mpi-serial entirely.
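For readers less familiar with CIME test names, here is a minimal shell sketch that splits the failing test name into its dot-separated fields and builds the P2 variant that runs. It assumes the usual CIME layout (test type and modifiers, grid, compset, machine_compiler, optional testmods); the variable names are illustrative only and are not part of CIME.

```shell
# Failing test name from this issue.
test_name="SMS_D_P1.ne3pg3_ne3pg3_mt232.IHistClm60Sp.izumi_gnu"

# Peel off one dot-separated field at a time (POSIX parameter expansion).
test_spec="${test_name%%.*}"    # test type plus modifiers
rest="${test_name#*.}"
grid="${rest%%.*}"
rest="${rest#*.}"
compset="${rest%%.*}"
rest="${rest#*.}"
mach_comp="${rest%%.*}"         # any trailing testmods field is dropped here

# The first field is the test type followed by underscore-separated modifiers.
test_type="${test_spec%%_*}"
modifiers="${test_spec#*_}"

# The last underscore separates machine from compiler.
machine="${mach_comp%_*}"
compiler="${mach_comp##*_}"

# Swap the single-task PE layout (P1) for the two-task layout (P2).
p2_name="$(printf '%s\n' "$test_name" | sed 's/_P1\./_P2./')"

echo "type=$test_type modifiers=$modifiers grid=$grid compset=$compset"
echo "machine=$machine compiler=$compiler"
echo "P2 variant: $p2_name"
```

Here `D` requests a debug build and `P1` requests a single-task PE layout, which is the configuration that triggers the crash.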
General bug information
CTSM version you are using: ctsm5.3.029
Does this bug cause significantly incorrect results in the model's science? No
Configurations affected: System - izumi (the same test appears to run correctly on derecho)
Details of bug
Important details of your setup / configuration so we can reproduce the bug
Important output or errors that show the problem
Thanks @jedwards4b. I'll start looking into this.
I have determined that the code is crashing in https://github.com/ESCOMP/CTSM/blob/master/src/main/clm_instMod.F90#L373
I have not conclusively proven this, but I think the problem may be in ESMF: we are using ESMF 8.6.1 on izumi, while derecho is using 8.8.1. I have traced the problem to the init method in dshr_strdata_mod.F90, which includes a number of calls into the ESMF library. Still investigating.
https://github.com/ESCOMP/CDEPS/blob/main/streams/dshr_strdata_mod.F90#L438
I wrote the following to [email protected]:
I have a code that is crashing on Izumi in a call to ESMF_MeshCreate.
It fails when I use only 1 MPI task but passes when I use 2 or more tasks. I think it also works with the serial MPI implementation, but we are trying to get away from that and always use MPI. At first I thought it might be due to using an older version of ESMF (8.6.1), but I built 8.8.1 today and found that it has the same problem. The mesh file is for a 0.25x0.25 grid, so it's fairly high resolution, but I think that even though I am only using a single task I have access to the memory of the entire node. The file (on izumi) is /fs/cgd/csm/inputdata/lnd/clm2/dustemisdata/dust_0.25x0.25_ESMFmesh_cdf5_c240222.nc
Any suggestions?
After updating to the following environment:
# This file is for user convenience only and is not used by the model
# Changes to this file will be ignored and overwritten
# Changes to the environment should be made in env_mach_specific.xml
# Run ./case.setup --reset to regenerate this file
. /usr/share/Modules/init/sh
module purge
module load lang/python/3.11.5
module use /fs/cgd/data0/modules/modulefiles
module load compiler/gnu/12.4.0 tool/netcdf/4.9.3/gnu/12.4.0 mpi/2.3.7/gnu/12.4.0 mvapich2/2.3.7/gnu/12.4.0/parallelio/2.6.6 esmfpkgs/gfortran/12.4.0/esmf-8.8.1-ncdfio-mvapich2-O
export OMP_STACKSIZE=64M
export PATH=/project/esmf/PROGS/esmf/8.8.1/mvapich2/2.3.7/gfortran/12.4.0/bin/binO/Linux.gfortran.64.mvapich2.default:/usr/local/hdf5-1.14.6-gcc-g++-gfortran-12.4.0/bin:/usr/local/netcdf-c-4.9.3-f-4.6.2-gcc-g++-gfortran-12.4.0/bin:/cluster/mvapich2-2.3.7-gcc-g++-gfortran-12.4.0/bin:/usr/local/gcc-g++-gfortran-12.4.0/bin:/cluster/anaconda-3.11.5/bin:/usr/share/Modules/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/home/jedwards/bin:/cluster/torque/bin:/cluster/torque/bin
I am able to run this test successfully.
This change is currently in https://github.com/jedwards4b/ccs_config_cesm.git branch izumi_gnu_update and will be in an upcoming PR.
The ccs_config changes for this are in:
https://github.com/ESMCI/ccs_config_cesm/pull/242
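Since the ESMF version was one of the suspects in this thread, it can help to confirm which version an environment selects. The following is a rough sketch that pulls the version out of the ESMF module name from the environment listing above. It assumes the module's last path component has the form `esmf-<version>-...`, as it does in that listing; other sites may name their modules differently.

```shell
# ESMF module name copied from the "module load" line in the listing above.
esmf_module="esmfpkgs/gfortran/12.4.0/esmf-8.8.1-ncdfio-mvapich2-O"

# Keep only the last path component, e.g. esmf-8.8.1-ncdfio-mvapich2-O.
leaf="${esmf_module##*/}"

# Drop the "esmf-" prefix, then everything from the next hyphen onward.
esmf_version="${leaf#esmf-}"
esmf_version="${esmf_version%%-*}"

echo "ESMF version selected by module: $esmf_version"
```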
The one ne3pg3 test we have is:
SMS_Ln9.ne3pg3_ne3pg3_mt232.I2000Clm60Sp.derecho_gnu.clm-clm60cam7LndTuningMode--clm-nofireemis
Since this is now becoming more important in CAM testing, we should add more tests for it and include one or more on izumi. CAM includes testing for this on izumi_nag and izumi_gnu.
Adding "next" to assess whether we should add any izumi_gnu tests to our aux_clm testlist.
In the meeting we agreed not to add izumi_gnu tests, but we will add izumi_nag tests. We will also bring this up in CSEG.
OK, we still need to update to a later version of ccs_config for this to work.
There is CESM alpha testing going on, and we should update the cime and ccs_config versions in that, once that's completed. I'll create a subissue for doing that.
We discussed this at the CTSM SE meeting this morning. We have ccs_config updates planned to come in. I'll create a subissue for adding the tests we proposed.