Test SMS_D_P1.ne3pg3_ne3pg3_mt232.IHistClm60Sp.izumi_gnu FAILS at runtime

Open jedwards4b opened this issue 5 months ago • 6 comments

Brief summary of bug

This was originally detected using the CAM test SMS_D_Ln9_P1.ne3pg3_ne3pg3_mt232.FHISTC_LTso.izumi_gnu.cam-outfrq9s. Brian Eaton observed that if you make this a QPC6 test (removing CTSM), it passes. So I created the CTSM test and found that it fails in the same way. It runs if you use P2 instead of P1. We need to use P1 here instead of Mmpi-serial because cam7 physics is no longer supported by mpi-serial, and we decided to run MPI on a single task rather than extend mpi-serial, in the hope that we can eventually deprecate mpi-serial fully.
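For readers comparing the failing and passing variants, here is a minimal sketch that splits a test name of this form into its parts so the task-count modifier (P1 vs P2) is easy to pick out. It assumes the usual CIME layout TESTTYPE[_MODIFIERS].GRID.COMPSET.MACHINE_COMPILER[.TESTMODS]; the names CimeTestName, parse_test_name, and task_modifier are hypothetical helpers for illustration, not part of CIME.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CimeTestName:
    """Pieces of a test name (hypothetical container, not a CIME class)."""
    test_type: str
    modifiers: List[str]
    grid: str
    compset: str
    machine: str
    compiler: str
    testmods: Optional[str]


def parse_test_name(name: str) -> CimeTestName:
    """Split a name assumed to follow TYPE[_MODS].GRID.COMPSET.MACH_COMPILER[.TESTMODS]."""
    parts = name.split(".")
    if len(parts) < 4:
        raise ValueError(f"expected at least 4 dot-separated fields: {name!r}")
    # The first field holds the test type followed by underscore-separated modifiers.
    test_type, *modifiers = parts[0].split("_")
    # The compiler is assumed to be the last underscore-separated token.
    machine, _, compiler = parts[3].rpartition("_")
    # Anything after the fourth field is treated as the testmods string.
    testmods = ".".join(parts[4:]) or None
    return CimeTestName(test_type, modifiers, parts[1], parts[2],
                        machine, compiler, testmods)


def task_modifier(test: CimeTestName) -> Optional[str]:
    """Return the task-layout modifier (e.g. 'P1', 'P2x1') if one is present."""
    for mod in test.modifiers:
        if re.fullmatch(r"P\d+(x\d+)?", mod):
            return mod
    return None


if __name__ == "__main__":
    failing = parse_test_name("SMS_D_P1.ne3pg3_ne3pg3_mt232.IHistClm60Sp.izumi_gnu")
    print(failing.machine, failing.compiler, task_modifier(failing))
```

In this sketch the failing case reports P1; swapping P1 for P2 in the name gives the variant reported above as running.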

General bug information

CTSM version you are using: ctsm5.3.029

Does this bug cause significantly incorrect results in the model's science? No

Configurations affected: System - izumi (the same test appears to run correctly on derecho)

Details of bug

Important details of your setup / configuration so we can reproduce the bug

Important output or errors that show the problem

jedwards4b avatar Jun 04 '25 19:06 jedwards4b

Thanks @jedwards4b. I'll start looking into this.

ekluzek avatar Jun 04 '25 19:06 ekluzek

I have determined that the code is crashing in https://github.com/ESCOMP/CTSM/blob/master/src/main/clm_instMod.F90#L373

jedwards4b avatar Jun 04 '25 19:06 jedwards4b

I have not conclusively proven this, but I think the problem may be in ESMF: we are using ESMF 8.6.1 on izumi, while derecho is using 8.8.1. I have traced the problem to the init method in dshr_strdata_mod.F90, which includes a number of calls into the ESMF library. Still investigating.

jedwards4b avatar Jun 04 '25 20:06 jedwards4b

https://github.com/ESCOMP/CDEPS/blob/main/streams/dshr_strdata_mod.F90#L438

jedwards4b avatar Jun 04 '25 20:06 jedwards4b

I wrote the following to [email protected]:

I have a code that is crashing on Izumi in a call to ESMF_MeshCreate.
It fails when I use only 1 MPI task, but passes when I use 2 or more tasks. I think that it also works if I use the serial MPI implementation, but we are trying to get away from that and just always use MPI. At first I thought that it might be due to using an older version of ESMF (8.6.1), but I built 8.8.1 today and found that it has the same problem.

The mesh file is for a 0.25x0.25 grid so it's fairly high resolution, but I think that even though I am only using a single task I have access to the memory of the entire node. The file (on izumi) is /fs/cgd/csm/inputdata/lnd/clm2/dustemisdata/dust_0.25x0.25_ESMFmesh_cdf5_c240222.nc

Any suggestions?

jedwards4b avatar Jun 06 '25 20:06 jedwards4b

After updating to

# This file is for user convenience only and is not used by the model
# Changes to this file will be ignored and overwritten
# Changes to the environment should be made in env_mach_specific.xml
# Run ./case.setup --reset to regenerate this file
. /usr/share/Modules/init/sh
module purge 
module load lang/python/3.11.5
module use /fs/cgd/data0/modules/modulefiles
module load compiler/gnu/12.4.0 tool/netcdf/4.9.3/gnu/12.4.0 mpi/2.3.7/gnu/12.4.0 mvapich2/2.3.7/gnu/12.4.0/parallelio/2.6.6 esmfpkgs/gfortran/12.4.0/esmf-8.8.1-ncdfio-mvapich2-O
export OMP_STACKSIZE=64M
export PATH=/project/esmf/PROGS/esmf/8.8.1/mvapich2/2.3.7/gfortran/12.4.0/bin/binO/Linux.gfortran.64.mvapich2.default:/usr/local/hdf5-1.14.6-gcc-g++-gfortran-12.4.0/bin:/usr/local/netcdf-c-4.9.3-f-4.6.2-gcc-g++-gfortran-12.4.0/bin:/cluster/mvapich2-2.3.7-gcc-g++-gfortran-12.4.0/bin:/usr/local/gcc-g++-gfortran-12.4.0/bin:/cluster/anaconda-3.11.5/bin:/usr/share/Modules/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/home/jedwards/bin:/cluster/torque/bin:/cluster/torque/bin

I am able to run this test successfully.

This change is currently in https://github.com/jedwards4b/ccs_config_cesm.git branch izumi_gnu_update and will be in an upcoming PR.

jedwards4b avatar Jun 16 '25 18:06 jedwards4b

The ccs_config changes for this are in:

https://github.com/ESMCI/ccs_config_cesm/pull/242

The one ne3pg3 test we have is:

SMS_Ln9.ne3pg3_ne3pg3_mt232.I2000Clm60Sp.derecho_gnu.clm-clm60cam7LndTuningMode--clm-nofireemis

Since this is now becoming more important in CAM testing, we should add more tests for it and include one or more on izumi. CAM includes testing for this on izumi_nag and izumi_gnu.

ekluzek avatar Jul 03 '25 05:07 ekluzek

Adding next to assess if we should add any izumi_gnu tests to our aux_clm testlist.

ekluzek avatar Jul 03 '25 05:07 ekluzek

In the meeting we agreed not to add izumi_gnu, but we will add izumi_nag. We will also bring this up in CSEG.

ekluzek avatar Jul 03 '25 16:07 ekluzek

OK, we still need to update to a later version of ccs_config for this to work.

There is CESM alpha testing going on, and we should update the cime and ccs_config versions in that, once that's completed. I'll create a subissue for doing that.

ekluzek avatar Aug 13 '25 19:08 ekluzek

We discussed this at the CTSM SE meeting this morning. We have ccs_config updates planned to come in. I'll create a subissue for adding the tests we proposed.

ekluzek avatar Aug 22 '25 16:08 ekluzek