NEON test failing on izumi in ctsm5.1.dev095
Brief summary of bug
The NEON test is failing on izumi starting in ctsm5.1.dev095 with externals updates.
General bug information
CTSM version you are using: ctsm5.1.dev095
Does this bug cause significantly incorrect results in the model's science? No
Configurations affected: NEON test case
Details of bug
SMS_D_Mmpi-serial.CLM_USRDAT.I1PtClm51Bgc.izumi_nag.clm-default--clm-NEON-NIWO
Important output or errors that show the problem
There is nothing in the cesm.log file. The lnd.log and med.log terminate early in initialization. The batch output says the job ended on signal 15.
@negin513 pointed out to me that perhaps @briandobbins' changes need to come in here. And indeed, when I set
./xmlchange PIO_REARRANGER=2
the test passes. I think this change should be put into the NEON user-mods so that it will apply whether or not run_neon.py is used.
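For example, the user-mods entry could be as simple as appending the xmlchange call to a shell_commands file (a sketch; the directory shown is my guess at where the shared NEON user-mods live, so adjust as needed):

# Sketch: add the PIO rearranger setting to the shared NEON user-mods so
# every NEON case picks it up, whether or not run_neon.py is used
# (directory path below is an assumption)
cat >> cime_config/usermods_dirs/NEON/defaults/shell_commands << 'EOF'
./xmlchange PIO_REARRANGER=2
EOF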
This also seems to be needed for single-point cases in general. I wasn't getting many failures, but a single-point case (generic, using subsetted global DATM data) that I am running is taking an incredibly long time to run.
In chatting with @briandobbins, we found it is going quite fast during the actual simulation (~800 simulated years per day) but slows way down when it writes restart files (~4 simulated years per day).
Brian suggested updating the same PIO_REARRANGER_LND parameter and I have done this. Theoretically the next time the case submits it should go faster so I can report back on whether this fixed my issue.
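For anyone following along, this is just the usual xml change in the case directory (a sketch of the commands):

# check the current value, switch the land component's PIO rearranger,
# then resubmit the case as usual
./xmlquery PIO_REARRANGER_LND
./xmlchange PIO_REARRANGER_LND=2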
I can confirm it is much faster after I added this. So I think we need to add this to all single-point runs?
It seems, per another thread, there's some uncertainty about this -- I'm going to look into it next week in more detail. Did it actually make a big difference in your runs, Adrianna? From what I saw, it looked like it didn't change much, per the timing files.
Anyway, let's talk about it, as there's some confusion about this.
- Brian
I am looking into this now. Seems to be a memory corruption issue.
I've found the error in the NEON test. For NEON I created reduced-size lnfm (lightning) files to reduce the amount of data transfer required for the NEON datasets:
/glade/p/cesmdata/cseg/inputdata/atm/datm7/NASA_LIS/clmforc.Li_2016_climo1995-2013.360x720.lnfm_Total_NEONarea_c210625.nc
However, the mesh file used is a full global mesh:
/glade/p/cesmdata/cseg/inputdata/atm/datm7/NASA_LIS/clmforc.Li_2016_climo1995-2013.360x720_ESMFmesh_cdf5_150621.nc
You need to either generate a reduced mesh file or use the original lightning file.
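For reference, the mismatch is easy to see by comparing the data file's grid dimensions with the mesh file's element count (a quick sketch with ncdump; the dimension names are what I would expect in these files, so adjust if they differ):

# grid dimensions of the reduced lightning (lnfm) data file
ncdump -h clmforc.Li_2016_climo1995-2013.360x720.lnfm_Total_NEONarea_c210625.nc | grep -iE 'lat|lon'
# element count of the mesh file; for a matching ESMF mesh this should equal
# nlat*nlon of the data file, but here the mesh covers the full global grid
ncdump -h clmforc.Li_2016_climo1995-2013.360x720_ESMFmesh_cdf5_150621.nc | grep elementCount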
We should do better error checking here: https://github.com/ESCOMP/CDEPS/issues/164
This came up again, but I see that @jedwards4b gave us the solution of getting the right mesh file to go with the reduced lightning file. We do have a script to modify mesh files for a new grid now. A future version of CDEPS will have better error checking and notice this problem, but we aren't there yet, and we'll still need a new mesh file to use.
OK, now I see this thread... can we generate this mesh file with the tools we have now, @ekluzek? @negin513 can't even run 200 years of AD spinup without running out of wall-clock time, which seems pretty slow.
@wwieder yes, I should be able to figure out the mesh tools and modify the mesh as needed for the lightning file. I'll start working on that as part of the sparse grid work.
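In case it helps, here is roughly the kind of invocation I have in mind with the mesh tools (the script path and options are assumptions on my part, so treat this as a sketch and check the tool's --help first):

# Hypothetical: build a mesh that matches the reduced lnfm file's grid
# (script location, flag names, and output file name are assumptions)
./tools/site_and_regional/mesh_maker \
    --input clmforc.Li_2016_climo1995-2013.360x720.lnfm_Total_NEONarea_c210625.nc \
    --lat lat --lon lon \
    --output clmforc.Li_2016_climo1995-2013.lnfm_Total_NEONarea_ESMFmesh.nc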
This is working now, and we don't have to set the PIO rearranger anymore. And we have the mesh file in place. So I think we've resolved everything around this.