DEBUG build fails when running a longer `ne30pg2_ne30pg2.F2010-SCREAMv1` case
In this case, after 39 days, I get the following failure:
43: corrupted size vs. prev_size
43:
43: Program received signal SIGABRT: Process abort signal.
43:
43: Backtrace for this error:
43: #0 0x148582aac862 in ???
43: #1 0x148582aab8f5 in ???
43: #2 0x14858253cd6f in ???
43: #3 0x14858253ccdb in ???
43: #4 0x14858253e394 in ???
43: #5 0x148582582c37 in ???
43: #6 0x14858258acd9 in ???
43: #7 0x14858258b5a5 in ???
43: #8 0x14858258b722 in ???
43: #9 0x14858258dd3f in ???
43: #10 0x14858258f787 in ???
43: #11 0xe997ad in __canopyhydrologymod_MOD_canopyhydrology
43: at /dvs_ro/cfs/cdirs/e3sm/ndk/repos/c28-oct15/components/elm/src/biogeophys/CanopyHydrologyMod.F90:157
43: #12 0x6cf782 in __elm_driver_MOD_elm_drv._omp_fn.4
43: at /dvs_ro/cfs/cdirs/e3sm/ndk/repos/c28-oct15/components/elm/src/main/elm_driver.F90:1361
43: #13 0x14858fd30c4d in ???
43: #14 0x14858574f6e9 in ???
43: #15 0x14858260a53e in ???
This is a standard ne30 case with a different START_DATE and perturbation than the default settings. What's odd is that it fails only after quite a bit of simulation. In other DEBUG testing with a similar setup, I've seen 4 different error messages (all appear to be in LND or ICE). All cases run fine with an OPT build.
I don't have a way of reproducing this with a test string (as I need to change the start date and perturbation), but the case is here:
/pscratch/sd/n/ndk/e3sm_scratch/pm-gpu/c28-oct15/p.ne30pg2_ne30pg2.F2010-SCREAMv1.c28-oct15.1y.n0016.pert01.DEBUG
Note this is all with 128 vertical levels.
Does this run ok with EAM instead of EAMxx?
I suspect you are asking not only whether this issue affects our production EAM cases, but also whether we can isolate another component. However, it's not clear how to test this in the same way by just switching EAMxx to EAM, as the resolution is different and there may not be an equivalent way of adding perturbations. I could still try running some longer EAM-based ne30 cases in DEBUG.
Here is the launch script.
I accidentally removed the original case noted above, but just launched again (same dir name). It fails, but with a different stack trace.
51: corrupted size vs. prev_size
51:
51: Program received signal SIGABRT: Process abort signal.
51:
51: Backtrace for this error:
51: #0 0x150f4a6ac862 in ???
51: #1 0x150f4a6ab8f5 in ???
51: #2 0x150f4a288d6f in ???
51: #3 0x150f4a288cdb in ???
51: #4 0x150f4a28a394 in ???
51: #5 0x150f4a2cec37 in ???
51: #6 0x150f4a2d6cd9 in ???
51: #7 0x150f4a2d75a5 in ???
51: #8 0x150f4a2d7722 in ???
51: #9 0x150f4a2d9d3f in ???
51: #10 0x150f4a2db787 in ???
51: #11 0x150f4a6abbb8 in ???
51: #12 0x150f4a917937 in ???
51: #13 0x150f4a928db7 in ???
51: #14 0x4cd7792 in __shr_strconvert_mod_MOD_i4tostring
51: at /dvs_ro/cfs/cdirs/e3sm/ndk/repos/c28-oct15/share/util/shr_strconvert_mod.F90:74
51: #15 0x4c37397 in __shr_log_mod_MOD_shr_log_errmsg
51: at /dvs_ro/cfs/cdirs/e3sm/ndk/repos/c28-oct15/share/util/shr_log_mod.F90:78
51: #16 0x179d2ae in __dynsubgridcontrolmod_MOD_get_for_testing_zero_dynbal_fluxes
51: at /dvs_ro/cfs/cdirs/e3sm/ndk/repos/c28-oct15/components/elm/src/dyn_subgrid/dynSubgridControlMod.F90:330
51: #17 0x1e28ebe in __dynconsbiogeophysmod_MOD_dyn_hwcontent_final
51: at /dvs_ro/cfs/cdirs/e3sm/ndk/repos/c28-oct15/components/elm/src/dyn_subgrid/dynConsBiogeophysMod.F90:159
51: #18 0x179f59c in __dynsubgriddrivermod_MOD_dynsubgrid_driver._omp_fn.1
51: at /dvs_ro/cfs/cdirs/e3sm/ndk/repos/c28-oct15/components/elm/src/dyn_subgrid/dynSubgridDriverMod.F90:322
It could be that a value goes bad and is caught in different places (though I would expect that to be more reproducible). Or it could be a generic memory corruption problem that shows up in a few places (and is not always exactly reproducible).
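To compare where the different runs abort, it can help to reduce each log to the innermost frame that has a symbol and a source location. The following is a minimal sketch, assuming the log lines keep the `<rank>: #<n> 0x<addr> in <symbol>` / `<rank>: at <path>:<line>` layout shown in the traces above; the function name `first_symbolized_frames` is made up for this illustration, and the sketch assumes that each `at` line directly follows its frame line.

```python
import re
from collections import OrderedDict

# Frame line as it appears in the logs above: "<rank>: #<n> 0x<addr> in <symbol>"
FRAME = re.compile(r"^\s*(\d+):\s+#(\d+)\s+0x[0-9a-fA-F]+\s+in\s+(\S+)\s*$")
# Location line that follows a symbolized frame: "<rank>: at <path>:<line>"
LOCATION = re.compile(r"^\s*(\d+):\s+at\s+(\S+):(\d+)\s*$")


def first_symbolized_frames(log_text):
    """Return {rank: (frame_no, symbol, path, line)} for the innermost
    frame of each rank that has both a symbol and a source location."""
    results = OrderedDict()
    pending = None  # (rank, frame_no, symbol) waiting for its "at" line
    for raw in log_text.splitlines():
        m = FRAME.match(raw)
        if m:
            rank, frame_no, symbol = m.group(1), int(m.group(2)), m.group(3)
            # Unresolved frames are printed as "???"; skip those.
            pending = None if symbol == "???" else (rank, frame_no, symbol)
            continue
        m = LOCATION.match(raw)
        if m and pending and m.group(1) == pending[0]:
            rank, frame_no, symbol = pending
            # setdefault keeps only the first (innermost) hit per rank.
            results.setdefault(
                rank, (frame_no, symbol, m.group(2), int(m.group(3)))
            )
        pending = None
    return results


# Excerpt copied from the first trace in this issue.
sample = """\
43: #10 0x14858258f787 in ???
43: #11 0xe997ad in __canopyhydrologymod_MOD_canopyhydrology
43: at /dvs_ro/cfs/cdirs/e3sm/ndk/repos/c28-oct15/components/elm/src/biogeophys/CanopyHydrologyMod.F90:157
43: #12 0x6cf782 in __elm_driver_MOD_elm_drv._omp_fn.4
43: at /dvs_ro/cfs/cdirs/e3sm/ndk/repos/c28-oct15/components/elm/src/main/elm_driver.F90:1361
"""

for rank, (n, sym, path, line) in first_symbolized_frames(sample).items():
    print(f"rank {rank}: frame #{n} {sym} ({path.rsplit('/', 1)[-1]}:{line})")
```

Running this over the logs of several failed cases gives one line per aborting rank, which makes it easier to see whether the abort sites cluster in a few routines or are scattered, as one might expect from heap corruption.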
The current flags for DEBUG are -O0 -g -fbacktrace -fcheck=bounds -ffpe-trap=zero,overflow
I tried rebuilding without the additional checks, and it still fails, i.e. DEBUG with only
-O0 -g -fbacktrace -g
34: corrupted size vs. prev_size
34:
34: Program received signal SIGABRT: Process abort signal.
34:
34: Backtrace for this error:
34: #0 0x14d1940ac862 in ???
34: #1 0x14d1940ab8f5 in ???
34: #2 0x14d193b3cd6f in ???
34: #3 0x14d193b3ccdb in ???
34: #4 0x14d193b3e394 in ???
34: #5 0x14d193b82c37 in ???
34: #6 0x14d193b8acd9 in ???
34: #7 0x14d193b8b5a5 in ???
34: #8 0x14d193b8b722 in ???
34: #9 0x14d193b8dd3f in ???
34: #10 0x14d193b8f787 in ???
34: #11 0x9b0f5e in __canopyhydrologymod_MOD_canopyhydrology
34: at /dvs_ro/cfs/cdirs/e3sm/ndk/repos/c28-oct15/components/elm/src/biogeophys/CanopyHydrologyMod.F90:157
34: #12 0x5e2b21 in __elm_driver_MOD_elm_drv._omp_fn.4
34: at /dvs_ro/cfs/cdirs/e3sm/ndk/repos/c28-oct15/components/elm/src/main/elm_driver.F90:704
34: #13 0x14d1a13a6c4d in ???
/pscratch/sd/n/ndk/e3sm_scratch/pm-gpu/c28-oct15/p.ne30pg2_ne30pg2.F2010-SCREAMv1.c28-oct15.1y.n0016.pert01.DEBUG.nochecknotrap
Then I tried adding more checks with -fcheck=all. I do see some warnings (can anyone tell if these are of concern?), but it fails with a similar error, pasted below.
61: At line 515 of file /dvs_ro/cfs/cdirs/e3sm/ndk/repos/c28-oct15/components/homme/src/share/scalable_grid_init_mod.F90
61: Fortran runtime warning: An array temporary was created for argument 'pos' of procedure 'sfcpos2ui'
61: At line 251 of file /dvs_ro/cfs/cdirs/e3sm/ndk/repos/c28-oct15/components/homme/src/share/cube_mod.F90
61: Fortran runtime warning: An array temporary was created for argument 'd' of procedure 'dmap'
61: At line 489 of file /dvs_ro/cfs/cdirs/e3sm/ndk/repos/c28-oct15/components/elm/src/main/lnd2atmMod.F90
61: Fortran runtime warning: An array temporary was created for argument 'tsoil_' of procedure 'avg_tsoil_surf'
61: At line 490 of file /dvs_ro/cfs/cdirs/e3sm/ndk/repos/c28-oct15/components/elm/src/main/lnd2atmMod.F90
61: Fortran runtime warning: An array temporary was created for argument 'tsoil_' of procedure 'avg_tsoil'
61: At line 3510 of file /dvs_ro/cfs/cdirs/e3sm/ndk/repos/c28-oct15/components/elm/src/main/histFileMod.F90
61: Fortran runtime warning: An array temporary was created
eventually failing with
62: #0 0x152fb46ac862 in ???
62: #1 0x152fb46ab8f5 in ???
62: #2 0x152fb413cd6f in ???
62: #3 0x152fb413ccdb in ???
62: #4 0x152fb413e394 in ???
62: #5 0x152fb4182c37 in ???
62: #6 0x152fb418acd9 in ???
62: #7 0x152fb418b5a5 in ???
62: #8 0x152fb418e39d in ???
62: #9 0x152fb418f787 in ???
62: #10 0x1073dee in __snowsnicarmod_MOD_snicar_ad_rt
62: at /dvs_ro/cfs/cdirs/e3sm/ndk/repos/c28-oct15/components/elm/src/biogeophys/SnowSnicarMod.F90:1888
62: #11 0x1232383 in __surfacealbedomod_MOD_surfacealbedo
62: at /dvs_ro/cfs/cdirs/e3sm/ndk/repos/c28-oct15/components/elm/src/biogeophys/SurfaceAlbedoMod.F90:612
62: #12 0x6ed253 in __elm_driver_MOD_elm_drv._omp_fn.4
62: at /dvs_ro/cfs/cdirs/e3sm/ndk/repos/c28-oct15/components/elm/src/main/elm_driver.F90:1361
62: #13 0x152fbe1b0c4d in ???
62: #14 0x152fb76e86e9 in ???
I also tried the newer GNU flag -fsanitize=address, but I see link errors. I have not experimented with these flags.
Building with -O1 instead of -O0 allows this case to complete 1 year.
-O1 -g -fbacktrace -fcheck=all -ffpe-trap=zero,overflow
However, since the -O1 build is not BFB (bit-for-bit) with -O0, it could be that the code path taken when -O0 is used is the issue.
Due to our makefile mess, many Fortran files are still built with -O0, as different places add different flags.
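One way to see which files actually end up at -O0 is to scan the compile commands in the build log. With GCC, when several -O options appear on one command line the last one takes effect, and no -O option at all means -O0. The following is a rough sketch under the assumption that the build log contains one full compile command per line; the helper names and the two sample commands are invented for illustration and are not taken from the real build.

```python
import re

# Matches -O, -O0..-O9, -Os, -Og, -Oz, -Ofast (but not the lowercase -o output flag).
OPT_FLAG = re.compile(r"^-O(\d|s|g|z|fast)?$")
FORTRAN_SUFFIXES = (".F90", ".f90", ".F", ".f")


def effective_opt_level(tokens):
    """Return the last -O flag among the tokens, or None if there is none.
    With GCC the last -O option on the command line is the effective one."""
    last = None
    for tok in tokens:
        if OPT_FLAG.match(tok):
            last = tok
    return last


def opt_levels_by_source(build_log_lines):
    """Map each Fortran source file name to its effective -O flag."""
    levels = {}
    for line in build_log_lines:
        tokens = line.split()
        sources = [t for t in tokens if t.endswith(FORTRAN_SUFFIXES)]
        if not sources:
            continue  # not a Fortran compile command
        level = effective_opt_level(tokens) or "(none, GCC default -O0)"
        for src in sources:
            levels[src.rsplit("/", 1)[-1]] = level
    return levels


# Hypothetical compile commands, only to show the "last flag wins" effect.
sample_log = [
    "gfortran -O1 -g -fbacktrace -fcheck=all -c elm_driver.F90 -o elm_driver.o",
    "gfortran -O1 -g -O0 -c CanopyHydrologyMod.F90 -o CanopyHydrologyMod.o",
]

for name, level in sorted(opt_levels_by_source(sample_log).items()):
    print(f"{name}: {level}")
```

Files that report -O0 despite the requested -O1 would be the ones where another part of the build appends its own flags later on the command line.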
I ran this case again with GNU 12.3, just to see if that had an impact, and I still see the error:
46: Backtrace for this error:
46: #0 0x14ef1beac862 in ???
46: #1 0x14ef1beab8f5 in ???
46: #2 0x14ef1b6b3d6f in ???
46: #3 0x14ef1b6b3cdb in ???
46: #4 0x14ef1b6b5394 in ???
46: #5 0x14ef1b6f9c37 in ???
46: #6 0x14ef1b701cd9 in ???
46: #7 0x14ef1b701fab in ???
46: #8 0x4d5d18f in __shr_log_mod_MOD_shr_log_errmsg
46: at /dvs_ro/cfs/cdirs/e3sm/ndk/repos/pr/ndkmf-perlmutter-revert-gnu-compiler-version/share/util/shr_log_mod.F90:78
46: #9 0x8177cd in setfiltersonegroup
46: at /dvs_ro/cfs/cdirs/e3sm/ndk/repos/pr/ndkmf-perlmutter-revert-gnu-compiler-version/components/elm/src/main/filterMod.F90:299
46: #10 0x81d0b4 in __filtermod_MOD_setfilters
46: at /dvs_ro/cfs/cdirs/e3sm/ndk/repos/pr/ndkmf-perlmutter-revert-gnu-compiler-version/components/elm/src/main/filterMod.F90:258
46: #11 0x923ea2 in __reweightmod_MOD_reweight_wrapup
46: at /dvs_ro/cfs/cdirs/e3sm/ndk/repos/pr/ndkmf-perlmutter-revert-gnu-compiler-version/components/elm/src/main/reweightMod.F90:53
46: #12 0x1836cbe in __dynsubgriddrivermod_MOD_dynsubgrid_wrapup_weight_changes
46: at /dvs_ro/cfs/cdirs/e3sm/ndk/repos/pr/ndkmf-perlmutter-revert-gnu-compiler-version/components/elm/src/dyn_subgrid/dynSubgridDriverMod.F90:403
46: #13 0x183785a in __dynsubgriddrivermod_MOD_dynsubgrid_driver._omp_fn.1
46: at /dvs_ro/cfs/cdirs/e3sm/ndk/repos/pr/ndkmf-perlmutter-revert-gnu-compiler-version/components/elm/src/dyn_subgrid/dynSubgridDriverMod.F90:312
46: #14 0x14ef291a1c4d in ???
46: #15 0x14ef1ef156e9 in ???
46: #16 0x14ef1b78153e in ???
/pscratch/sd/n/ndk/e3sm_scratch/pm-gpu/ndkmf-perlmutter-revert-gnu-compiler-version/p.ne30pg2_ne30pg2.F2010-SCREAMv1.ndkmf-perlmutter-revert-gnu-compiler-version.1y.n0016.pert01.DEBUG