CTSM
Update Externals.cfg to cesm2_3_beta17 and remove mct
Description of changes
Specifics are listed in these issues:
- #2493 Update externals to beta17
- #2294 Remove mct (but not entirely, so I'm removing this issue from the "fixed" list)
- #2546 Fix error in cam4/cam5 test (unrelated)
- #2279 Retire the /test/tools framework for CESM test system custom tests that do the same thing
Specific notes
Contributors other than yourself, if any: @ekluzek @jedwards4b @billsacks
CTSM Issues Fixed (include github issue #): Fixes #2493 Fixes #2546 Fixes #2279
Are answers expected to change (and if so in what way)? No
Any User Interface Changes (namelist or namelist defaults changes)? Yes, and it was done.
Does this create a need to change or add documentation? Did you do so? I don't think so.
Testing performed, if any: To play it safe, I will run the following tests:
- PASS ./build-namelist_test.pl
- PASS python tests -u and -s
- PASS make black (make lint gives a minor complaint but a perfect score)
- OK aux_clm on derecho (first-time result; see subsequent results below)
Next I will go through the checklist in #2294 and rerun tests.
git grep -i mct
returns very little stuff now.
Question: May I remove (or rename) this:
src/main/glc2lndMod.F90: procedure, public :: set_glc2lnd_fields_mct ! set coupling fields sent from glc to lnd
src/main/glc2lndMod.F90: subroutine set_glc2lnd_fields_mct(this, bounds, glc_present, x2l, &
src/main/glc2lndMod.F90: character(len=*), parameter :: subname = 'set_glc2lnd_fields_mct'
src/main/glc2lndMod.F90: end subroutine set_glc2lnd_fields_mct
aux_clm on derecho: FAIL. The two baseline diffs may be due to the new PE layouts, though the PE layouts differ for all the tests. The first three failures seem to be caused by the new externals. I will review these with Erik.
FAIL FUNITCTSM_P1x1.f10_f10_mg37.I2000Clm50Sp.derecho_intel RUN time=16
FAIL LILACSMOKE_D_Ld2.f10_f10_mg37.I2000Ctsm50NwpSpAsRs.derecho_intel.clm-lilac RUN time=0
FAIL SMS_D.f10_f10_mg37.I2000Clm60BgcCrop.derecho_nvhpc.clm-crop SETUP
FAIL SMS.f10_f10_mg37.I2000Clm50BgcCrop.derecho_nvhpc.clm-crop BASELINE ctsm5.2.004: DIFF
FAIL SMS.f45_f45_mg37.I2000Clm60FatesSpRsGs.derecho_nvhpc.clm-FatesColdSatPhen BASELINE ctsm5.2.004: DIFF
izumi: FAIL. Several nag tests fail with:
Runtime Error: /fs/cgd/data0/slevis/git/latest_master_new/components/cmeps/cime_config/../mediator/med_phases_restart_mod.F90, line 350: INTEGER(int32) overflow for 538976288 * 538976288
Program terminated by fatal error
/fs/cgd/data0/slevis/git/latest_master_new/components/cmeps/cime_config/../mediator/med_phases_restart_mod.F90, line 350: Error occurred in MED_PHASES_RESTART_MOD:MED_PHASES_RESTART_WRITE
/fs/cgd/data0/slevis/git/latest_master_new/components/cmeps/cime_config/../cesm/driver/esmApp.F90, line 141: Called by ESMAPP
[i027.cgd.ucar.edu:mpi_rank_0][error_sighandler] Caught error: Aborted (signal 6)
From meeting with @ekluzek
- For the SETUP failure above, bisect between ccs_config_cesm0.0.92 and 106 until I find where it first fails. Note:
- This test was expected to fail in SHAREDLIB_BUILD
- the SETUP error starts with version 99
0.0.92 FAIL SHAREDLIB_BUILD
0.0.98 FAIL same as previous
0.0.99 FAIL same as next
0.0.106 FAIL SETUP see /glade/work/slevis/git/latest_master/tests_0510-170901de/SMS_D.f10_f10_mg37.I2000Clm60BgcCrop.derecho_nvhpc.clm-crop.GC.0510-170901de_nvh/TestStatus.log
@fischer-ncar Erik suggested that I bring this to your attention and possibly open issues in ccs_config and ctsm. Let me know if my diagnosis does not provide enough info and/or whether you would like to discuss in a meeting. Feel free to contact me with questions.
UPDATE: This test now continues as it used to and fails in the SHAREDLIB_BUILD phase.
- Try similar approach for all the failures:
- On izumi it's probably cmeps:
cmeps0.14.50 PASS
cmeps0.14.51 FAIL new error:
Error: /fs/cgd/data0/slevis/git/latest_master_new/components/cmeps/cime_config/../mediator/med_phases_cdeps_mod.F90, line 10: Cannot find symbol ESMF_GRIDCOMPGETINTERNALSTATE in module ESMF
ERROR: BUILD FAIL: buildexe failed, cat /scratch/cluster/slevis/ERS_D_Ld5.f10_f10_mg37.I2000Clm50Fates.izumi_nag.clm-FatesCold.C.20240515_143052_sm2ymy/bld/cesm.bldlog.240515-145356
cmeps0.14.58 FAIL same as previous
cmeps0.14.59 FAIL same as next
cmeps0.14.60 FAIL error appears in the previous post.
- For FUNIT, I had to bring back a CMakeLists.txt file that I had removed
- For LILAC, thus far I have confirmed that it's not ccs_config (by running with version 99 as above) and that it's not cime (by running cime6.0.217_httpsbranch03). I backed out various changes and continued to get the same error. Erik suggested this, but I still got the same error. Jim E. posted that I should not have removed mct, so I will reverse that commit and try again. (An UPDATE about LILAC appears below, based on which version 99 here is not likely a typo, and I may have confused myself with these earlier tests.)
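The bisect approach described above can be sketched in a few lines. This is a minimal illustration, not the procedure actually scripted for this PR: `first_change` and its `outcome` callback are hypothetical names, and the lookup table simply replays the four ccs_config results reported above instead of running a test.

```python
from typing import Callable, Sequence


def first_change(versions: Sequence[int], outcome: Callable[[int], str]) -> int:
    """Return the first version whose outcome differs from versions[0].

    Assumes the outcome changes exactly once across the ordered list.
    """
    lo, hi = 0, len(versions) - 1
    baseline = outcome(versions[lo])
    if outcome(versions[hi]) == baseline:
        raise ValueError("outcome never changes in this range")
    # Invariant: versions[lo] matches the baseline, versions[hi] does not.
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if outcome(versions[mid]) == baseline:
            lo = mid
        else:
            hi = mid
    return versions[hi]


# Replay of the results reported in this thread (not a live test run).
reported = {92: "SHAREDLIB_BUILD", 98: "SHAREDLIB_BUILD", 99: "SETUP", 106: "SETUP"}
print(first_change(sorted(reported), reported.__getitem__))
```

In real use the `outcome` callback would check out the given tag and run the test, which is the expensive step; the bisection only reduces how many times that has to happen.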
@slevis-lmwg the SETUP failure looks like an issue with the debug libraries missing for ESMF and pio for nvhpc/24.3.
cannot be loaded as requested: "parallelio/2.6.2-debug", "esmf/8.6.0-debug"
@jedwards4b should these libraries be available, or do we need to switch to the non-debug versions?
@slevis-lmwg I am currently working on the parallelio-debug issue, hope to have a resolution soon.
@slevis-lmwg although you can remove cpl7 you cannot yet remove mct. However I don't expect you to remove any externals in this tag, I will remove them in the next tag.
Ok, thanks @jedwards4b I will put back mct then.
@slevis-lmwg the issue (cannot be loaded as requested: "parallelio/2.6.2-debug", "esmf/8.6.0-debug") should now be fixed, please try again.
I confirmed. Thanks @jedwards4b.
@jedwards4b @fischer-ncar another error that I would like to run by you... I will summarize my earlier posts here.
Izumi nag tests fail. I tried the "bisect" method with cmeps versions and got the following results:
cmeps0.14.50 PASS
cmeps0.14.51 FAIL with a different error than version cmeps0.14.60, so probably do NOT focus on this one for now?
Error: /fs/cgd/data0/slevis/git/latest_master_new/components/cmeps/cime_config/../mediator/med_phases_cdeps_mod.F90, line 10: Cannot find symbol ESMF_GRIDCOMPGETINTERNALSTATE in module ESMF
ERROR: BUILD FAIL: buildexe failed, cat /scratch/cluster/slevis/ERS_D_Ld5.f10_f10_mg37.I2000Clm50Fates.izumi_nag.clm-FatesCold.C.20240515_143052_sm2ymy/bld/cesm.bldlog.240515-145356
cmeps0.14.58 FAIL same as previous
cmeps0.14.59 FAIL same as next
cmeps0.14.60 FAIL with this error message:
Runtime Error: /fs/cgd/data0/slevis/git/latest_master_new/components/cmeps/cime_config/../mediator/med_phases_restart_mod.F90, line 350: INTEGER(int32) overflow for 538976288 * 538976288
Program terminated by fatal error
/fs/cgd/data0/slevis/git/latest_master_new/components/cmeps/cime_config/../mediator/med_phases_restart_mod.F90, line 350: Error occurred in MED_PHASES_RESTART_MOD:MED_PHASES_RESTART_WRITE
/fs/cgd/data0/slevis/git/latest_master_new/components/cmeps/cime_config/../cesm/driver/esmApp.F90, line 141: Called by ESMAPP
[i027.cgd.ucar.edu:mpi_rank_0][error_sighandler] Caught error: Aborted (signal 6)
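A side observation on the operand in the overflow message, offered as a hint rather than a diagnosis: 538976288 is 0x20202020, the bit pattern of four ASCII space characters. One possible reading (an assumption, not something confirmed in this thread) is that the integer being multiplied was never assigned and held blank-filled memory. The check below uses only the Python standard library.

```python
import struct

OPERAND = 538976288  # value reported in the nag overflow message
INT32_MAX = 2**31 - 1

# Reinterpret the 32-bit integer as its four raw bytes.
raw = struct.pack("<i", OPERAND)
product = OPERAND * OPERAND

print(hex(OPERAND))         # 0x20202020
print(raw == b"    ")       # True: four ASCII spaces
print(product > INT32_MAX)  # True: the product cannot fit in INTEGER(int32)
```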
@slevis-lmwg can you point me to the case directory for your latest failure please.
This test ran with cmeps0.14.59 and gave the same error as cmeps0.14.60:
/scratch/cluster/slevis/ERS_D_Ld5.f10_f10_mg37.I2000Clm50Fates.izumi_nag.clm-FatesCold.C.20240515_150105_j90e50
The test with cmeps0.14.60:
/fs/cgd/data0/slevis/git/latest_master_new/tests_0510-174505iz/ERS_D_Ld5.f10_f10_mg37.I2000Clm50Fates.izumi_nag.clm-FatesCold.GC.0510-174505iz_nag
I copied your source tree and recreated the test ERS_D_Ld5.f10_f10_mg37.I2000Clm50Fates.izumi_nag.clm-FatesCold in /scratch/cluster/jedwards/ERS_D_Ld5.f10_f10_mg37.I2000Clm50Fates.izumi_nag.clm-FatesCold.20240516_125746_fila6h. Everything passes. I also manually compared to the baseline in /fs/cgd/csm/ccsm_baselines/ctsm5.2.005/, since I did not specify it in the test, and that also passes. Is your error repeatable?
Very sorry @jedwards4b I see why it worked for you:
- I thought you would checkout the branch from the PR
- Instead you ended up with my cmeps0.14.50 test, which passes
- Copy my Externals.cfg again to get it with cmeps0.14.60. Now I would expect the failure to repeat.
@billsacks just confirming that you recommend removing /lilac/CMakeLists.txt as not needed.
@jedwards4b back to the question of NOT removing mct, I need clarification:
- In my local branch I have reverted the removal of /src/cpl/mct and I am ready to push to here
- Do you need me to revert anything else mct-related or other?
- Also to confirm beyond doubt: the instruction to remove mct that I followed in this post was mistaken?
just confirming that you recommend removing /lilac/CMakeLists.txt as not needed.
I am almost positive. That said, I know there are other no-longer-needed things in the lilac directory so it's possible it would be easiest to remove everything at once in case there are cross-references between them that help see what isn't needed.
LILAC test failure:
- PASS Test with ctsm5.2.004 (no change) OR change all externals except keep ccs_config version 92 or 93 or 94
- FAIL Change all externals except keep ccs_config version 95
- FAIL with same error, tests with cf1a29786 (change all externals) and later (ccce06198, db147cd61, 3d1ab9be7)
I have a fix for the izumi nag test failures (https://github.com/ESCOMP/CMEPS/pull/460). I think that you can mark those as known failures for now; they only occur with nag in debug mode, due to a math operation on a variable that is not used again.
Ok, thanks @jedwards4b, sounds good.
@jedwards4b I have now figured out when the LILAC test starts failing and would like to run it by you as well (@billsacks yesterday I had confused myself with test permutations):
- PASS Change all externals except keep ccs_config version 94
- FAIL Change all externals except keep ccs_config version 95. This failing test is here:
/glade/derecho/scratch/slevis/LILACSMOKE_D_Ld2.f10_f10_mg37.I2000Ctsm50NwpSpAsRs.derecho_intel.clm-lilac.C.20240517_144532_pmf0rk
Code is here: /glade/work/slevis/git/latest_master
git branch: upd_externals_to_beta17
git describe: ctsm5.2.004-19-ge4544510e
git diff:
diff --git a/Externals.cfg b/Externals.cfg
index a8a77a40f..e8a4c0d85 100644
--- a/Externals.cfg
+++ b/Externals.cfg
@@ -29,12 +29,12 @@ required = True
[mizuRoute]
local_path = components/mizuRoute
protocol = git
-repo_url = https://github.com/ESCOMP/mizuRoute
-hash = 81c720c
+repo_url = https://github.com/nmizukami/mizuRoute
+hash = 34723c2
required = True
[ccs_config]
-tag = ccs_config_cesm0.0.106
+tag = ccs_config_cesm0.0.95
protocol = git
repo_url = https://github.com/ESMCI/ccs_config_cesm.git
local_path = ccs_config
NOTE: Tests ran with old mizuRoute. I am not pushing that back to the PR until I discuss with Erik.
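Pinning a single external while leaving the rest untouched, as in the diff above, can also be scripted. The sketch below is an illustration under stated assumptions: `pin_tag` is a hypothetical helper, the sample text is a shortened stand-in for `Externals.cfg`, and it relies only on the file being INI-style so that Python's `configparser` can read it.

```python
import configparser
import io

# Shortened stand-in for the [ccs_config] section of Externals.cfg.
SAMPLE = """\
[ccs_config]
tag = ccs_config_cesm0.0.106
protocol = git
repo_url = https://github.com/ESMCI/ccs_config_cesm.git
local_path = ccs_config
"""


def pin_tag(cfg_text: str, section: str, tag: str) -> str:
    """Return cfg_text with the `tag` option of one section replaced."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    if not parser.has_section(section):
        raise KeyError(f"no [{section}] section found")
    parser.set(section, "tag", tag)
    out = io.StringIO()
    parser.write(out)
    return out.getvalue()


pinned = pin_tag(SAMPLE, "ccs_config", "ccs_config_cesm0.0.95")
print(pinned)
```

Note that `configparser` drops comments when it writes the file back, so for a one-off test editing the tag by hand may be just as practical. Either way, the externals still have to be checked out again after the change.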
aux_clm on izumi: OK (now includes the long new list of izumi nag debug EXPECTED FAILURES)
aux_clm on derecho: IN PROGRESS
Other derecho tests:
PASS make black and make lint
PASS ./run_ctsm_py_tests -u
PASS ./run_ctsm_py_tests -s
PASS ./build-namelist_test.pl
The most likely change I see in ccs_config tag 095 is the removal of this line:
<directive> -V </directive>
from the <batch_system type="pbs" > block. @slevis-lmwg can you try restoring that and seeing if it solves the problem? It looks like that controls exporting environment variables (https://gridscheduler.sourceforge.net/htmlman/htmlman1/qsub.html). I'm not sure off-hand why that would matter in the LILAC test differently from other tests / cases, but if that fixes the issue we could try to dig further.
That was it, thank you @billsacks. Do you recommend I open an issue? If so, in ctsm and/or elsewhere?
I will update to the latest ctsm tag now and resume testing on derecho.
@billsacks @slevis-lmwg the -V option was removed because it was causing a number of problems. I think that it would be good if you could identify which environment variable lilac is using that needs to be exported; then we can explicitly export it.
You might be able to do this by comparing the file software_environment.txt from a passing to a failing case.
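A comparison of this kind is easier to read when keyed on variable names rather than line positions, since a plain `diff` reports reordering as changes. The following is a rough sketch under the assumption that the relevant part of `software_environment.txt` consists of `NAME=value` lines; `parse_env` and `compare_env` are hypothetical helpers, and the sample values are a small subset of those that appear in this thread.

```python
from typing import Dict, List, Tuple


def parse_env(text: str) -> Dict[str, str]:
    """Collect NAME=value lines; anything else (e.g. module listings) is skipped."""
    env = {}
    for line in text.splitlines():
        name, sep, value = line.partition("=")
        if sep and name.isidentifier():
            env[name] = value
    return env


def compare_env(a: Dict[str, str], b: Dict[str, str]) -> Tuple[List[str], List[str], List[str]]:
    """Return (names only in a, names only in b, names whose values differ)."""
    only_a = sorted(set(a) - set(b))
    only_b = sorted(set(b) - set(a))
    changed = sorted(k for k in set(a) & set(b) if a[k] != b[k])
    return only_a, only_b, changed


# Illustrative excerpts only; real files are several hundred lines long.
case_a = "  1) cesmdev/1.0 (H,S)\nGPG_TTY=/dev/pts/147\nXTERM_SHELL=/usr/bin/bash\nSHLVL=4\n"
case_b = "  1) cesmdev/1.0 (H,S)\nGPG_TTY=/dev/pts/145\nSHLVL=3\n"

only_a, only_b, changed = compare_env(parse_env(case_a), parse_env(case_b))
print("only in a:", only_a)
print("only in b:", only_b)
print("changed:  ", changed)
```

Variables that appear on only one side are the natural first candidates for something the job needs to have exported explicitly.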
The diffs that I see in software_environment.txt (I tried removing false diffs that were due to changing line numbers):
2,6c2,5
< 1) cesmdev/1.0 (H,S) 6) ncarcompilers/1.0.0 11) parallel-netcdf/1.12.3
< 2) ncarenv/23.09 (S) 7) cmake/3.26.3 12) parallelio/2.6.2-debug
< 3) craype/2.7.23 8) cray-mpich/8.1.27 13) esmf/8.6.0-debug
< 4) intel/2023.2.1 9) hdf5-mpi/1.12.2
< 5) mkl/2023.2.0 10) netcdf-mpi/4.9.2
---
> 1) cesmdev/1.0 (H,S) 5) mkl/2023.2.0 9) hdf5-mpi/1.12.2 13) esmf/8.6.0-debug
> 2) ncarenv/23.09 (S) 6) ncarcompilers/1.0.0 10) netcdf-mpi/4.9.2
> 3) craype/2.7.23 7) cmake/3.26.3 11) parallel-netcdf/1.12.3
> 4) intel/2023.2.1 8) cray-mpich/8.1.27 12) parallelio/2.6.2-debug
76c75
< GPG_TTY=/dev/pts/147
---
> GPG_TTY=/dev/pts/145
131d129
< XTERM_SHELL=/usr/bin/bash
143c141
< _ModuleTable012_=ZS93b3JrL3NsZXZpcy9zcGFjay1kb3duc3RyZWFtcy9kZXJlY2hvL21vZHVsZXMvMjMuMDkvY3JheS1tcGljaC84LjEuMjcvb25lYXBpLzIwMjMuMi4xIiwgIi9nbGFkZS91L2FwcHMvZGVyZWNoby9tb2R1bGVzLzIzLjA5L2NyYXktbXBpY2gvOC4xLjI3L29uZWFwaS8yMDIzLjIuMSIsCn0sCnN5c3RlbUJhc2VNUEFUSCA9ICIvZ2xhZGUvdS9hcHBzL2RlcmVjaG8vbW9kdWxlcy9lbnZpcm9ubWVudCIsCn0K
---
> _ModuleTable012_=d29yay9zbGV2aXMvc3BhY2stZG93bnN0cmVhbXMvZGVyZWNoby9tb2R1bGVzLzIzLjA5L2NyYXktbXBpY2gvOC4xLjI3L29uZWFwaS8yMDIzLjIuMSIsICIvZ2xhZGUvdS9hcHBzL2RlcmVjaG8vbW9kdWxlcy8yMy4wOS9jcmF5LW1waWNoLzguMS4yNy9vbmVhcGkvMjAyMy4yLjEiLAp9LApzeXN0ZW1CYXNlTVBBVEggPSAiL2dsYWRlL3UvYXBwcy9kZXJlY2hvL21vZHVsZXMvZW52aXJvbm1lbnQiLAp9Cg==
212c210
< _ModuleTable010_=cy8yMy4wOS9jcmF5LW1waWNoLzguMS4yNy9vbmVhcGkvMjAyMy4yLjEvcGFyYWxsZWxpby8yLjYuMi1kZWJ1Zy5sdWEiLApmdWxsTmFtZSA9ICJwYXJhbGxlbGlvLzIuNi4yLWRlYnVnIiwKbG9hZE9yZGVyID0gMTIsCnByb3BUID0ge30sCnN0YWNrRGVwdGggPSAwLApzdGF0dXMgPSAiYWN0aXZlIiwKdXNlck5hbWUgPSAicGFyYWxsZWxpby8yLjYuMi1kZWJ1ZyIsCndWID0gIl4wMDAwMDAwMi4wMDAwMDAwMDYuMDAwMDAwMDAyLipkZWJ1Zy4qemZpbmFsIiwKfSwKfSwKbXBhdGhBID0gewoiL2dsYWRlL3UvYXBwcy9kZXJlY2hvL21vZHVsZXMvZW52aXJvbm1lbnQiCiwgIi9nbGFkZS91L2FwcHMvY3NlZy9kZXJlY2hvL21vZHVsZXMvMjMuMDkvQ29yZSIKLCAiL2dsYWRlL3dv
---
> _ModuleTable010_=cy8yMy4wOS9jcmF5LW1waWNoLzguMS4yNy9vbmVhcGkvMjAyMy4yLjEvcGFyYWxsZWxpby8yLjYuMi1kZWJ1Zy5sdWEiLApmdWxsTmFtZSA9ICJwYXJhbGxlbGlvLzIuNi4yLWRlYnVnIiwKbG9hZE9yZGVyID0gMTIsCnByb3BUID0ge30sCnN0YWNrRGVwdGggPSAwLApzdGF0dXMgPSAiYWN0aXZlIiwKdXNlck5hbWUgPSAicGFyYWxsZWxpby8yLjYuMi1kZWJ1ZyIsCndWID0gIl4wMDAwMDAwMi4wMDAwMDAwMDYuMDAwMDAwMDAyLipkZWJ1Zy4qemZpbmFsIiwKfSwKfSwKbXBhdGhBID0gewoiL2dsYWRlL3UvYXBwcy9kZXJlY2hvL21vZHVsZXMvZW52aXJvbm1lbnQiLCAiL2dsYWRlL3UvYXBwcy9jc2VnL2RlcmVjaG8vbW9kdWxlcy8yMy4wOS9Db3JlIgosICIvZ2xhZGUvd29y
220d217
< XTERM_VERSION=XTerm(330)
281c278
< SHLVL=4
---
> SHLVL=3
294d290
< WINDOWID=2621453
334c330
< _ModuleTable011_=cmsvc2xldmlzL3NwYWNrLWRvd25zdHJlYW1zL2RlcmVjaG8vbW9kdWxlcy8yMy4wOS9Db3JlIgosICIvZ2xhZGUvdS9hcHBzL2RlcmVjaG8vbW9kdWxlcy8yMy4wOS9Db3JlIgosICIvZ2xhZGUvdS9hcHBzL2NzZWcvZGVyZWNoby9tb2R1bGVzLzIzLjA5L29uZWFwaS8yMDIzLjIuMSIKLCAiL2dsYWRlL3dvcmsvc2xldmlzL3NwYWNrLWRvd25zdHJlYW1zL2RlcmVjaG8vbW9kdWxlcy8yMy4wOS9vbmVhcGkvMjAyMy4yLjEiCiwgIi9nbGFkZS91L2FwcHMvZGVyZWNoby9tb2R1bGVzLzIzLjA5L29uZWFwaS8yMDIzLjIuMSIKLCAiL2dsYWRlL3UvYXBwcy9jc2VnL2RlcmVjaG8vbW9kdWxlcy8yMy4wOS9jcmF5LW1waWNoLzguMS4yNy9vbmVhcGkvMjAyMy4yLjEiCiwgIi9nbGFk
---
> _ModuleTable011_=ay9zbGV2aXMvc3BhY2stZG93bnN0cmVhbXMvZGVyZWNoby9tb2R1bGVzLzIzLjA5L0NvcmUiLCAiL2dsYWRlL3UvYXBwcy9kZXJlY2hvL21vZHVsZXMvMjMuMDkvQ29yZSIKLCAiL2dsYWRlL3UvYXBwcy9jc2VnL2RlcmVjaG8vbW9kdWxlcy8yMy4wOS9vbmVhcGkvMjAyMy4yLjEiCiwgIi9nbGFkZS93b3JrL3NsZXZpcy9zcGFjay1kb3duc3RyZWFtcy9kZXJlY2hvL21vZHVsZXMvMjMuMDkvb25lYXBpLzIwMjMuMi4xIgosICIvZ2xhZGUvdS9hcHBzL2RlcmVjaG8vbW9kdWxlcy8yMy4wOS9vbmVhcGkvMjAyMy4yLjEiCiwgIi9nbGFkZS91L2FwcHMvY3NlZy9kZXJlY2hvL21vZHVsZXMvMjMuMDkvY3JheS1tcGljaC84LjEuMjcvb25lYXBpLzIwMjMuMi4xIgosICIvZ2xhZGUv
349d344
< XTERM_LOCALE=en_US.UTF-8
I will likely be offline from now until Tuesday or Wednesday.
The only significant difference is in the xterm variables - but I don't have any idea why lilac might care about that?
git grep -i mct
returns very little stuff now.
Question: May I remove (or rename) this:
src/main/glc2lndMod.F90: procedure, public :: set_glc2lnd_fields_mct ! set coupling fields sent from glc to lnd
src/main/glc2lndMod.F90: subroutine set_glc2lnd_fields_mct(this, bounds, glc_present, x2l, &
src/main/glc2lndMod.F90: character(len=*), parameter :: subname = 'set_glc2lnd_fields_mct'
src/main/glc2lndMod.F90: end subroutine set_glc2lnd_fields_mct
@slevis-lmwg we should just remove this. And of course make sure everything works without it. It should, and if not it's likely an easy thing to fix that we'd want to do anyway.