Update Externals.cfg to cesm2_3_beta17 and remove mct

Open slevis-lmwg opened this issue 9 months ago • 62 comments

Description of changes

Specifics listed in issues:
  • #2493 upd. externals to beta17
  • #2294 remove mct, but not entirely, so I'm removing this issue from the "fixed" list
  • #2546 fix error in cam4/cam5 test (unrelated)
  • #2279 Retire the /test/tools framework for CESM test system custom tests that do the same thing

Specific notes

Contributors other than yourself, if any: @ekluzek @jedwards4b @billsacks

CTSM Issues Fixed (include github issue #): Fixes #2493 Fixes #2546 Fixes #2279

Are answers expected to change (and if so in what way)? No

Any User Interface Changes (namelist or namelist defaults changes)? Yes, and it was done.

Does this create a need to change or add documentation? Did you do so? I don't think so.

Testing performed, if any: To play it safe, I will run the following tests:
  • PASS ./build-namelist_test.pl
  • PASS python tests -u and -s
  • PASS make black (make lint gives minor complaint but perfect score)
  • OK aux_clm on derecho (first time result; see subsequent results below)

slevis-lmwg avatar May 09 '24 23:05 slevis-lmwg

Next I will go through the checklist in #2294 and rerun tests.

slevis-lmwg avatar May 09 '24 23:05 slevis-lmwg

git grep -i mct returns very little stuff now.

Question: May I remove (or rename) this:

src/main/glc2lndMod.F90:     procedure, public  :: set_glc2lnd_fields_mct   ! set coupling fields sent from glc to lnd
src/main/glc2lndMod.F90:  subroutine set_glc2lnd_fields_mct(this, bounds, glc_present, x2l, &
src/main/glc2lndMod.F90:    character(len=*), parameter :: subname = 'set_glc2lnd_fields_mct'
src/main/glc2lndMod.F90:  end subroutine set_glc2lnd_fields_mct

slevis-lmwg avatar May 10 '24 22:05 slevis-lmwg

aux_clm derecho FAIL. The two baseline diffs may be due to the new PE layouts, though the PE layouts differ for all the tests. The first three failures seem to be caused by the new externals. I will review these with Erik.

FAIL FUNITCTSM_P1x1.f10_f10_mg37.I2000Clm50Sp.derecho_intel RUN time=16
FAIL LILACSMOKE_D_Ld2.f10_f10_mg37.I2000Ctsm50NwpSpAsRs.derecho_intel.clm-lilac RUN time=0
FAIL SMS_D.f10_f10_mg37.I2000Clm60BgcCrop.derecho_nvhpc.clm-crop SETUP
FAIL SMS.f10_f10_mg37.I2000Clm50BgcCrop.derecho_nvhpc.clm-crop BASELINE ctsm5.2.004: DIFF
FAIL SMS.f45_f45_mg37.I2000Clm60FatesSpRsGs.derecho_nvhpc.clm-FatesColdSatPhen BASELINE ctsm5.2.004: DIFF

izumi FAIL. Several nag tests fail with:

Runtime Error: /fs/cgd/data0/slevis/git/latest_master_new/components/cmeps/cime_config/../mediator/med_phases_restart_mod.F90, line 350: INTEGER(int32) overflow for 538976288 * 538976288
Program terminated by fatal error
/fs/cgd/data0/slevis/git/latest_master_new/components/cmeps/cime_config/../mediator/med_phases_restart_mod.F90, line 350: Error occurred in MED_PHASES_RESTART_MOD:MED_PHASES_RESTART_WRITE
/fs/cgd/data0/slevis/git/latest_master_new/components/cmeps/cime_config/../cesm/driver/esmApp.F90, line 141: Called by ESMAPP
[i027.cgd.ucar.edu:mpi_rank_0][error_sighandler] Caught error: Aborted (signal 6)

slevis-lmwg avatar May 10 '24 23:05 slevis-lmwg

From meeting with @ekluzek

  1. For the SETUP failure above, bisect between ccs_config_cesm0.0.92 and 106 until I find where it first fails. Note:
  • This test was expected to fail in SHAREDLIB_BUILD
  • the SETUP error starts with version 99

  • 0.0.92 FAIL SHAREDLIB_BUILD
  • 0.0.98 FAIL same as previous
  • 0.0.99 FAIL same as next
  • 0.0.106 FAIL SETUP, see /glade/work/slevis/git/latest_master/tests_0510-170901de/SMS_D.f10_f10_mg37.I2000Clm60BgcCrop.derecho_nvhpc.clm-crop.GC.0510-170901de_nvh/TestStatus.log

@fischer-ncar Erik suggested that I bring this to your attention and possibly open issues in ccs_config and ctsm. Let me know if my diagnosis does not provide enough info and/or whether you would like to discuss in a meeting. Feel free to contact me with questions.

UPDATE: This test now continues as it used to and fails in the SHAREDLIB_BUILD phase.

  2. Try a similar approach for all the failures:
  • On izumi it's probably cmeps: cmeps0.14.50 PASS; cmeps0.14.51 FAIL with a new error:
Error: /fs/cgd/data0/slevis/git/latest_master_new/components/cmeps/cime_config/../mediator/med_phases_cdeps_mod.F90, line 10: Cannot find symbol ESMF_GRIDCOMPGETINTERNALSTATE in module ESMF
        
        ERROR: BUILD FAIL: buildexe failed, cat /scratch/cluster/slevis/ERS_D_Ld5.f10_f10_mg37.I2000Clm50Fates.izumi_nag.clm-FatesCold.C.20240515_143052_sm2ymy/bld/cesm.bldlog.240515-145356

cmeps0.14.58 FAIL same as previous; cmeps0.14.59 FAIL same as next; cmeps0.14.60 FAIL with the error that appears in the previous post.

  • For FUNIT, I had to bring back a CMakeLists.txt file that I had removed
  • For LILAC, thus far I have confirmed that it's not ccs_config (by running with version 99 as above) and that it's not cime (by running cime6.0.217_httpsbranch03). I backed out various changes and continued to get the same error; Erik suggested this, but I still got the same error. Jim E. posted that I should not have removed mct, so I will reverse that commit and try again. (UPDATE about LILAC appears below, based on which version 99 here is not likely a typo, and I may have confused myself with these earlier tests.)

slevis-lmwg avatar May 14 '24 20:05 slevis-lmwg

@slevis-lmwg the SETUP failure looks like an issue with the debug libraries missing for ESMF and pio for nvhpc/24.3.

cannot be loaded as requested: "parallelio/2.6.2-debug", "esmf/8.6.0-debug"

@jedwards4b should these libraries be available, or do we need to switch to the non-debug versions?

fischer-ncar avatar May 15 '24 18:05 fischer-ncar

@slevis-lmwg I am currently working on the parallelio-debug issue, hope to have a resolution soon.

jedwards4b avatar May 15 '24 22:05 jedwards4b

@slevis-lmwg although you can remove cpl7, you cannot yet remove mct. However, I don't expect you to remove any externals in this tag; I will remove them in the next tag.

jedwards4b avatar May 15 '24 22:05 jedwards4b

Ok, thanks @jedwards4b I will put back mct then.

slevis-lmwg avatar May 15 '24 22:05 slevis-lmwg

@slevis-lmwg the issue (cannot be loaded as requested: "parallelio/2.6.2-debug", "esmf/8.6.0-debug") should now be fixed, please try again.

jedwards4b avatar May 16 '24 14:05 jedwards4b

@slevis-lmwg the issue (cannot be loaded as requested: "parallelio/2.6.2-debug", "esmf/8.6.0-debug") should now be fixed, please try again.

I confirmed. Thanks @jedwards4b.

slevis-lmwg avatar May 16 '24 15:05 slevis-lmwg

@jedwards4b @fischer-ncar another error that I would like to run by you... I will summarize my earlier posts here.

Izumi nag tests fail. I tried the "bisect" method with cmeps versions and got the following results:
  • cmeps0.14.50 PASS
  • cmeps0.14.51 FAIL with a different error than version cmeps0.14.60, so probably do NOT focus on this one for now?

Error: /fs/cgd/data0/slevis/git/latest_master_new/components/cmeps/cime_config/../mediator/med_phases_cdeps_mod.F90, line 10: Cannot find symbol ESMF_GRIDCOMPGETINTERNALSTATE in module ESMF
        
        ERROR: BUILD FAIL: buildexe failed, cat /scratch/cluster/slevis/ERS_D_Ld5.f10_f10_mg37.I2000Clm50Fates.izumi_nag.clm-FatesCold.C.20240515_143052_sm2ymy/bld/cesm.bldlog.240515-145356

  • cmeps0.14.58 FAIL same as previous
  • cmeps0.14.59 FAIL same as next
  • cmeps0.14.60 FAIL with this error message:

Runtime Error: /fs/cgd/data0/slevis/git/latest_master_new/components/cmeps/cime_config/../mediator/med_phases_restart_mod.F90, line 350: INTEGER(int32) overflow for 538976288 * 538976288
Program terminated by fatal error
/fs/cgd/data0/slevis/git/latest_master_new/components/cmeps/cime_config/../mediator/med_phases_restart_mod.F90, line 350: Error occurred in MED_PHASES_RESTART_MOD:MED_PHASES_RESTART_WRITE
/fs/cgd/data0/slevis/git/latest_master_new/components/cmeps/cime_config/../cesm/driver/esmApp.F90, line 141: Called by ESMAPP
[i027.cgd.ucar.edu:mpi_rank_0][error_sighandler] Caught error: Aborted (signal 6)

slevis-lmwg avatar May 16 '24 17:05 slevis-lmwg

@slevis-lmwg can you point me to the case directory for your latest failure please.

jedwards4b avatar May 16 '24 17:05 jedwards4b

@slevis-lmwg can you point me to the case directory for your latest failure please.

This test ran with cmeps0.14.59 and gave the same error as cmeps0.14.60: /scratch/cluster/slevis/ERS_D_Ld5.f10_f10_mg37.I2000Clm50Fates.izumi_nag.clm-FatesCold.C.20240515_150105_j90e50
The test with cmeps0.14.60: /fs/cgd/data0/slevis/git/latest_master_new/tests_0510-174505iz/ERS_D_Ld5.f10_f10_mg37.I2000Clm50Fates.izumi_nag.clm-FatesCold.GC.0510-174505iz_nag

slevis-lmwg avatar May 16 '24 18:05 slevis-lmwg

I copied your source tree and recreated the test ERS_D_Ld5.f10_f10_mg37.I2000Clm50Fates.izumi_nag.clm-FatesCold in /scratch/cluster/jedwards/ERS_D_Ld5.f10_f10_mg37.I2000Clm50Fates.izumi_nag.clm-FatesCold.20240516_125746_fila6h. Everything passes. I also manually compared to the baseline in /fs/cgd/csm/ccsm_baselines/ctsm5.2.005/ since I did not specify it in the test; that also passes. Is your error repeatable?

jedwards4b avatar May 16 '24 19:05 jedwards4b

Very sorry @jedwards4b I see why it worked for you:

  • I thought you would checkout the branch from the PR
  • Instead you ended up with my cmeps0.14.50 test, which passes
  • Copy my Externals.cfg again to get it with cmeps0.14.60. Now I would expect the failure to repeat.

slevis-lmwg avatar May 16 '24 20:05 slevis-lmwg

@billsacks just confirming that you recommend removing /lilac/CMakeLists.txt as not needed.

slevis-lmwg avatar May 16 '24 21:05 slevis-lmwg

@jedwards4b back to the question of NOT removing mct, I need clarification:

  • In my local branch I have reverted the removal of /src/cpl/mct and I am ready to push to here
  • Do you need me to revert anything else mct-related or other?
  • Also to confirm beyond doubt: the instruction to remove mct that I followed in this post was mistaken?

slevis-lmwg avatar May 16 '24 22:05 slevis-lmwg

just confirming that you recommend removing /lilac/CMakeLists.txt as not needed.

I am almost positive. That said, I know there are other no-longer-needed things in the lilac directory so it's possible it would be easiest to remove everything at once in case there are cross-references between them that help see what isn't needed.

billsacks avatar May 16 '24 22:05 billsacks

LILAC test failure:

  • PASS Test with ctsm5.2.004 (no change) OR change all externals except keep ccs_config version 92 or 93 or 94
  • FAIL Change all externals except keep ccs_config version 95
  • FAIL with same error, tests with cf1a29786 (change all externals) and later (ccce06198, db147cd61, 3d1ab9be7)
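
The version-by-version search described above amounts to a bisection over an ordered list of tags. A minimal sketch follows, assuming a single pass-to-fail transition; `first_failing` and `fake_run` are hypothetical names, and `fake_run` only stands in for "check out this tag and run the LILAC test" using the 94 PASS / 95 FAIL result reported here.

```python
from typing import Callable, Sequence


def first_failing(versions: Sequence[str], passes: Callable[[str], bool]) -> str:
    """Return the earliest version for which passes() is False.

    Assumes versions are ordered oldest to newest, the first one passes,
    the last one fails, and there is exactly one pass-to-fail transition.
    """
    lo, hi = 0, len(versions) - 1  # invariant: versions[lo] passes, versions[hi] fails
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if passes(versions[mid]):
            lo = mid
        else:
            hi = mid
    return versions[hi]


# Hypothetical stand-in for a real test run: tags below 95 pass, 95 and later fail.
def fake_run(tag: str) -> bool:
    return int(tag.rsplit(".", 1)[1]) < 95


tags = [f"ccs_config_cesm0.0.{n}" for n in range(92, 107)]
culprit = first_failing(tags, fake_run)
print(culprit)
```

With 15 candidate tags this needs four test runs instead of up to fifteen, which matters when each run is a full build and submit.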

slevis-lmwg avatar May 16 '24 23:05 slevis-lmwg

I have a fix for the izumi nag test failures (https://github.com/ESCOMP/CMEPS/pull/460). I think that you can mark those as known failures for now - they only occur with nag in debug mode due to a math operation on a variable that is not used again.
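
As a side note on the NAG message quoted earlier in this thread: the operand 538976288 equals 0x20202020, the bit pattern of four ASCII space characters. That would be consistent with arithmetic on a variable holding blank or otherwise unset data, although the thread does not confirm the exact cause. A quick check of the arithmetic, as a sketch only:

```python
import struct

INT32_MAX = 2**31 - 1
value = 538976288  # operand reported in the NAG "INTEGER(int32) overflow" message

raw = struct.pack(">i", value)  # the same 32 bits viewed as four bytes
product = value * value         # Python ints do not overflow, so this is exact

print(hex(value))               # hexadecimal form of the operand
print(raw)                      # the bytes, shown as characters
print(product > INT32_MAX)      # whether the product fits in a signed 32-bit integer
```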

jedwards4b avatar May 17 '24 16:05 jedwards4b

Ok, thanks @jedwards4b, sounds good.

slevis-lmwg avatar May 17 '24 17:05 slevis-lmwg

@jedwards4b I have now figured out when the LILAC test starts failing and would like to run it by you as well (@billsacks yesterday I had confused myself with test permutations):

  • PASS Change all externals except keep ccs_config version 94
  • FAIL Change all externals except keep ccs_config version 95. This failing test is here: /glade/derecho/scratch/slevis/LILACSMOKE_D_Ld2.f10_f10_mg37.I2000Ctsm50NwpSpAsRs.derecho_intel.clm-lilac.C.20240517_144532_pmf0rk
    Code is here: /glade/work/slevis/git/latest_master
    git branch: upd_externals_to_beta17
    git describe: ctsm5.2.004-19-ge4544510e
    git diff:
diff --git a/Externals.cfg b/Externals.cfg
index a8a77a40f..e8a4c0d85 100644
--- a/Externals.cfg
+++ b/Externals.cfg
@@ -29,12 +29,12 @@ required = True
 [mizuRoute]
 local_path = components/mizuRoute
 protocol = git
-repo_url = https://github.com/ESCOMP/mizuRoute
-hash = 81c720c
+repo_url = https://github.com/nmizukami/mizuRoute
+hash = 34723c2
 required = True
 
 [ccs_config]
-tag = ccs_config_cesm0.0.106
+tag = ccs_config_cesm0.0.95
 protocol = git
 repo_url = https://github.com/ESMCI/ccs_config_cesm.git
 local_path = ccs_config
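
Externals.cfg uses INI-style sections, so swapping the ccs_config tag for a test like this one can also be scripted. The sketch below uses Python's standard `configparser` on the fragment shown in the diff; `with_tag` is a hypothetical helper, and this is not how manage_externals itself handles the file.

```python
import configparser

# Fragment copied from the diff above (the [ccs_config] section only).
SAMPLE = """\
[ccs_config]
tag = ccs_config_cesm0.0.106
protocol = git
repo_url = https://github.com/ESMCI/ccs_config_cesm.git
local_path = ccs_config
"""


def with_tag(cfg_text: str, section: str, tag: str) -> configparser.ConfigParser:
    """Parse INI-style text and return a parser with the section's tag replaced."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    parser[section]["tag"] = tag
    return parser


cfg = with_tag(SAMPLE, "ccs_config", "ccs_config_cesm0.0.95")
print(cfg["ccs_config"]["tag"])
```

Note that writing the parser back out with `cfg.write(...)` would drop any comments in the original file, so for a one-off test a manual edit may be simpler.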

slevis-lmwg avatar May 17 '24 21:05 slevis-lmwg

NOTE: Tests ran with old mizuRoute. I am not pushing that back to the PR until I discuss with Erik.

aux_clm on izumi OK (now includes the long new list of izumi nag debug EXPECTED FAILURES)
aux_clm on derecho IN PROGRESS

Other derecho tests:
  • PASS make black and make lint
  • PASS ./run_ctsm_py_tests -u
  • PASS ./run_ctsm_py_tests -s
  • PASS ./build-namelist_test.pl

slevis-lmwg avatar May 17 '24 22:05 slevis-lmwg

The most likely change I see in ccs_config tag 095 is the removal of this line:

<directive> -V </directive>

from the <batch_system type="pbs" > block. @slevis-lmwg can you try restoring that and seeing if it solves the problem? It looks like that controls exporting environment variables (https://gridscheduler.sourceforge.net/htmlman/htmlman1/qsub.html). I'm not sure off-hand why that would matter in the LILAC test differently from other tests / cases, but if that fixes the issue we could try to dig further.

billsacks avatar May 17 '24 23:05 billsacks

That was it, thank you @billsacks. Do you recommend I open an issue? If so, in ctsm and/or elsewhere?

I will update to the latest ctsm tag now and resume testing on derecho.

slevis-lmwg avatar May 18 '24 00:05 slevis-lmwg

@billsacks @slevis-lmwg the -V option was removed because it was causing a number of problems. I think that it would be good if you could identify what environment variable lilac is using that needs to be exported then we can explicitly export it.
You might be able to do this by comparing the file software_environment.txt from a passing to a failing case.
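
One way to make that comparison less noisy than a raw line diff is to compare by variable name. This is a minimal sketch, assuming the file contains KEY=VALUE lines mixed with other output (module listings and so on); `parse_env` and `env_diff` are hypothetical helpers, and the sample values are illustrative.

```python
def parse_env(text: str) -> dict[str, str]:
    """Collect KEY=VALUE lines, skipping anything that is not a plain variable."""
    env = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        # Keep only lines whose key looks like a shell variable name.
        if sep and key.replace("_", "a").isalnum():
            env[key] = value
    return env


def env_diff(passing: str, failing: str, ignore_prefixes=("_ModuleTable",)):
    """Return (only in passing, only in failing, present in both but different)."""
    a, b = parse_env(passing), parse_env(failing)

    def keep(name: str) -> bool:
        return not name.startswith(ignore_prefixes)

    only_passing = sorted(k for k in a.keys() - b.keys() if keep(k))
    only_failing = sorted(k for k in b.keys() - a.keys() if keep(k))
    changed = sorted(k for k in a.keys() & b.keys() if keep(k) and a[k] != b[k])
    return only_passing, only_failing, changed


# Illustrative inputs; which case is "passing" here is an assumption.
passing_txt = "  1) cesmdev/1.0\nSHLVL=4\nXTERM_SHELL=/usr/bin/bash\n_ModuleTable012_=abc\n"
failing_txt = "  1) cesmdev/1.0\nSHLVL=3\n_ModuleTable012_=xyz\n"
result = env_diff(passing_txt, failing_txt)
print(result)
```

Variables that exist in only one of the two cases are the first candidates for something LILAC needs exported.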

jedwards4b avatar May 18 '24 01:05 jedwards4b

The diffs that I see in software_environment.txt (I tried removing false diffs that were due to changing line numbers):

2,6c2,5
<   1) cesmdev/1.0    (H,S)   6) ncarcompilers/1.0.0  11) parallel-netcdf/1.12.3
<   2) ncarenv/23.09  (S)     7) cmake/3.26.3         12) parallelio/2.6.2-debug
<   3) craype/2.7.23          8) cray-mpich/8.1.27    13) esmf/8.6.0-debug
<   4) intel/2023.2.1         9) hdf5-mpi/1.12.2
<   5) mkl/2023.2.0          10) netcdf-mpi/4.9.2
---
>   1) cesmdev/1.0    (H,S)   5) mkl/2023.2.0          9) hdf5-mpi/1.12.2         13) esmf/8.6.0-debug
>   2) ncarenv/23.09  (S)     6) ncarcompilers/1.0.0  10) netcdf-mpi/4.9.2
>   3) craype/2.7.23          7) cmake/3.26.3         11) parallel-netcdf/1.12.3
>   4) intel/2023.2.1         8) cray-mpich/8.1.27    12) parallelio/2.6.2-debug
76c75
< GPG_TTY=/dev/pts/147
---
> GPG_TTY=/dev/pts/145
131d129
< XTERM_SHELL=/usr/bin/bash
143c141
< _ModuleTable012_=ZS93b3JrL3NsZXZpcy9zcGFjay1kb3duc3RyZWFtcy9kZXJlY2hvL21vZHVsZXMvMjMuMDkvY3JheS1tcGljaC84LjEuMjcvb25lYXBpLzIwMjMuMi4xIiwgIi9nbGFkZS91L2FwcHMvZGVyZWNoby9tb2R1bGVzLzIzLjA5L2NyYXktbXBpY2gvOC4xLjI3L29uZWFwaS8yMDIzLjIuMSIsCn0sCnN5c3RlbUJhc2VNUEFUSCA9ICIvZ2xhZGUvdS9hcHBzL2RlcmVjaG8vbW9kdWxlcy9lbnZpcm9ubWVudCIsCn0K
---
> _ModuleTable012_=d29yay9zbGV2aXMvc3BhY2stZG93bnN0cmVhbXMvZGVyZWNoby9tb2R1bGVzLzIzLjA5L2NyYXktbXBpY2gvOC4xLjI3L29uZWFwaS8yMDIzLjIuMSIsICIvZ2xhZGUvdS9hcHBzL2RlcmVjaG8vbW9kdWxlcy8yMy4wOS9jcmF5LW1waWNoLzguMS4yNy9vbmVhcGkvMjAyMy4yLjEiLAp9LApzeXN0ZW1CYXNlTVBBVEggPSAiL2dsYWRlL3UvYXBwcy9kZXJlY2hvL21vZHVsZXMvZW52aXJvbm1lbnQiLAp9Cg==
212c210
< _ModuleTable010_=cy8yMy4wOS9jcmF5LW1waWNoLzguMS4yNy9vbmVhcGkvMjAyMy4yLjEvcGFyYWxsZWxpby8yLjYuMi1kZWJ1Zy5sdWEiLApmdWxsTmFtZSA9ICJwYXJhbGxlbGlvLzIuNi4yLWRlYnVnIiwKbG9hZE9yZGVyID0gMTIsCnByb3BUID0ge30sCnN0YWNrRGVwdGggPSAwLApzdGF0dXMgPSAiYWN0aXZlIiwKdXNlck5hbWUgPSAicGFyYWxsZWxpby8yLjYuMi1kZWJ1ZyIsCndWID0gIl4wMDAwMDAwMi4wMDAwMDAwMDYuMDAwMDAwMDAyLipkZWJ1Zy4qemZpbmFsIiwKfSwKfSwKbXBhdGhBID0gewoiL2dsYWRlL3UvYXBwcy9kZXJlY2hvL21vZHVsZXMvZW52aXJvbm1lbnQiCiwgIi9nbGFkZS91L2FwcHMvY3NlZy9kZXJlY2hvL21vZHVsZXMvMjMuMDkvQ29yZSIKLCAiL2dsYWRlL3dv
---
> _ModuleTable010_=cy8yMy4wOS9jcmF5LW1waWNoLzguMS4yNy9vbmVhcGkvMjAyMy4yLjEvcGFyYWxsZWxpby8yLjYuMi1kZWJ1Zy5sdWEiLApmdWxsTmFtZSA9ICJwYXJhbGxlbGlvLzIuNi4yLWRlYnVnIiwKbG9hZE9yZGVyID0gMTIsCnByb3BUID0ge30sCnN0YWNrRGVwdGggPSAwLApzdGF0dXMgPSAiYWN0aXZlIiwKdXNlck5hbWUgPSAicGFyYWxsZWxpby8yLjYuMi1kZWJ1ZyIsCndWID0gIl4wMDAwMDAwMi4wMDAwMDAwMDYuMDAwMDAwMDAyLipkZWJ1Zy4qemZpbmFsIiwKfSwKfSwKbXBhdGhBID0gewoiL2dsYWRlL3UvYXBwcy9kZXJlY2hvL21vZHVsZXMvZW52aXJvbm1lbnQiLCAiL2dsYWRlL3UvYXBwcy9jc2VnL2RlcmVjaG8vbW9kdWxlcy8yMy4wOS9Db3JlIgosICIvZ2xhZGUvd29y
220d217
< XTERM_VERSION=XTerm(330)
281c278
< SHLVL=4
---
> SHLVL=3
294d290
< WINDOWID=2621453
334c330
< _ModuleTable011_=cmsvc2xldmlzL3NwYWNrLWRvd25zdHJlYW1zL2RlcmVjaG8vbW9kdWxlcy8yMy4wOS9Db3JlIgosICIvZ2xhZGUvdS9hcHBzL2RlcmVjaG8vbW9kdWxlcy8yMy4wOS9Db3JlIgosICIvZ2xhZGUvdS9hcHBzL2NzZWcvZGVyZWNoby9tb2R1bGVzLzIzLjA5L29uZWFwaS8yMDIzLjIuMSIKLCAiL2dsYWRlL3dvcmsvc2xldmlzL3NwYWNrLWRvd25zdHJlYW1zL2RlcmVjaG8vbW9kdWxlcy8yMy4wOS9vbmVhcGkvMjAyMy4yLjEiCiwgIi9nbGFkZS91L2FwcHMvZGVyZWNoby9tb2R1bGVzLzIzLjA5L29uZWFwaS8yMDIzLjIuMSIKLCAiL2dsYWRlL3UvYXBwcy9jc2VnL2RlcmVjaG8vbW9kdWxlcy8yMy4wOS9jcmF5LW1waWNoLzguMS4yNy9vbmVhcGkvMjAyMy4yLjEiCiwgIi9nbGFk
---
> _ModuleTable011_=ay9zbGV2aXMvc3BhY2stZG93bnN0cmVhbXMvZGVyZWNoby9tb2R1bGVzLzIzLjA5L0NvcmUiLCAiL2dsYWRlL3UvYXBwcy9kZXJlY2hvL21vZHVsZXMvMjMuMDkvQ29yZSIKLCAiL2dsYWRlL3UvYXBwcy9jc2VnL2RlcmVjaG8vbW9kdWxlcy8yMy4wOS9vbmVhcGkvMjAyMy4yLjEiCiwgIi9nbGFkZS93b3JrL3NsZXZpcy9zcGFjay1kb3duc3RyZWFtcy9kZXJlY2hvL21vZHVsZXMvMjMuMDkvb25lYXBpLzIwMjMuMi4xIgosICIvZ2xhZGUvdS9hcHBzL2RlcmVjaG8vbW9kdWxlcy8yMy4wOS9vbmVhcGkvMjAyMy4yLjEiCiwgIi9nbGFkZS91L2FwcHMvY3NlZy9kZXJlY2hvL21vZHVsZXMvMjMuMDkvY3JheS1tcGljaC84LjEuMjcvb25lYXBpLzIwMjMuMi4xIgosICIvZ2xhZGUv
349d344
< XTERM_LOCALE=en_US.UTF-8

slevis-lmwg avatar May 18 '24 01:05 slevis-lmwg

I will likely be offline from now until Tuesday or Wednesday.

slevis-lmwg avatar May 18 '24 01:05 slevis-lmwg

The only significant difference is in the xterm variables - but I don't have any idea why lilac might care about that?
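
On the `_ModuleTableNNN_` lines in that diff: Lmod stores its module table as base64 text split across numbered variables, so a small whitespace change in the serialized table shifts every later chunk and makes them all look different. The chunks above appear to decode to the same paths with different line wrapping. A sketch for checking that, assuming the chunks are consecutive pieces of one base64 string; `decode_module_table`, `normalized` and `to_chunks` are hypothetical helpers, and `to_chunks` exists only to build demo input.

```python
import base64
import re


def decode_module_table(env: dict[str, str]) -> str:
    """Join _ModuleTableNNN_ chunks in numeric order and base64-decode them."""
    pattern = re.compile(r"^_ModuleTable(\d+)_$")
    chunks = sorted(
        (int(m.group(1)), value)
        for key, value in env.items()
        if (m := pattern.match(key))
    )
    return base64.b64decode("".join(v for _, v in chunks)).decode()


def normalized(text: str) -> str:
    """Drop all whitespace so that line-wrapping differences do not count."""
    return re.sub(r"\s+", "", text)


def to_chunks(text: str, size: int) -> dict[str, str]:
    """Demo helper: encode text and split it into numbered chunks."""
    encoded = base64.b64encode(text.encode()).decode()
    return {
        f"_ModuleTable{i + 1:03d}_": encoded[j:j + size]
        for i, j in enumerate(range(0, len(encoded), size))
    }


# Two serializations that differ only in where the line breaks fall.
env_a = to_chunks('mpathA = {\n"/a"\n, "/b"\n}', 8)
env_b = to_chunks('mpathA = {\n"/a", "/b"\n}', 8)
same = normalized(decode_module_table(env_a)) == normalized(decode_module_table(env_b))
print(env_a != env_b, same)
```

If the real tables compare equal this way, the `_ModuleTable` lines can be ignored when looking for the variable LILAC depends on.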

jedwards4b avatar May 18 '24 01:05 jedwards4b

git grep -i mct returns very little stuff now.

Question: May I remove (or rename) this:

src/main/glc2lndMod.F90:     procedure, public  :: set_glc2lnd_fields_mct   ! set coupling fields sent from glc to lnd
src/main/glc2lndMod.F90:  subroutine set_glc2lnd_fields_mct(this, bounds, glc_present, x2l, &
src/main/glc2lndMod.F90:    character(len=*), parameter :: subname = 'set_glc2lnd_fields_mct'
src/main/glc2lndMod.F90:  end subroutine set_glc2lnd_fields_mct

@slevis-lmwg we should just remove this. And of course make sure everything works without it. It should, and if not it's likely an easy thing to fix that we'd want to do anyway.

ekluzek avatar May 20 '24 18:05 ekluzek