CTSM

fates parameter file auto-build for all tests

Open rgknox opened this issue 1 year ago • 22 comments

Description of changes

This enables automatic building of the FATES parameter file binary for all tests. It calls ncgen from the shell_commands script in the Fates testdef folder to convert the version-controlled FATES default parameter file (a CDL text file) into a binary netCDF file.
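A minimal Python sketch of that conversion step, for illustration only. It assumes `ncgen` is on `PATH` and that the input is the CDL text version of the default parameter file; the helper names and paths are hypothetical and are not what the shell_commands script actually contains. It creates the output directory first, because ncgen does not create missing directories for its output file.

```python
import shutil
import subprocess
from pathlib import Path


def ncgen_command(cdl_path, nc_path):
    """Return the argv list that converts a CDL text file to binary netCDF."""
    # ncgen writes the binary file named by -o; options precede the input file.
    return ["ncgen", "-o", str(nc_path), str(cdl_path)]


def build_param_binary(cdl_path, nc_path):
    """Hypothetical helper: build nc_path from cdl_path with ncgen."""
    if shutil.which("ncgen") is None:
        raise FileNotFoundError("ncgen was not found on PATH")
    # Make sure the destination directory exists before ncgen tries to write there.
    Path(nc_path).parent.mkdir(parents=True, exist_ok=True)
    # check=True raises CalledProcessError if ncgen exits with a non-zero status.
    subprocess.run(ncgen_command(cdl_path, nc_path), check=True)
    return Path(nc_path)
```

For example, `build_param_binary("fates_params_default.cdl", "binaries/mytest-params.nc")` would run `ncgen -o binaries/mytest-params.nc fates_params_default.cdl` after creating `binaries/`.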

Specific notes

This implementation is incomplete. To get it working for all tests, I had to place the newly built binaries in a new folder in the FATES source tree. This is because some of the tests are multi-phase (PEM, ERP, etc.), and each phase needs access to either the same parameter file or an exact copy of it. However, as far as my tests show, the shell_commands script is only called the first time, so both parts of the test need access to the same file. Unfortunately, the XML files in the two parts of a test do not provide any file path on the scratch partition that is common to both phases (I looked pretty thoroughly, but maybe missed something). For instance, the two phases have different case names, which makes it hard to locate the parameter file in the second phase if its case differs from the first. I also tried using CIME_OUTPUT_ROOT, the sharedlib build location.

env_test.xml has an entry, TEST_ARGV, that holds the root folder of the current test environment and the ID of the specific test being run. With these two pieces of information, we could place a binary file where every phase of a test can reach it. However, this information does not seem to be available via xmlquery at the time the shell_commands script runs.

Another option, possibly better than the source tree at least for now, would be to put all these files in CIME_OUTPUT_ROOT, which is usually just the scratch folder where all cases and tests go. Each parameter file name could include the test name and hash to keep the files distinct. The downside is that the root scratch folder starts to fill up.
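A sketch of that naming scheme, under two assumptions: the file name is built from the full test name plus the test ID (the `<testname>.<testid>-params.nc` pattern that appears in the list of generated binaries later in this thread), and the caller already knows CIME_OUTPUT_ROOT, for example from `./xmlquery --value CIME_OUTPUT_ROOT` in the case directory. `param_binary_path` is a hypothetical helper, not existing CTSM or CIME code.

```python
from pathlib import Path


def param_binary_path(output_root, test_name, test_id):
    """Hypothetical helper: one parameter-file path shared by every phase of a test.

    The name depends only on the test name and test ID, not on the case
    directory of a given phase, so (assuming both phases can see those two
    values) each phase can recompute the same path independently.
    """
    return Path(output_root) / f"{test_name}.{test_id}-params.nc"
```

The trade-off is the one noted above: these files accumulate directly in the scratch root unless something cleans them up.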

Contributors other than yourself, if any:

@ekluzek @glemieux @adrifoster

Are answers expected to change (and if so in what way)?

Any User Interface Changes (namelist or namelist defaults changes)?

Testing performed, if any:

rgknox avatar Jan 25 '24 15:01 rgknox

Here is the list of parameter-file binaries it generates; one fates test run creates 5.9M of data:

~/ctsm/src/fates/parameter_files> ls binaries/
ERP_D_Ld3.f19_g17.I2000Clm50FatesCruRsGs.derecho_intel.clm-FatesCold.0124-200432de_int-params.nc
ERP_D_Ld3.f19_g17.I2000Clm50FatesCruRsGs.derecho_intel.clm-FatesCold.20240124_191528_ohyuz4-params.nc.text
ERP_D_Ld3.f19_g17.I2000Clm50FatesCruRsGs.derecho_intel.clm-FatesCold.20240124_193525_2d6yfg-params.nc
ERP_D_P128x2_Ld3.f19_g17.I2000Clm50FatesCru.derecho_intel.clm-FatesCold.0124-200432de_int-params.nc
ERP_Ld3.f09_g17.I2000Clm50FatesRs.derecho_intel.clm-FatesCold.0124-200432de_int-params.nc
ERP_P256x2_Ld30.f45_f45_mg37.I2000Clm51FatesRs.derecho_intel.clm-mimicsFatesCold.0124-200432de_int-params.nc
ERS_D_Ld15.5x5_amazon.I2000Clm50FatesRs.derecho_gnu.clm-FatesColdSeedDisp.0124-200432de_gnu-params.nc
ERS_D_Ld15.f45_f45_mg37.I2000Clm50FatesRs.derecho_gnu.clm-FatesColdTwoStreamNoCompFixedBioGeo.0124-200432de_gnu-params.nc
ERS_D_Ld15.f45_f45_mg37.I2000Clm50FatesRs.derecho_intel.clm-FatesColdTreeDamage.0124-200432de_int-params.nc
ERS_D_Ld15.f45_f45_mg37.I2000Clm50FatesRs.derecho_intel.clm-FatesColdTwoStream.0124-200432de_int-params.nc
ERS_D_Ld30.f45_f45_mg37.I2000Clm50FatesCruRsGs.derecho_intel.clm-FatesColdLandUse.0124-200432de_int-params.nc
ERS_D_Ld30.f45_f45_mg37.I2000Clm50FatesCruRsGs.derecho_intel.clm-FatesColdLUH2.0124-200432de_int-params.nc
ERS_D_Ld30.f45_f45_mg37.I2000Clm50FatesCruRsGs.derecho_intel.clm-FatesColdPRT2.0124-200432de_int-params.nc
ERS_D_Ld3.f19_g17.I2000Clm50FatesCruRsGs.derecho_gnu.clm-FatesCold.0124-200432de_gnu-params.nc
ERS_D_Ld3.f19_g17.I2000Clm50FatesCruRsGs.derecho_intel.clm-FatesCold.0124-200432de_int-params.nc
ERS_D_Ld5.1x1_brazil.I2000Clm50FatesCruRsGs.derecho_intel.clm-FatesColdHydro.0124-200432de_int-params.nc
ERS_D_Ld5.f10_f10_mg37.I2000Clm50Fates.derecho_intel.clm-FatesCold.0124-200432de_int-params.nc
ERS_D_Mmpi-serial_Ld5.1x1_brazil.I2000Clm50FatesCruRsGs.derecho_gnu.clm-FatesCold.0124-200432de_gnu-params.nc
ERS_Ld30.f45_f45_mg37.I2000Clm50FatesCruRsGs.derecho_intel.clm-FatesColdFixedBiogeo.0124-200432de_int-params.nc
ERS_Ld30.f45_f45_mg37.I2000Clm50FatesCruRsGs.derecho_intel.clm-FatesColdNoComp.0124-200432de_int-params.nc
ERS_Ld30.f45_f45_mg37.I2000Clm50FatesCruRsGs.derecho_intel.clm-FatesColdNoCompFixedBioGeo.0124-200432de_int-params.nc
ERS_Ld30.f45_f45_mg37.I2000Clm50FatesCruRsGs.derecho_intel.clm-FatesColdSizeAgeMort.0124-200432de_int-params.nc
ERS_Ld5.f19_g17.I2000Clm45Fates.derecho_intel.clm-FatesCold.0124-200432de_int-params.nc
ERS_Ld60.f45_f45_mg37.I2000Clm50FatesCruRsGs.derecho_intel.clm-Fates.0124-200432de_int-params.nc
ERS_Ld60.f45_f45_mg37.I2000Clm50FatesCruRsGs.derecho_intel.clm-FatesColdLogging.0124-200432de_int-params.nc
ERS_Ld60.f45_f45_mg37.I2000Clm50FatesCruRsGs.derecho_intel.clm-FatesColdNoFire.0124-200432de_int-params.nc
ERS_Ld60.f45_f45_mg37.I2000Clm50FatesCruRsGs.derecho_intel.clm-FatesColdPPhys.0124-200432de_int-params.nc
ERS_Ld60.f45_f45_mg37.I2000Clm50FatesCruRsGs.derecho_intel.clm-FatesColdST3.0124-200432de_int-params.nc
ERS_Ld9.f10_f10_mg37.I2000Clm50FatesCruRsGs.derecho_intel.clm-FatesColdCH4Off.0124-200432de_int-params.nc
ERS_Lm13.f10_f10_mg37.I2000Clm50Fates.derecho_gnu.clm-FatesCold.0124-200432de_gnu-params.nc
ERS_Lm13.f45_f45_mg37.I2000Clm50Fates.derecho_intel.clm-FatesColdNoComp.0124-200432de_int-params.nc
ERS_P128x1_Lm25.f10_f10_mg37.I2000Clm51Fates.derecho_intel.clm-FatesColdNoComp.0124-200432de_int-params.nc
PEM_D_Ld15.5x5_amazon.I2000Clm50FatesRs.derecho_gnu.clm-FatesColdSeedDisp.0124-200432de_gnu-params.nc
SMS_Ld10_D_Mmpi-serial.CLM_USRDAT.I1PtClm51Fates.derecho_gnu.clm-FatesPRISM--clm-NEON-FATES-YELL.0124-200432de_gnu-params.nc
SMS_Lm13.1x1_brazil.I2000Clm50FatesCruRsGs.derecho_gnu.clm-FatesCold.0124-200432de_gnu-params.nc
SMS_Lm13.1x1_brazil.I2000Clm50FatesCruRsGs.derecho_intel.clm-FatesCold.0124-200432de_int-params.nc
SMS_Lm3_D_Mmpi-serial.1x1_brazil.I2000Clm50FatesCruRsGs.derecho_gnu.clm-FatesColdHydro.0124-200432de_gnu-params.nc
SMS_Lm6.f45_f45_mg37.I2000Clm50FatesCruRsGs.derecho_intel.clm-Fates.0124-200432de_int-params.nc
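To check how much space these files take, a short sketch that totals their sizes. It assumes the binaries all sit in one directory and that their names contain `params.nc` (the pattern below also catches the odd `.nc.text` entry); `total_param_bytes` and `human_size` are hypothetical names. Sizes are apparent file sizes, so the figure can differ slightly from what `du -h` reports.

```python
from pathlib import Path


def total_param_bytes(binaries_dir, pattern="*params.nc*"):
    """Sum the sizes, in bytes, of regular files in binaries_dir matching pattern."""
    return sum(p.stat().st_size for p in Path(binaries_dir).glob(pattern) if p.is_file())


def human_size(n_bytes):
    """Format a byte count with 1024-based units (B, K, M, G)."""
    size = float(n_bytes)
    for unit in ("B", "K", "M", "G"):
        if size < 1024 or unit == "G":
            return f"{size:.1f}{unit}"
        size /= 1024
```

Usage: `print(human_size(total_param_bytes("src/fates/parameter_files/binaries")))`.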

rgknox avatar Jan 25 '24 16:01 rgknox

This should probably happen in a SystemTest, not in shell_commands. See #2335.

samsrabin avatar Jan 25 '24 16:01 samsrabin

See also discussion from CTSM SE meeting here.

samsrabin avatar Jan 25 '24 16:01 samsrabin

I'm unable to get this to work on Izumi. It might be something wrong with my environment. Could someone else try this and see?

./run_sys_tests --skip-compare --skip-generate -t ERS_D_Ld15.f45_f45_mg37.I2000Clm50FatesRs.izumi_nag.clm-FatesColdTwoStream

samsrabin avatar Feb 12 '24 19:02 samsrabin

I'm seeing a failure trying this as well with the following:

RUN: /scratch/cluster/glemieux/tests_0215-110541iz/ERS_D_Ld15.f45_f45_mg37.I2000Clm50FatesRs.izumi_nag.clm-FatesColdTwoStream.0215-110541iz/shell_commands
FROM: /scratch/cluster/glemieux/tests_0215-110541iz/ERS_D_Ld15.f45_f45_mg37.I2000Clm50FatesRs.izumi_nag.clm-FatesColdTwoStream.0215-110541iz
  stat: 1

  errput: ncgen: No such file or directory
        (../../ncgen/genbin.c:58)
Traceback (most recent call last):
  File "/home/glemieux/ctsm/src/fates/tools/modify_fates_paramfile.py", line 355, in <module>
    main()
  File "/home/glemieux/ctsm/src/fates/tools/modify_fates_paramfile.py", line 95, in main
    shutil.copyfile(args.inputfname, tempfilename)
  File "/cluster/anaconda-23.11.0/lib/python3.11/shutil.py", line 256, in copyfile
    with open(src, 'rb') as fsrc:
         ^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: '/home/glemieux/ctsm/src/fates/parameter_files/binaries/ERS_D_Ld15.f45_f45_mg37.I2000Clm50FatesRs.izumi_nag.clm-FatesColdTwoStream.0215-110541iz-params.nc'
Leaving broken case dir /scratch/cluster/glemieux/tests_0215-110541iz/ERS_D_Ld15.f45_f45_mg37.I2000Clm50FatesRs.izumi_nag.clm-FatesColdTwoStream.0215-110541iz

Looking at the fates directory structure, the binaries directory isn't getting built ~for some reason~ because ncgen isn't found. @samsrabin is this the same error you were seeing?

glemieux avatar Feb 15 '24 18:02 glemieux

That is indeed the same error I was seeing, but I don't think it's ncgen not being found. I think there might be a problem with the ncgen installation.

samsrabin avatar Feb 15 '24 18:02 samsrabin

I'm actually getting a similar error on Derecho, too. From SMS_D_Ld3.f10_f10_mg37.I2000Clm50FatesRs.derecho_intel:

Adding user mods directory /glade/u/home/samrabin/ctsm_fates-auto-params/cime_config/testdefs/testmods_dirs/clm/Fates
RUN: /glade/derecho/scratch/samrabin/tests_0215-114155de/SMS_D_Ld3.f10_f10_mg37.I2000Clm50FatesRs.derecho_intel.clm-Fates.0215-114155de/shell_commands
FROM: /glade/derecho/scratch/samrabin/tests_0215-114155de/SMS_D_Ld3.f10_f10_mg37.I2000Clm50FatesRs.derecho_intel.clm-Fates.0215-114155de
  stat: 1

  errput: ncgen: No such file or directory
	(/home/conda/feedstock_root/build_artifacts/libnetcdf_1650908392318/work/ncgen/genbin.c:genbin_netcdf:63)

Tests in /glade/derecho/scratch/samrabin/tests_0215-114155de/.

samsrabin avatar Feb 15 '24 19:02 samsrabin

The issue is that ${SRCROOT}/src/fates/parameter_files/binaries/ doesn't exist. Will fix in my PR.
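The `ncgen: No such file or directory` message in these logs can be read as "ncgen not found", as a broken ncgen install, or as a missing output directory. A small pre-flight check, with hypothetical names, separates those cases before ncgen is ever called; it assumes the same CDL input and output path that the shell_commands script would pass to ncgen.

```python
import shutil
from pathlib import Path


def ncgen_preflight(cdl_path, nc_path):
    """Report which preconditions for `ncgen -o nc_path cdl_path` hold."""
    return {
        # Full path ncgen resolves to on PATH, or None if it is not found at all.
        "ncgen": shutil.which("ncgen"),
        # The CDL source must exist and be a regular file.
        "cdl_exists": Path(cdl_path).is_file(),
        # ncgen will not create a missing parent directory for its output.
        "output_dir_exists": Path(nc_path).parent.is_dir(),
    }
```

If `ncgen` is a path and both other entries are True but ncgen still fails, the installation itself becomes the more likely suspect; if `output_dir_exists` is False, a `mkdir -p` on that directory is enough.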

samsrabin avatar Feb 15 '24 19:02 samsrabin

@samsrabin thanks for the work on this. Note that fates/parameter_files is under the FATES external, so adding a binaries subdirectory would require a PR to FATES. And I think for git, you need at least a README file in the directory for it to show up when you check it out...

ekluzek avatar Feb 15 '24 19:02 ekluzek

@ekluzek ~~Good point; adding this directory will make the FATES checkout unclean. @rgknox I think you need to make a new FATES tag that has an empty parameter_files/binaries/ directory, then update Externals.cfg here to point to that.~~ No; see below.

samsrabin avatar Feb 15 '24 19:02 samsrabin

Wait, @ekluzek, even if the new directory is canonically in FATES, won't the checkout be unclean once the parameter file is generated?

samsrabin avatar Feb 15 '24 19:02 samsrabin

Actually… the checkout looks clean. manage_externals/checkout_externals -S gives no warning, and git status in src/fates is clean, even with the new directory and parameter files generated. This might be because .nc files are ignored by src/fates/.gitignore.

samsrabin avatar Feb 15 '24 19:02 samsrabin

@samsrabin yes, exactly. But it's good that you showed that's the case; it's good to confirm.

ekluzek avatar Feb 15 '24 20:02 ekluzek

That directory needs to be added in FATES; sorry about that. I'll get it added in the next FATES PR. UPDATE: Sam added a mkdir -p call to the scripting, so we no longer need this directory added.

rgknox avatar Feb 15 '24 21:02 rgknox

That merge commit I just did was to resolve conflicts introduced in my PR. They were only related to run_sys_tests.py and its testing. They're now resolved, and make all in python/ is still clean.

samsrabin avatar Feb 22 '24 17:02 samsrabin

clm_aux on Derecho: OK, with one exception:

~~FAIL SMS.f10_f10_mg37.I2000Clm50BgcCrop.derecho_nvhpc.clm-crop MODEL_BUILD time=212~~ (THIS TEST PASSES AFTER RESUBMITTING)

FAIL DAE_C2_D_Lh12.f10_f10_mg37.I2000Clm50BgcCrop.derecho_intel.clm-DA_multidrv RUN time=304

ERROR: ERROR: Unrecognized line ('/bin/bash: module: line 1: syntax error: unexpected end of file

rgknox avatar Feb 24 '24 14:02 rgknox

That dang DAE test! Try resubmitting it.

samsrabin avatar Feb 24 '24 20:02 samsrabin

I tried the izumi test that was failing before and it works for me, so I checked that item off, which puts this in ready-to-merge mode.

ekluzek avatar Feb 26 '24 23:02 ekluzek

This test fails at case creation: ERS_D_Ld15.f45_f45_mg37.I2000Clm50FatesRs.izumi_nag.clm-FatesColdTwoStream

RUN: /scratch/cluster/rgknox/tests_0226-121631iz/ERS_D_Ld15.f45_f45_mg37.I2000Clm50FatesRs.izumi_nag.clm-FatesColdTwoStream.GC.0226-121631iz_nag/shell_commands
FROM: /scratch/cluster/rgknox/tests_0226-121631iz/ERS_D_Ld15.f45_f45_mg37.I2000Clm50FatesRs.izumi_nag.clm-FatesColdTwoStream.GC.0226-121631iz_nag
  stat: 1

  errput: Traceback (most recent call last):
  File "/home/rgknox/ctsm/src/fates/tools/modify_fates_paramfile.py", line 35, in <module>
    from scipy.io import netcdf as nc
ImportError: No module named scipy.io
Leaving broken case dir /scratch/cluster/rgknox/tests_0226-121631iz/ERS_D_Ld15.f45_f45_mg37.I2000Clm50FatesRs.izumi_nag.clm-FatesColdTwoStream.GC.0226-121631iz_nag
ERROR: Command: '/scratch/cluster/rgknox/tests_0226-121631iz/ERS_D_Ld15.f45_f45_mg37.I2000Clm50FatesRs.izumi_nag.clm-FatesColdTwoStream.GC.0226-121631iz_nag/shell_commands' failed with error 'Traceback (most recent call last):
  File "/home/rgknox/ctsm/src/fates/tools/modify_fates_paramfile.py", line 35, in <module>
    from scipy.io import netcdf as nc
ImportError: No module named scipy.io' from dir '/scratch/cluster/rgknox/tests_0226-121631iz/ERS_D_Ld15.f45_f45_mg37.I2000Clm50FatesRs.izumi_nag.clm-FatesColdTwoStream.GC.0226-121631iz_nag'

 ---------------------------------------------------
2024-02-26 13:08:10: CREATE_NEWCASE FAILED for test 'ERS_D_Ld15.f45_f45_mg37.I2000Clm50FatesRs.izumi_nag.clm-FatesColdTwoStream'.

However, I'm able to import scipy manually when I run python. Also, this test passed when I ran it stand-alone, i.e., this command succeeded:

./create_test ERS_D_Ld15.f45_f45_mg37.I2000Clm50FatesRs.izumi_nag.clm-FatesColdTwoStream
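One thing worth checking is which Python interpreter shell_commands actually runs. The unquoted `ImportError: No module named scipy.io` form looks like Python 2 output (Python 3 would normally report `ModuleNotFoundError: No module named 'scipy'`), which hints, without proving, that the test environment picks up a different interpreter than the interactive shell. A minimal sketch with a hypothetical helper name that records this:

```python
import importlib.util
import sys


def interpreter_report(modules=("scipy", "netCDF4")):
    """Describe the running interpreter and whether each top-level module is importable."""
    report = {
        "executable": sys.executable,
        "version": ".".join(str(part) for part in sys.version_info[:3]),
    }
    for name in modules:
        # find_spec returns None when a top-level module cannot be found.
        report[name] = importlib.util.find_spec(name) is not None
    return report


if __name__ == "__main__":
    for key, value in interpreter_report().items():
        print("%s: %s" % (key, value))
```

Run it with the same `python` that the script invokes. If it fails outright with an ImportError on `importlib.util`, that by itself indicates a Python 2 interpreter.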

rgknox avatar Feb 27 '24 02:02 rgknox

DAE_C2_D_Lh12.f10_f10_mg37.I2000Clm50BgcCrop.derecho_intel.clm-DA_multidrv also still fails after re-submitting

rgknox avatar Feb 27 '24 02:02 rgknox

Since this isn't critical to bring in now, we plan to delay it until the conda env issue on izumi is fixed (I think #2385 will fix this). @samsrabin also has some analysis showing that there is a race condition in the DAE test that sometimes results in a file being gzipped before something else needs it. I don't think the DAE issue should hold this one up, but that fix is another good thing to have come in.

ekluzek avatar Feb 28 '24 17:02 ekluzek