Refactoring of neon_site into `tower_site` and `neon_site`
Description of changes
This PR will create a TowerSite class which defines common functionalities that are in both NeonSite and Plumber2Site classes. With this structure, we will be able to implement a Plumber2Site class parallel to the NeonSite class that utilizes structure and functionalities in the parent classes.
Contributors other than yourself, if any:
@ekluzek @adrifoster
CTSM Issues Fixed (include github issue #):
This will address part of https://github.com/ESCOMP/CTSM/issues/1487
Are answers expected to change (and if so in what way)?
No, this PR should be BFB.
Any User Interface Changes (namelist or namelist defaults changes)?
No
Note: I updated the branch I was using, so this is a continuation of the work done in #2344
Just a heads up that I've moved this to point at the new b4b-dev branch instead of master. This should allow it to avoid backlog and get merged sooner once it's ready.
@ekluzek @adrifoster per our conversation last week, I'm adding you both as reviewers! Note that some changes may be made to pull out NEON-specific or PLUMBER-specific features once the PLUMBER class creation is underway.
There are some tests for modifying user namelists in CTSM/python/ctsm/test/test_unit_neon_site.py.
Some additional discussion on unit tests' current implementation is in #2315.
I have addressed other comments that relate to what we wanted to do now.
Regarding 'make all', I'm hoping to test this but am currently out of scratch space and am thus running into errors... Fred is making room elsewhere for a simulation that's taking up the majority of that space, so hopefully I can test this soon.
Quick update, I'm currently running into some errors with test_sys_run_neon.TestSysRunNeon in make all but not with python3 ./run_ctsm_py_tests --sys. In Sam Rabin's comment mentioned here, changing directories out of the temp dir resolved this issue previously; however, that doesn't seem to be resolving the issue in this case. I am thinking this may be a result of building in tower_site instead of neon_site, but I still need to look into that in more detail to confirm.
The following pylint errors are also present:
************* Module ctsm.crop_calendars.process_ggcmi_shdates
ctsm/crop_calendars/process_ggcmi_shdates.py:450:4: C0206: Consider iterating with .items() (consider-using-dict-items)
ctsm/crop_calendars/process_ggcmi_shdates.py:501:8: C0206: Consider iterating with .items() (consider-using-dict-items)
************* Module ctsm.crop_calendars.grid_one_variable
ctsm/crop_calendars/grid_one_variable.py:25:16: R1735: Consider using {} instead of dict() (use-dict-literal)
************* Module ctsm.test.test_unit_singlept_data_surfdata
ctsm/test/test_unit_singlept_data_surfdata.py:489:8: E1137: "self.ds_test['PCT_URBAN']" does not support item assignment (unsupported-assignment-operation)
ctsm/test/test_unit_singlept_data_surfdata.py:1051:8: E1137: "self.ds_test['PCT_URBAN']" does not support item assignment (unsupported-assignment-operation)
@TeaganKing do you need some help with this or can you continue to figure this out? Maybe we could meet say Friday sometime?
The changing out of temp-dir thing you mention is sometimes tricky because you have to do it on the test that's before the one where it breaks. So it might appear in the files you worked with, but it really originates in whatever test was done before yours. Hopefully, that makes sense enough to fix it, but otherwise let's meet later...
@ekluzek It does look like the temp dir that's causing errors was set up during test_sys_run_neon.test_one_site, which makes me think that the issue is actually during this step? I'll look into this more tomorrow and if I'm still stuck, I'll reach out for help on Friday. Thanks!
A few additional notes on the make all test error:
- Running
python run_ctsm_pytests -sworked, butmake allfailed. - Commenting out
test_sys_run_neon.test_one_siteresults in all tests passing; so the error must be encountered during that particular test rather than as a result of other tests. - This error is not encountered when running neon at a particular site, eg with
./run_neon --neon-sites DEJU, the model builds and runs successfully. - Setting up real (permanent throughout the test) directories instead of temporary directories (that are removed) in
test_sys_run_neonalso did not resolve the issue.
Error details: Occurs after this expected message:
---- building a base case -------
---- creating a base case -------
---- base case created ------
---- base case setup ------
---- base case build ------
--- This may take a while and you may see WARNING messages ---
Error:
======================================================================
ERROR: test_one_site (test.test_sys_run_neon.TestSysRunNeon)
This test specifies a site to run
----------------------------------------------------------------------
Traceback (most recent call last):
File "/glade/work/tking/temp/CTSM/cime/CIME/utils.py", line 704, in run_sub_or_cmd
getattr(mod, subname)(*subargs)
File "/glade/work/tking/temp/CTSM/cime/CIME/build_scripts/buildlib.mpi-serial", line 74, in buildlib
run_bld_cmd_ensure_logging(cmd, logger)
File "/glade/work/tking/temp/CTSM/cime/CIME/utils.py", line 2611, in run_bld_cmd_ensure_logging
expect(stat == 0, filter_unicode(errput))
File "/glade/work/tking/temp/CTSM/cime/CIME/utils.py", line 175, in expect
raise exc_type(msg)
CIME.utils.CIMEError: ERROR: cat: Filepath: No such file or directory
/glade/derecho/scratch/tking/tmp/tmp7i5f_fyy/BART/Tools/Makefile:195: *** NETCDF not found: Define NETCDF_PATH or NETCDF_C_PATH and NETCDF_FORTRAN_PATH in config_machines.xml or cmake_macros. Stop.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/glade/work/tking/temp/CTSM/python/ctsm/test/test_sys_run_neon.py", line 61, in test_one_site
main("")
File "/glade/work/tking/temp/CTSM/python/ctsm/site_and_regional/run_neon.py", line 227, in main
cesmroot, output_root, res, compset, user_mods_dirs, overwrite, setup_only
File "/glade/work/tking/temp/CTSM/python/ctsm/site_and_regional/neon_site.py", line 51, in build_base_case
case_path = super().build_base_case(cesmroot, output_root, res, compset, user_mods_dirs)
File "/glade/work/tking/temp/CTSM/python/ctsm/site_and_regional/tower_site.py", line 154, in build_base_case
build.case_build(case_path, case=case)
File "/glade/work/tking/temp/CTSM/cime/CIME/build.py", line 1329, in case_build
return run_and_log_case_status(functor, cb, caseroot=caseroot)
File "/glade/work/tking/temp/CTSM/cime/CIME/utils.py", line 2477, in run_and_log_case_status
rv = func()
File "/glade/work/tking/temp/CTSM/cime/CIME/build.py", line 1320, in <lambda>
dry_run,
File "/glade/work/tking/temp/CTSM/cime/CIME/build.py", line 1212, in _case_build_impl
complist,
File "/glade/work/tking/temp/CTSM/cime/CIME/build.py", line 840, in _build_libraries
logfile=file_build,
File "/glade/work/tking/temp/CTSM/cime/CIME/utils.py", line 716, in run_sub_or_cmd
expect(False, "{} FAILED, cat {}".format(cmd, logfile))
File "/glade/work/tking/temp/CTSM/cime/CIME/utils.py", line 175, in expect
raise exc_type(msg)
CIME.utils.CIMEError: ERROR: /glade/work/tking/temp/CTSM/cime/CIME/build_scripts/buildlib.mpi-serial FAILED, cat /glade/derecho/scratch/tking/tmp/tmp7i5f_fyy/BART/bld/mpi-serial.bldlog.240227-101229
@ekluzek I will reach out to you for some help sorting out these details!
I wanted to note a few more thoughts on these test failures after meeting with @ekluzek earlier today.
- We noticed that the tests were succesful prior to some derecho updates / machine maintenance, so this could possibly be tied to the updates to a newer version of CIME.
- Case.build within the test case directory works fine; this shows that the case setup is done as expected, but for some reason the method of calling case build from within run_neon with the test setup seems to be failing.
- We considered that there could be an absolute path error; however, using the same absolute paths as in the test and running by hand is successful (eg,
/glade/u/home/tking/work/temp/CTSM/tools/site_and_regional/run_neon --neon-sites BART --setup-only --output-root /glade/derecho/scratch/tking/tmp/temp). - Looking into Makefile,in the
pythondirectory it looks pretty straightforward and does not change directories or do anything that would affect a full path. - Note that
make allrunsmake utestandmake stest(the latter being the relevant part to this issue).make stestfails. ./run_ctsm_py_tests -ssucceeds and usespython3but through the shebang line- Running
module load ncoand thenpython3 ./run_ctsm_py_tests --sysis also successful (which is essentially what make stest does).
This issue is probably tied to details regarding how our testing is performed, and may require some additional brainstorming with folks who are more familiar with our testing suite.
I replicated the above behavior in my own sandbox. So it's reproducible. I don't know what might be different inside the Makefile environment than outside, which makes this strange.
A previous problem we've run into is that we are removing directory trees, but not making sure that isn't the current directory. So I tried making that update to other places, and it acts the same way. I do see something that does need to be fixed in the main code that I'll add a comment for. I'm doubtful that's the issue, but it is something good to fix.
I did also notice that with a fresh CTSM checkout, if I use the tower_site.py and other files that I've changed, this fails, but if I use the old files, 'make all' does not fail... So this probably is also a result of splitting up run_neon into multiple classes/files. @ekluzek when you say you tested this in your own sandbox, I assume that is with these code changes specifically rather than the current CTSM code?
Specifically moving build_base_case back into neon_site.py results in successful tests.
@TeaganKing yes exactly. It fails in my sandbox where I checked out your branch, but it passes in master or b4b-dev.
I don't think it's really about anything that you did, but something more subtle in the testing. What I'd like to do is to see if we can step through what's going on with a python debugger, or add more print statements to each thing going on to see how the Makefile environment is different from running directly. One guess I have is that Maybe there are different environment variable settings when run from inside the Makefile, than when run outside of make. Just checked that and the Makefile adds the following env settings:
> MAKEFLAGS=
> MAKELEVEL=1
> MAKE_TERMERR=/dev/pts/1
> MAKE_TERMOUT=/dev/pts/1
144a149
> MFLAGS=
Which doesn't look like it should matter...
Sometimes when new code is added the code can be fine, but it exposes an existing problem. This is rare, but when it happens it becomes really hard to track down.
In a discussion at the CTSM software meeting, we decided to remove the make stest and make utest components of the makefile due to redundancy with the python tests, so we will not resolve the make all issue mentioned above. The makefile changes will be done in a separate PR. The python tests were successful, so this should be good to merge in. @ekluzek I made some comments to your suggestions above.
Thanks Erik. Just to put everything in one place, I will describe all of the unresolved conversations here.
Testing-related issues which we decided to leave as is for now:
In neon_site.py, we could add an error check that the user_mods_dirs directory is valid.
TowerSite.str could be unit tested, although it's probably not essential at the moment.
TowerSite.get_batch_query and set_ref_case could be unit tested.
Testing outside the scope of this work:
We should generally add a utility for removing a directory tree since this problem seems to be repeatedly causing errors. This should be a new issue ticket that will probably also be relevant in other parts of CTSM, in addition to TowerSite.build_base_case and TowerSite.run_case.
Further Refactoring:
The TowerSite.build_base_case section that starts with with Case(case_path, read_only=False) as case: could be brought out into a separate method. run_case should also be broken down further in the future.
Cool, thanks @TeaganKing. I want to get this into b4b-dev so it'll be on the next tag there that we move to main. Which means I need to get it on by tomorrow. So I'll just open those issues talked about above, and close the conversations and all that.
I also did some work on the branch to deal with the temporary directory issue, so I'll check that in. It didn't solve it, but should improve the future. So I might as well check that in here.
Thanks for the work on this. Looking forward to the next piece here.
Awesome, thank you Erik!
python testing is good and clm_pymods tests on Derecho pass. So merging in.