CTSM icon indicating copy to clipboard operation
CTSM copied to clipboard

Refactoring of neon_site into `tower_site` and `neon_site`

Open TeaganKing opened this issue 1 year ago • 7 comments

Description of changes

This PR will create a TowerSite class which defines common functionalities that are in both NeonSite and Plumber2Site classes. With this structure, we will be able to implement a Plumber2Site class parallel to the NeonSite class that utilizes structure and functionalities in the parent classes.

Contributors other than yourself, if any:

@ekluzek @adrifoster

CTSM Issues Fixed (include github issue #):

This will address part of https://github.com/ESCOMP/CTSM/issues/1487

Are answers expected to change (and if so in what way)?

No, this PR should be BFB.

Any User Interface Changes (namelist or namelist defaults changes)?

No

Note: I updated the branch I was using, so this is a continuation of the work done in #2344

TeaganKing avatar Feb 13 '24 21:02 TeaganKing

Just a heads up that I've moved this to point at the new b4b-dev branch instead of master. This should allow it to avoid backlog and get merged sooner once it's ready.

samsrabin avatar Feb 13 '24 22:02 samsrabin

@ekluzek @adrifoster per our conversation last week, I'm adding you both as reviewers! Note that some changes may be made to pull out NEON-specific or PLUMBER-specific features once the PLUMBER class creation is underway.

TeaganKing avatar Feb 20 '24 17:02 TeaganKing

There are some tests for modifying user namelists in CTSM/python/ctsm/test/test_unit_neon_site.py. Some additional discussion on unit tests' current implementation is in #2315.

I have addressed other comments that relate to what we wanted to do now.

Regarding 'make all', I'm hoping to test this but am currently out of scratch space and am thus running into errors... Fred is making room elsewhere for a simulation that's taking up the majority of that space, so hopefully I can test this soon.

TeaganKing avatar Feb 23 '24 22:02 TeaganKing

Quick update, I'm currently running into some errors with test_sys_run_neon.TestSysRunNeon in make all but not with python3 ./run_ctsm_py_tests --sys. In Sam Rabin's comment mentioned here, changing directories out of the temp dir resolved this issue previously; however, that doesn't seem to be resolving the issue in this case. I am thinking this may be a result of building in tower_site instead of neon_site, but I still need to look into that in more detail to confirm.

The following pylint errors are also present:

************* Module ctsm.crop_calendars.process_ggcmi_shdates
ctsm/crop_calendars/process_ggcmi_shdates.py:450:4: C0206: Consider iterating with .items() (consider-using-dict-items)
ctsm/crop_calendars/process_ggcmi_shdates.py:501:8: C0206: Consider iterating with .items() (consider-using-dict-items)
************* Module ctsm.crop_calendars.grid_one_variable
ctsm/crop_calendars/grid_one_variable.py:25:16: R1735: Consider using {} instead of dict() (use-dict-literal)
************* Module ctsm.test.test_unit_singlept_data_surfdata
ctsm/test/test_unit_singlept_data_surfdata.py:489:8: E1137: "self.ds_test['PCT_URBAN']" does not support item assignment (unsupported-assignment-operation)
ctsm/test/test_unit_singlept_data_surfdata.py:1051:8: E1137: "self.ds_test['PCT_URBAN']" does not support item assignment (unsupported-assignment-operation)

TeaganKing avatar Feb 26 '24 21:02 TeaganKing

@TeaganKing do you need some help with this or can you continue to figure this out? Maybe we could meet say Friday sometime?

The changing out of temp-dir thing you mention is sometimes tricky because you have to do it on the test that's before the one where it breaks. So it might appear in the files you worked with, but it really originates in whatever test was done before yours. Hopefully, that makes sense enough to fix it, but otherwise let's meet later...

ekluzek avatar Feb 26 '24 21:02 ekluzek

@ekluzek It does look like the temp dir that's causing errors was set up during test_sys_run_neon.test_one_site, which makes me think that the issue is actually during this step? I'll look into this more tomorrow and if I'm still stuck, I'll reach out for help on Friday. Thanks!

TeaganKing avatar Feb 26 '24 23:02 TeaganKing

A few additional notes on the make all test error:

  • Running python run_ctsm_pytests -s worked, but make all failed.
  • Commenting out test_sys_run_neon.test_one_site results in all tests passing; so the error must be encountered during that particular test rather than as a result of other tests.
  • This error is not encountered when running neon at a particular site, eg with ./run_neon --neon-sites DEJU, the model builds and runs successfully.
  • Setting up real (permanent throughout the test) directories instead of temporary directories (that are removed) in test_sys_run_neon also did not resolve the issue.

Error details: Occurs after this expected message:

---- building a base case -------
---- creating a base case -------
---- base case created ------
---- base case setup ------
---- base case build ------
--- This may take a while and you may see WARNING messages ---

Error:

======================================================================
ERROR: test_one_site (test.test_sys_run_neon.TestSysRunNeon)
This test specifies a site to run
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/glade/work/tking/temp/CTSM/cime/CIME/utils.py", line 704, in run_sub_or_cmd
    getattr(mod, subname)(*subargs)
  File "/glade/work/tking/temp/CTSM/cime/CIME/build_scripts/buildlib.mpi-serial", line 74, in buildlib
    run_bld_cmd_ensure_logging(cmd, logger)
  File "/glade/work/tking/temp/CTSM/cime/CIME/utils.py", line 2611, in run_bld_cmd_ensure_logging
    expect(stat == 0, filter_unicode(errput))
  File "/glade/work/tking/temp/CTSM/cime/CIME/utils.py", line 175, in expect
    raise exc_type(msg)
CIME.utils.CIMEError: ERROR: cat: Filepath: No such file or directory
/glade/derecho/scratch/tking/tmp/tmp7i5f_fyy/BART/Tools/Makefile:195: *** NETCDF not found: Define NETCDF_PATH or NETCDF_C_PATH and NETCDF_FORTRAN_PATH in config_machines.xml or cmake_macros.  Stop.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/glade/work/tking/temp/CTSM/python/ctsm/test/test_sys_run_neon.py", line 61, in test_one_site
    main("")
  File "/glade/work/tking/temp/CTSM/python/ctsm/site_and_regional/run_neon.py", line 227, in main
    cesmroot, output_root, res, compset, user_mods_dirs, overwrite, setup_only
  File "/glade/work/tking/temp/CTSM/python/ctsm/site_and_regional/neon_site.py", line 51, in build_base_case
    case_path = super().build_base_case(cesmroot, output_root, res, compset, user_mods_dirs)
  File "/glade/work/tking/temp/CTSM/python/ctsm/site_and_regional/tower_site.py", line 154, in build_base_case
    build.case_build(case_path, case=case)
  File "/glade/work/tking/temp/CTSM/cime/CIME/build.py", line 1329, in case_build
    return run_and_log_case_status(functor, cb, caseroot=caseroot)
  File "/glade/work/tking/temp/CTSM/cime/CIME/utils.py", line 2477, in run_and_log_case_status
    rv = func()
  File "/glade/work/tking/temp/CTSM/cime/CIME/build.py", line 1320, in <lambda>
    dry_run,
  File "/glade/work/tking/temp/CTSM/cime/CIME/build.py", line 1212, in _case_build_impl
    complist,
  File "/glade/work/tking/temp/CTSM/cime/CIME/build.py", line 840, in _build_libraries
    logfile=file_build,
  File "/glade/work/tking/temp/CTSM/cime/CIME/utils.py", line 716, in run_sub_or_cmd
    expect(False, "{} FAILED, cat {}".format(cmd, logfile))
  File "/glade/work/tking/temp/CTSM/cime/CIME/utils.py", line 175, in expect
    raise exc_type(msg)
CIME.utils.CIMEError: ERROR: /glade/work/tking/temp/CTSM/cime/CIME/build_scripts/buildlib.mpi-serial FAILED, cat /glade/derecho/scratch/tking/tmp/tmp7i5f_fyy/BART/bld/mpi-serial.bldlog.240227-101229

@ekluzek I will reach out to you for some help sorting out these details!

TeaganKing avatar Feb 27 '24 23:02 TeaganKing

I wanted to note a few more thoughts on these test failures after meeting with @ekluzek earlier today.

  1. We noticed that the tests were succesful prior to some derecho updates / machine maintenance, so this could possibly be tied to the updates to a newer version of CIME.
  2. Case.build within the test case directory works fine; this shows that the case setup is done as expected, but for some reason the method of calling case build from within run_neon with the test setup seems to be failing.
  3. We considered that there could be an absolute path error; however, using the same absolute paths as in the test and running by hand is successful (eg, /glade/u/home/tking/work/temp/CTSM/tools/site_and_regional/run_neon --neon-sites BART --setup-only --output-root /glade/derecho/scratch/tking/tmp/temp ).
  4. Looking into Makefile,in the python directory it looks pretty straightforward and does not change directories or do anything that would affect a full path.
  5. Note that make all runs make utest and make stest (the latter being the relevant part to this issue). make stest fails.
  6. ./run_ctsm_py_tests -s succeeds and uses python3 but through the shebang line
  7. Running module load nco and then python3 ./run_ctsm_py_tests --sys is also successful (which is essentially what make stest does).

This issue is probably tied to details regarding how our testing is performed, and may require some additional brainstorming with folks who are more familiar with our testing suite.

TeaganKing avatar Mar 01 '24 19:03 TeaganKing

I replicated the above behavior in my own sandbox. So it's reproducible. I don't know what might be different inside the Makefile environment than outside, which makes this strange.

A previous problem we've run into is that we are removing directory trees, but not making sure that isn't the current directory. So I tried making that update to other places, and it acts the same way. I do see something that does need to be fixed in the main code that I'll add a comment for. I'm doubtful that's the issue, but it is something good to fix.

ekluzek avatar Mar 05 '24 16:03 ekluzek

I did also notice that with a fresh CTSM checkout, if I use the tower_site.py and other files that I've changed, this fails, but if I use the old files, 'make all' does not fail... So this probably is also a result of splitting up run_neon into multiple classes/files. @ekluzek when you say you tested this in your own sandbox, I assume that is with these code changes specifically rather than the current CTSM code?

Specifically moving build_base_case back into neon_site.py results in successful tests.

TeaganKing avatar Mar 05 '24 17:03 TeaganKing

@TeaganKing yes exactly. It fails in my sandbox where I checked out your branch, but it passes in master or b4b-dev.

I don't think it's really about anything that you did, but something more subtle in the testing. What I'd like to do is to see if we can step through what's going on with a python debugger, or add more print statements to each thing going on to see how the Makefile environment is different from running directly. One guess I have is that Maybe there are different environment variable settings when run from inside the Makefile, than when run outside of make. Just checked that and the Makefile adds the following env settings:

> MAKEFLAGS=
> MAKELEVEL=1
> MAKE_TERMERR=/dev/pts/1
> MAKE_TERMOUT=/dev/pts/1
144a149
> MFLAGS=

Which doesn't look like it should matter...

Sometimes when new code is added the code can be fine, but it exposes an existing problem. This is rare, but when it happens it becomes really hard to track down.

ekluzek avatar Mar 05 '24 18:03 ekluzek

In a discussion at the CTSM software meeting, we decided to remove the make stest and make utest components of the makefile due to redundancy with the python tests, so we will not resolve the make all issue mentioned above. The makefile changes will be done in a separate PR. The python tests were successful, so this should be good to merge in. @ekluzek I made some comments to your suggestions above.

TeaganKing avatar Mar 05 '24 22:03 TeaganKing

Thanks Erik. Just to put everything in one place, I will describe all of the unresolved conversations here.

Testing-related issues which we decided to leave as is for now: In neon_site.py, we could add an error check that the user_mods_dirs directory is valid. TowerSite.str could be unit tested, although it's probably not essential at the moment. TowerSite.get_batch_query and set_ref_case could be unit tested.

Testing outside the scope of this work: We should generally add a utility for removing a directory tree since this problem seems to be repeatedly causing errors. This should be a new issue ticket that will probably also be relevant in other parts of CTSM, in addition to TowerSite.build_base_case and TowerSite.run_case.

Further Refactoring: The TowerSite.build_base_case section that starts with with Case(case_path, read_only=False) as case: could be brought out into a separate method. run_case should also be broken down further in the future.

TeaganKing avatar Mar 06 '24 17:03 TeaganKing

Cool, thanks @TeaganKing. I want to get this into b4b-dev so it'll be on the next tag there that we move to main. Which means I need to get it on by tomorrow. So I'll just open those issues talked about above, and close the conversations and all that.

I also did some work on the branch to deal with the temporary directory issue, so I'll check that in. It didn't solve it, but should improve the future. So I might as well check that in here.

Thanks for the work on this. Looking forward to the next piece here.

ekluzek avatar Mar 06 '24 17:03 ekluzek

Awesome, thank you Erik!

TeaganKing avatar Mar 06 '24 17:03 TeaganKing

python testing is good and clm_pymods tests on Derecho pass. So merging in.

ekluzek avatar Mar 19 '24 23:03 ekluzek