CTSM Have fsurdat_modifier system test use conda environment rather than ncar

Description of changes

Use the manage_python_env script to setup the conda environment and use it rather than ncar_pylib

Specific notes

Contributors other than yourself, if any: None

CTSM Issues Fixed (include github issue #): Fixes #1786

Are answers expected to change (and if so in what way)? No

Any User Interface Changes (namelist or namelist defaults changes)? No

Testing performed, if any: FSURDATMODIFYCTSM_D_Mmpi-serial_Ld1.5x5_amazon.I2000Clm50SpRs.cheyenne_intel runs and PASSes on cheyenne

Jul 06 '22 23:07 ekluzek

Thanks for working on this. It's not as pretty or simple as I hoped, but it's still an improvement over what we had before - and I really hope any such mechanism can go away before too long.

Can you confirm two things:

Does this work both for a user's shell being csh and bash? (given the shell-specific logic you have here)
Does it work to run FSURDATMODIFYCTSM as one of a suite of tests? I'm both wanting to confirm that earlier tests don't mess anything up about the environment that would interfere with this one (which I think I've seen before) and that this one doesn't mess anything up with the environment that could impact following tests.

Jul 07 '22 04:07 billsacks

I think I can improve on what I have here, so I'll try to do that. But, I also really like your suggestions on tests as well, so I'll do that.

To simplify this further we'll need conda to be loaded on machines in ccs_config/cime rather than python. That should be OK as when you activate an environment with conda you get a python version with it. But, we've got to convince CESM people that doing that would be OK.

Jul 07 '22 07:07 ekluzek

I'm not sure it's worth spending much more time on this. (Sorry: my last comment probably suggested that it would be good to improve this, but I don't really think that should be a priority.) The vision we have discussed in CSEG is that users will load an appropriate python environment before doing anything with CESM / CIME. We don't want CIME to be responsible for controlling the python environment. Once we are comfortable with that (maybe the switch to derecho will be a good time???), I believe we can remove any 'module load python' from CIME. (I don't think we want a conda load in ccs_config/cime: I think we want to leave this up to the user - at least based on previous discussions.) So I feel like a less-than-elegant solution is okay for now as long as it works, and then we can hopefully remove this soon.

Jul 07 '22 13:07 billsacks

Also: I appreciate that this new solution works on multiple machines, not just cheyenne.

Jul 07 '22 13:07 billsacks

One other thing to confirm, if you haven't already, is: if there are failures in various aspects of the subprocess call, make sure that the test stops at that point with a FAIL result noted in the TestStatus file. You should definitely confirm this is the case for a possible failure in running fsurdat_modifier (e.g., deliberately introduce an error there and make sure the test aborts appropriately), but I'd also want to know that a failure in an earlier step in the subprocess call (where you have strung a bunch of commands together) correctly results in an error being reported. (I see you use check=True, but I think that will only work right if you get an error return code from the subprocess. Given the complexity of the subprocess call, it isn't immediately obvious to me that that is guaranteed to be the case.)

Jul 07 '22 18:07 billsacks

Thanks for working on this. It's not as pretty or simple as I hoped, but it's still an improvement over what we had before

Thanks, @ekluzek It's not a lot of code, but I doubt that I would have figured this out without help :-) Let me know if you'd like me to try any or all of the testing that @billsacks recommended.

Jul 11 '22 21:07 slevis-lmwg

Oh, I had forgotten about this one too. Thanks @billsacks

Dec 14 '22 19:12 ekluzek

@slevisconsulting I'm not going to be able to get back to working on this one for awhile, I wonder if you shouldn't merge this into your PR and shepherd bringing it in to CTSM main?

Dec 15 '22 16:12 ekluzek

@slevisconsulting I'm not going to be able to get back to working on this one for awhile, I wonder if you shouldn't merge this into your PR and shepherd bringing it in to CTSM main?

@ekluzek looks as though I can push directly to your PR, so I will work here for now, and we will likely close the redundant PR that I opened (#1925)

Dec 15 '22 21:12 slevis-lmwg

@slevisconsulting I think some of the documentation changes you have in #1925, might still apply. So you could merge this into that, or the other way around, whichever you prefer. I'm glad you'll be able to work on it though, it would be nice to see this come in.

Dec 15 '22 21:12 ekluzek

Out of the box ./create_test FSURDATMODIFYCTSM_D_Mmpi-serial_Ld1.5x5_amazon.I2000Clm50SpRs.cheyenne_intel -c /glade/p/cgd/tss/ctsm_baselines/ctsm5.1.dev115 fails with

Lmod has detected the following error: The following module(s) are unkno                                                                                                                                  wn:
        "lang/python"

        Please check the spelling or version number. Also try "module spider ...                                                                                                                                  "
        It is also possible your cache file is out-of-date; it may help to try:
          $ module --ignore-cache load "lang/python"

        Also make sure that all modulefiles written in TCL start with the string
        #%Module



        python3: can't open file '/glade/work/slevis/git/mksurfdata_toolchain/to                                                                                                                                  ols/modify_fsurdat/fsurdat_modifier': [Errno 2] No such file or directory
        Traceback (most recent call last):
          File "./case.build", line 256, in <module>
            _main_func(__doc__)
          File "./case.build", line 216, in _main_func
            test = find_system_test(testname, case)(case)
          File "/glade/work/slevis/git/mksurfdata_toolchain/cime_config/SystemTe                                                                                                                                  sts/fsurdatmodifyctsm.py", line 44, in __init__
            self._run_modify_fsurdat()
          File "/glade/work/slevis/git/mksurfdata_toolchain/cime_config/SystemTe                                                                                                                                  sts/fsurdatmodifyctsm.py", line 74, in _run_modify_fsurdat
            subprocess.run( conda_env+"python3 "+tool_path+" "+self._cfg_file_pa                                                                                                                                  th, shell=True, check=True)
          File "/glade/u/apps/ch/opt/python/3.7.9/gnu/9.1.0/lib/python3.7/subpro                                                                                                                                  cess.py", line 512, in run
            output=stdout, stderr=stderr)
        subprocess.CalledProcessError: Command '. /glade/scratch/slevis/FSURDATM                                                                                                                                  ODIFYCTSM_D_Mmpi-serial_Ld1.5x5_amazon.I2000Clm50SpRs.cheyenne_intel.C.20221215_                                                                                                                                  143118_58x138/.env_mach_specific.sh; module unload python; module load conda; mo                                                                                                                                  dule load lang/python; /glade/work/slevis/git/mksurfdata_toolchain/py_env_create                                                                                                                                   && conda activate ctsm_pylib && python3 /glade/work/slevis/git/mksurfdata_toolc                                                                                                                                  hain/tools/modify_fsurdat/fsurdat_modifier /glade/scratch/slevis/FSURDATMODIFYCT                                                                                                                                  SM_D_Mmpi-serial_Ld1.5x5_amazon.I2000Clm50SpRs.cheyenne_intel.C.20221215_143118_                                                                                                                                  58x138/modify_fsurdat.cfg' returned non-zero exit status 2.

In the last line, I think I need to remove the module load lang/python based on how I got fsurdat_modifier to work in interactive mode.

Dec 15 '22 21:12 slevis-lmwg

Previous test passed. Next testing the full clm_pymods test-suite ./run_sys_tests -s clm_pymods -c ctsm5.1.dev115 --testroot-base /glade/scratch/slevis --skip-generate

Dec 15 '22 22:12 slevis-lmwg

In the test-suite, the same test stopped with: FAIL FSURDATMODIFYCTSM_D_Mmpi-serial_Ld1.5x5_amazon.I2000Clm50SpRs.cheyenne_intel SHAREDLIB_BUILD failed to initialize The TestStatus.log suggested a retry, but I got the same error a second time:

CondaHTTPError: HTTP 000 CONNECTION FAILED for url <https://conda.anaconda.org/ncar/linux-64/current_repodata.json>
Elapsed: -

An HTTP error occurred when trying to retrieve this URL.
HTTP errors are often intermittent, and a simple retry will get you on your way.
'https//conda.anaconda.org/ncar/linux-64'


Traceback (most recent call last):
  File "./case.build", line 256, in <module>
    _main_func(__doc__)
  File "./case.build", line 216, in _main_func
    test = find_system_test(testname, case)(case)
  File "/glade/work/slevis/git/mksurfdata_toolchain/cime_config/SystemTests/fsurdatmodifyctsm.py", line 44, in __init__
    self._run_modify_fsurdat()
  File "/glade/work/slevis/git/mksurfdata_toolchain/cime_config/SystemTests/fsurdatmodifyctsm.py", line 74, in _run_modify_fsurdat
    subprocess.run( conda_env+"python3 "+tool_path+" "+self._cfg_file_path, shell=True, check=True)
  File "/glade/u/apps/ch/opt/python/3.7.9/gnu/9.1.0/lib/python3.7/subprocess.py", line 512, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '. /glade/scratch/slevis/tests_1215-151410ch/FSURDATMODIFYCTSM_D_Mmpi-serial_Ld1.5x5_amazon.I2000Clm50SpRs.cheyenne_intel.C.1215-151410ch_int/.env_mach_specific.sh; module unload python; module load conda; /glade/work/slevis/git/mksurfdata_toolchain/py_env_create && conda activate ctsm_pylib && python3 /glade/work/slevis/git/mksurfdata_toolchain/tools/modify_input_files/fsurdat_modifier /glade/scratch/slevis/tests_1215-151410ch/FSURDATMODIFYCTSM_D_Mmpi-serial_Ld1.5x5_amazon.I2000Clm50SpRs.cheyenne_intel.C.1215-151410ch_int/modify_fsurdat.cfg' returned non-zero exit status 254.

@billsacks you alluded to the possibility of encountering problems in the test-suite, though this test is one of two in clm_pymods and it runs first. I have no ideas how to solve this and would welcome suggestions.

Dec 15 '22 22:12 slevis-lmwg

Just in case, I reran the whole clm_pymods test-suite, because earlier I just reran the failed test like this: ./run_sys_tests -s clm_pymods --rerun-existing-failures --testid-base 1215-151410ch -c /glade/p/cgd/tss/ctsm_baselines/ctsm5.1.dev115 --skip-generate However, I get the same error...

Dec 15 '22 22:12 slevis-lmwg

This sounds like a different issue than what I was concerned about (in terms of running in a test suite). My guess is that you're encountering this when running via run_sys_tests because run_sys_tests does the test builds on the compute nodes, whereas I'm guessing that, with your first test, you did the build on the login node. You could verify this by getting an interactive session on a compute node and trying to run the relevant conda commands from there. I'm guessing that the compute nodes don't have the same access permissions to the outside world as the login nodes do.

As for how to solve this, my initial thoughts are: You could first email CISL help ([email protected]), asking them if they can change some permissions to allow this (or if they have thoughts about a way to get this to work from the compute nodes).

If that won't work, then I wonder if we could do something like have run_sys_tests set up the appropriate conda environment for you, and then just have the tests load the already-set-up conda environment. My main concern with that approach is that, if this setting up of the conda environment takes more than a few seconds, I would ideally want this done in a way that doesn't significantly delay the starting of the system tests – especially in cases where you're just running one or a few tests that don't need this conda environment, so this delay would just be a pain.

If setting up the conda environment takes a while and we can't come up with a way to avoid having this setup delay the start of the system tests, then maybe we need to consider requiring anyone running these tests to have the appropriate conda environment already created. I don't like that that makes running the tests harder, but maybe that's our only viable option.

Dec 15 '22 23:12 billsacks

If that won't work, then I wonder if we could do something like have run_sys_tests set up the appropriate conda environment for you, and then just have the tests load the already-set-up conda environment. My main concern with that approach is that, if this setting up of the conda environment takes more than a few seconds, I would ideally want this done in a way that doesn't significantly delay the starting of the system tests – especially in cases where you're just running one or a few tests that don't need this conda environment, so this delay would just be a pain.

If setting up the conda environment takes a while and we can't come up with a way to avoid having this setup delay the start of the system tests, then maybe we need to consider requiring anyone running these tests to have the appropriate conda environment already created. I don't like that that makes running the tests harder, but maybe that's our only viable option.

Or maybe we could do a combination of these two options: If run_sys_tests sees that you already have an appropriately-named conda environment, then it won't do anything; but if you don't yet have one, then it creates a conda environment for you. That way first time users of run_sys_tests will get a CTSM conda environment created for them automatically, and after that it will reuse the existing environment. Major caveat that this is coming from someone with almost no experience with conda environments, so it may not actually work the way I'm imagining it could....

Dec 15 '22 23:12 billsacks

I will start with an email to [email protected] and we'll see how it goes.

Dec 15 '22 23:12 slevis-lmwg

@billsacks the one problem with just checking that a given conda environment exists is that we do expect the conda environment to be updated (hopefully infrequently). So the only way to be sure is to construct the environment each time. You might be right that maybe run_sys_test should create the environment and then system tests just use it though. It does take an annoying amount of time to create the environment, so you wouldn't want to unnecessarily run conda all the time. Maybe one way to do it would be to have run_sys_tests to create the conda environment by default, but have an option to skip it with a command line option. Since, run_sys_tests runs on the login nodes this also bypasses needing to get this to work on the compute nodes. If it's a problem for cheyenne, I imagine it could be a problem on other machines as well.

Dec 15 '22 23:12 ekluzek

I'm ready to email [email protected] but I will hold off in case we have an acceptable solution in the above posts. @ekluzek how about we take a few minutes at the end of the SLIM stand-up tomorrow to discuss?

Dec 15 '22 23:12 slevis-lmwg

the one problem with just checking that a given conda environment exists is that we do expect the conda environment to be updated (hopefully infrequently). So the only way to be sure is to construct the environment each time. You might be right that maybe run_sys_test should create the environment and then system tests just use it though. It does take an annoying amount of time to create the environment, so you wouldn't want to unnecessarily run conda all the time. Maybe one way to do it would be to have run_sys_tests to create the conda environment by default, but have an option to skip it with a command line option. Since, run_sys_tests runs on the login nodes this also bypasses needing to get this to work on the compute nodes. If it's a problem for cheyenne, I imagine it could be a problem on other machines as well.

My hope would be that we aren't so tied to specific versions of python packages that we often need to update the ctsm python environment. Can we start with something simple like using the environment if it exists and creating it if it doesn't, and leave it up to the user to update this environment if it's ever needed?

But this also feels like it's getting messy. Something I seem to remember from discussions in CSEG meetings is the suggestion that, at least on our standard platforms (cheyenne & izumi), we should try to get the appropriate python environments supported at the system level. Have you looked into whether the environment that you get from "conda activate npl" contains everything we need? If it doesn't, would it be feasible to ask CISL to install any missing packages – or try to remove those dependencies from our python code if they are non-standard enough to not be included in npl? Then for testing we could use that system-level environment? I don't know, maybe that creates as many problems as it solves, but this management of python environments is just feeling like such a headache.

Dec 16 '22 00:12 billsacks

I think checking for the environment and using it does make sense. That means they've at least run py_env_create once. And the environment isn't changing that often, so it's probably OK most of the time. And if there is a missing package it will abort, which would be the most common thing. That would just mean that they need to rerun py_env_create again on the login node. Just activating the environment is also fast, so that would have reasonable performance.

On using npl -- I actually think we should NOT do that as I view it as a requirement to get our tools to work on all systems and not just NCAR systems. The advantage of using conda is that we can provide an environment that someone can use for ANY machine. We will only test on cheyenne and izumi, but in principle the conda environment will allow you to work on any machine. Since, we do want to support the university community as well as people using any of the supported CESM machines in ccs_config I think this makes sense as a requirement.

Dec 16 '22 15:12 ekluzek

@ekluzek and I met and agreed on the following:

This test should stop using py_env_create.
I will add graceful aborts for when conda doesn't exist and when ctsm_pylib doesn't exist telling the user that they need to first manually install conda, unload python, load conda, and run py_env_create for this test to work.
I will add text when the tool itself fails, suggesting that the user's ctsm_pylib environment may be out-of-date, and that running py_env_create may resolve the issue.

Dec 16 '22 19:12 slevis-lmwg

@slevisconsulting note I did update the name of the conda environment to ctsm_pylib in this PR, so be sure to use it.

Dec 16 '22 21:12 ekluzek

This works now, though I haven't found a way to separate the single subprocess that

loads conda and activates the ctsm_pylib environment AND
runs the fsurdat_modifier tool

into two or more subprocesses without losing access to the ctsm_pylib environment. So instead I have written a more elaborate error message with the various possible scenarios leading to an error in the subprocess. I have tested it by commenting out the "conda activate" line and confirming that I see my error message in TestStatus.log.

If that's acceptable because we're still only looking at an interim solution, then I think that this PR is ready for pre-merge preparations:

I need to make sure pylint passes
Should I run aux_clm? I don't know if any of the modified files affect anything outside of clm_pymods...

Dec 16 '22 23:12 slevis-lmwg

Okay, all sounds good - thank you @slevisconsulting and @ekluzek . It might be good to confirm that this works in the context of a larger test suite, if you haven't already. This wouldn't necessarily need the full aux_clm, but you could create a text-based testlist file containing this test and a few others – with one or more tests before this one and one or more after it – to make sure this doesn't interfere with the python environment loaded by the test system and vice versa.

Dec 16 '22 23:12 billsacks

Okay, all sounds good - thank you @slevisconsulting and @ekluzek . It might be good to confirm that this works in the context of a larger test suite, if you haven't already. This wouldn't necessarily need the full aux_clm, but you could create a text-based testlist file containing this test and a few others – with one or more tests before this one and one or more after it – to make sure this doesn't interfere with the python environment loaded by the test system and vice versa.

I ran aux_clm on cheyenne and it passed other than expected failures. I did not get motivated to try a custom list of tests :-)

Dec 19 '22 22:12 slevis-lmwg

I ran aux_clm on cheyenne and it passed other than expected failures. I did not get motivated to try a custom list of tests :-)

Awesome, thanks @slevisconsulting

Dec 19 '22 22:12 billsacks

Tested the latest commit in clm_pymods, as well as with ./create_test and both passed. @ekluzek please let me know if you agree with my implementation.

Dec 20 '22 01:12 slevis-lmwg

@ekluzek and @slevisconsulting agreed this morning (2023/1/24) that this PR will likely get merged/tagged as dev118, pending successful testing once I update to dev117.

Jan 24 '23 22:01 slevis-lmwg

@ekluzek I think something has changed in my permissions here. I updated to dev117, but the push to the PR has been failing. If nothing has changed on your side, we may need to troubleshoot tomorrow morning after Stand-up.

Feb 03 '23 00:02 slevis-lmwg

Have fsurdat_modifier system test use conda environment rather than ncar_pylib

Description of changes

Specific notes