Resubmission fails on Betzy

Open adagj opened this issue 1 year ago • 7 comments

Describe the bug Automatic resubmission of simulations fails on Betzy in NorESM2.5 alpha02

  • NorESM version: noresm2_5_alpha02_v3
  • HPC platform: Betzy
  • Compiler (if applicable):
  • Compset (if applicable):
  • Resolution (if applicable):
  • Error message (if applicable):

To Reproduce Steps to reproduce the behavior:

  1. set RESUBMIT = 1

Expected behavior I would expect the simulation to be automatically resubmitted. This was an issue in earlier versions of NorESM2 (up to at least NorESM2.0.3) and it was fixed, but the fix is not mentioned in the documentation: https://noresm-docs.readthedocs.io/en/noresm2/access/releases_noresm20.html

Screenshots If applicable, add screenshots to help explain your problem.

Additional context Add any other context about the problem here.

adagj avatar Jun 11 '24 09:06 adagj

We have added the following lines in scripts/lib/CIME/case/case_st_archive.py:

    os.unsetenv('SLURM_MEM_PER_GPU')
    os.unsetenv('SLURM_MEM_PER_CPU')

which solved the problem earlier. I will check; probably NorESM-2.5 does not have this change, or it requires only os.unsetenv('SLURM_MEM_PER_CPU'), or there might be a different way to eliminate the problem.

monsieuralok avatar Jun 19 '24 08:06 monsieuralok

@mvertens check it: https://github.com/NorESMhub/cime/commit/b4e36e71f6351183c033a8e248ac5b4912ec2686#diff-23863a14bd10dceddea483ea7c55fc9e902a619cf2fa32d31f045c7063378e5b

You only need to modify scripts/lib/CIME/case/case_st_archive.py to unset these two variables:

    if self.get_value("MACH") == "betzy":
        logger.info("remove environment variable")
        os.unsetenv('SLURM_MEM_PER_GPU')
        os.unsetenv('SLURM_MEM_PER_CPU')
    self.submit(resubmit=True)
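For illustration, here is a minimal standalone sketch of the same idea outside CIME: removing the two Slurm memory variables from the current process so that a child process (such as a job submission command) no longer inherits them. The helper names `clear_slurm_memory_vars` and `value_seen_by_child` are hypothetical and not part of CIME. The sketch deletes the entries from `os.environ` rather than calling `os.unsetenv` directly, because `os.unsetenv` does not update the `os.environ` mapping, while deleting from `os.environ` also unsets the variable for child processes.

```python
import os
import subprocess
import sys

# Variable names are taken from the comment above; the helper names are hypothetical.
SLURM_MEMORY_VARS = ("SLURM_MEM_PER_GPU", "SLURM_MEM_PER_CPU")


def clear_slurm_memory_vars(names=SLURM_MEMORY_VARS):
    """Remove the given variables from this process and return the names removed."""
    removed = []
    for name in names:
        # Deleting from os.environ also unsets the variable at the OS level,
        # so child processes started afterwards do not inherit it.
        if os.environ.pop(name, None) is not None:
            removed.append(name)
    return removed


def value_seen_by_child(name):
    """Start a child Python process and report the value it sees for `name`."""
    code = "import os, sys; sys.stdout.write(os.environ.get(sys.argv[1], '<unset>'))"
    result = subprocess.run(
        [sys.executable, "-c", code, name],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Calling `clear_slurm_memory_vars()` just before the resubmission step and then checking `value_seen_by_child("SLURM_MEM_PER_CPU")` is one way to confirm that the submitted job would not inherit the setting.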

monsieuralok avatar Aug 15 '24 08:08 monsieuralok

Using RESUBMIT I get the following when DOUT_S is TRUE:

Traceback (most recent call last):
  File "/var/spool/slurmd/job970422/slurm_script", line 106, in <module>
    _main_func(__doc__)
  File "/var/spool/slurmd/job970422/slurm_script", line 97, in _main_func
    success = case.case_st_archive(last_date_str=last_date,
  File "/cluster/projects/nn10013k/mvertens/src/noresm2_5_alpha04_v2/cime/CIME/case/case_st_archive.py", line 1021, in case_st_archive
    os.makedirs(dout_s_root)
  File "/cluster/software/Python/3.9.6-GCCcore-11.2.0/lib/python3.9/os.py", line 215, in makedirs
    makedirs(head, exist_ok=exist_ok)
  File "/cluster/software/Python/3.9.6-GCCcore-11.2.0/lib/python3.9/os.py", line 215, in makedirs
    makedirs(head, exist_ok=exist_ok)
  File "/cluster/software/Python/3.9.6-GCCcore-11.2.0/lib/python3.9/os.py", line 215, in makedirs
    makedirs(head, exist_ok=exist_ok)
  [Previous line repeated 2 more times]
  File "/cluster/software/Python/3.9.6-GCCcore-11.2.0/lib/python3.9/os.py", line 225, in makedirs
    mkdir(name, mode)
OSError: [Errno 30] Read-only file system: '/nird'

I am trying to archive directly to /nird/datalake. However, when I run case.st_archive by hand from my $CASEROOT, it archives to /nird/datalake with no problem.
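One way to tell the two situations apart is to check, on the node where the job actually runs, whether the archive root can be created at all. The following is a minimal sketch, not CIME code: the helper names are hypothetical, and it only assumes that the failure is the `OSError` with `errno.EROFS` (Errno 30) shown in the traceback.

```python
import errno
import os


def nearest_existing_ancestor(path):
    """Walk upwards from `path` until a path that exists is found."""
    path = os.path.abspath(path)
    while not os.path.exists(path):
        parent = os.path.dirname(path)
        if parent == path:  # reached the filesystem root
            break
        path = parent
    return path


def can_create(path):
    """Return True if the nearest existing ancestor is a writable directory."""
    ancestor = nearest_existing_ancestor(path)
    return os.path.isdir(ancestor) and os.access(ancestor, os.W_OK | os.X_OK)


def ensure_archive_root(path):
    """Create `path` if possible; return False on a read-only file system."""
    try:
        os.makedirs(path, exist_ok=True)
    except OSError as exc:
        if exc.errno == errno.EROFS:
            # Same condition as "[Errno 30] Read-only file system" above.
            return False
        raise
    return True
```

Running `can_create("/nird/datalake/...")` once on a login node and once inside a batch job would show whether the mount is writable in both places; the path here is left elided because the actual archive directory is not given in the thread.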

mvertens avatar Aug 15 '24 08:08 mvertens

@monsieuralok - I think you are pointing to an older version of CIME. In the versions of CIME being used in noresm_develop, ./CIME/case/case_st_archive.py no longer contains any references to machines.

mvertens avatar Aug 15 '24 08:08 mvertens

Is /nird/datalake mounted on the compute nodes on betzy? I thought /nird was not available, but maybe this is outdated knowledge.

TomasTorsvik avatar Aug 15 '24 08:08 TomasTorsvik

@mvertens it says "OSError: [Errno 30] Read-only file system: '/nird'". It could be that "/nird" is a read-only file system on the compute nodes, but I have to check this with Betzy.

monsieuralok avatar Aug 15 '24 08:08 monsieuralok

@monsieuralok @TomasTorsvik - that makes sense. I can try to run more tests to determine if the resubmit works when I archive directly on betzy. My experience was that it never worked on my production runs - even when I was doing short term archiving on betzy.

mvertens avatar Aug 15 '24 08:08 mvertens

@mvertens Have you run more tests? Could you update us on whether RESUBMIT is working or not?

monsieuralok avatar Aug 23 '24 12:08 monsieuralok

@monsieuralok - with input from @mvdebolskiy the resubmit is now working. I have verified this with an ERR test. The code base I am using is not in noresmhub yet - but is feature/noresm2_5_alpha04_v3 in my noresm fork - https://github.com/mvertens/NorESM.git. In my production runs I wanted to directly archive on nird - but it looks like I can't do this from the compute nodes at this point.

mvertens avatar Aug 23 '24 14:08 mvertens

@adagj - can this issue be closed now? Seems it has been resolved in a code update.

TomasTorsvik avatar Sep 10 '24 12:09 TomasTorsvik

@TomasTorsvik - I think this issue can be closed.

mvertens avatar Sep 13 '24 14:09 mvertens