Resubmission fails on Betzy
Describe the bug Automatic resubmission of simulations fails on Betzy in NorESM2.5 alpha02.
- NorESM version: noresm2_5_alpha02_v3
- HPC platform: Betzy
- Compiler (if applicable):
- Compset (if applicable):
- Resolution (if applicable):
- Error message (if applicable):
To Reproduce Steps to reproduce the behavior:
- set RESUBMIT = 1
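With a case already set up, the standard way to request resubmission is via `xmlchange` in the case directory (a sketch; `$CASEROOT` is a placeholder for the actual case path):

```
cd $CASEROOT
./xmlchange RESUBMIT=1
./case.submit
```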
Expected behavior I would expect the simulation to be automatically resubmitted. This was an issue in earlier versions of NorESM2 (up to at least NorESM2.0.3) and it was fixed, but the fix is not mentioned in the documentation: https://noresm-docs.readthedocs.io/en/noresm2/access/releases_noresm20.html
We have added the following lines in scripts/lib/CIME/case/case_st_archive.py:

```python
os.unsetenv('SLURM_MEM_PER_GPU')
os.unsetenv('SLURM_MEM_PER_CPU')
```

which solved this problem earlier. I will check it; NorESM2.5 probably does not have this change, or it may require only `os.unsetenv('SLURM_MEM_PER_CPU')`, or there might be a different way to eliminate the problem.
@mvertens please check this commit: https://github.com/NorESMhub/cime/commit/b4e36e71f6351183c033a8e248ac5b4912ec2686#diff-23863a14bd10dceddea483ea7c55fc9e902a619cf2fa32d31f045c7063378e5b
Only scripts/lib/CIME/case/case_st_archive.py needs to be modified, unsetting these two variables:

```python
if self.get_value("MACH") == "betzy":
    logger.info("remove environment variable")
    os.unsetenv('SLURM_MEM_PER_GPU')
    os.unsetenv('SLURM_MEM_PER_CPU')
self.submit(resubmit=True)
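The workaround relies on `os.unsetenv`. A self-contained sketch of the same idea (the helper name is hypothetical, not part of the actual CIME patch) that also keeps `os.environ` consistent:

```python
import os

def clear_slurm_mem_vars():
    """Remove SLURM memory limits inherited from the parent job so the
    resubmitted job gets its own allocation (illustrative sketch only).

    os.environ.pop() both removes the key from the mapping and calls
    os.unsetenv() under the hood, so child processes will not see it.
    """
    for var in ("SLURM_MEM_PER_GPU", "SLURM_MEM_PER_CPU"):
        os.environ.pop(var, None)

# Demonstrate with a dummy value as a batch scheduler would set it
os.environ["SLURM_MEM_PER_CPU"] = "2000"
clear_slurm_mem_vars()
print("SLURM_MEM_PER_CPU" in os.environ)  # False
```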
Using RESUBMIT I get the following when DOUT_S is TRUE:
```
Traceback (most recent call last):
  File "/var/spool/slurmd/job970422/slurm_script", line 106, in <module>
    _main_func(__doc__)
  File "/var/spool/slurmd/job970422/slurm_script", line 97, in _main_func
    success = case.case_st_archive(last_date_str=last_date,
  File "/cluster/projects/nn10013k/mvertens/src/noresm2_5_alpha04_v2/cime/CIME/case/case_st_archive.py", line 1021, in case_st_archive
    os.makedirs(dout_s_root)
  File "/cluster/software/Python/3.9.6-GCCcore-11.2.0/lib/python3.9/os.py", line 215, in makedirs
    makedirs(head, exist_ok=exist_ok)
  File "/cluster/software/Python/3.9.6-GCCcore-11.2.0/lib/python3.9/os.py", line 215, in makedirs
    makedirs(head, exist_ok=exist_ok)
  File "/cluster/software/Python/3.9.6-GCCcore-11.2.0/lib/python3.9/os.py", line 215, in makedirs
    makedirs(head, exist_ok=exist_ok)
  [Previous line repeated 2 more times]
  File "/cluster/software/Python/3.9.6-GCCcore-11.2.0/lib/python3.9/os.py", line 225, in makedirs
    mkdir(name, mode)
OSError: [Errno 30] Read-only file system: '/nird'
```
I am trying to archive directly to /nird/datalake. However, I can run case.st_archive successfully from my $CASEROOT and archive to /nird/datalake without problems.
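The failure above is `os.makedirs` hitting a read-only mount. A small guard can make that failure mode explicit; this is a sketch with a hypothetical helper name, not the actual CIME code, which calls `os.makedirs(dout_s_root)` directly:

```python
import errno
import os
import tempfile

def ensure_archive_root(dout_s_root):
    """Try to create the archive root, reporting read-only mounts clearly.

    Returns True if the directory exists or was created, False if the
    target file system is read-only (EROFS, errno 30 on Linux), e.g.
    /nird mounted read-only on compute nodes.
    """
    try:
        os.makedirs(dout_s_root, exist_ok=True)
        return True
    except OSError as exc:
        if exc.errno == errno.EROFS:
            print("cannot archive to {}: read-only file system".format(dout_s_root))
            return False
        raise

# A writable scratch location succeeds
ok = ensure_archive_root(os.path.join(tempfile.mkdtemp(), "archive"))
print(ok)  # True
```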
@monsieuralok - I think you are pointing to an older version of CIME. In the CIME version used in noresm_develop, ./CIME/case/case_st_archive.py no longer contains any machine-specific references.
Is /nird/datalake mounted on the Betzy compute nodes? I thought /nird was not available there, but maybe that is outdated knowledge.
@mvertens it says "OSError: [Errno 30] Read-only file system: '/nird'", so /nird may be mounted read-only on the compute nodes. I will have to check this with Betzy support.
@monsieuralok @TomasTorsvik - that makes sense. I can run more tests to determine whether the resubmit works when I archive directly on Betzy. In my experience it never worked in my production runs, even when I was doing short-term archiving on Betzy.
@mvertens Have you tried more tests? Could you update if RESUBMIT is working or not?
@monsieuralok - with input from @mvdebolskiy the resubmit is now working. I have verified this with an ERR test. The code base I am using is not in NorESMhub yet, but is feature/noresm2_5_alpha04_v3 in my NorESM fork: https://github.com/mvertens/NorESM.git. In my production runs I wanted to archive directly on nird, but it looks like I can't do this from the compute nodes at this point.
@adagj - can this issue be closed now? Seems it has been resolved in a code update.
@TomasTorsvik - I think this issue can be closed.