NUOPC restart change
From email from Jim Edwards:
In PR https://github.com/ESCOMP/CMEPS/pull/221 I added support for the REST_OPTION=end and for writing restart files at the end of the run in any case except when REST_OPTION = none or never.
Each component model will need updates in the xxx_comp_nuopc.F90 file order to implement this change. Please consider making this change in your next component tag. Let me know if you have any questions or concerns. thanks
Here are the changes I made in the ctsm lnd_comp_nuopc.F90 file:
(Declare a module variable)
logical :: write_restart_at_endofrun = .false.
(Initialize the variable in InitializeRealize)
+ call NUOPC_CompAttributeGet(gcomp, name="write_restart_at_endofrun", value=cvalue, isPresent=isPresent, isSet=isSet, rc=rc)
+ if (ChkErr(rc,__LINE__,u_FILE_u)) return
+ if (isPresent .and. isSet) then
+ if (trim(cvalue) .eq. '.true.') write_restart_at_endofrun = .true.
+ end if
(Check the variable in ModelAdvance - Note here that this comes after the alarm_stop)
!--------------------------------
! Determine if time to stop
!--------------------------------
call ESMF_ClockGetAlarm(clock, alarmname='alarm_stop', alarm=alarm, rc=rc)
if (ChkErr(rc,__LINE__,u_FILE_u)) return
if (ESMF_AlarmIsRinging(alarm, rc=rc)) then
if (ChkErr(rc,__LINE__,u_FILE_u)) return
nlend = .true.
call ESMF_AlarmRingerOff( alarm, rc=rc )
if (ChkErr(rc,__LINE__,u_FILE_u)) return
else
nlend = .false.
endif
!--------------------------------
! Determine if time to write restart
!--------------------------------
rstwr = .false.
if (nlend .and. write_restart_at_endofrun) then
rstwr = .true.
else
call ESMF_ClockGetAlarm(clock, alarmname='alarm_restart', alarm=alarm, rc=rc)
if (ChkErr(rc,__LINE__,u_FILE_u)) return
if (ESMF_AlarmIsCreated(alarm, rc=rc)) then
if (ESMF_AlarmIsRinging(alarm, rc=rc)) then
if (ChkErr(rc,__LINE__,u_FILE_u)) return
rstwr = .true.
call ESMF_AlarmRingerOff( alarm, rc=rc )
if (ChkErr(rc,__LINE__,u_FILE_u)) return
endif
endif
end if
@cacraigucar, I tried to implement these changes. However the changes cause restarts to fail when the cmeps mediator is used. I don't have time to deal with this. Can this issue be assigned to someone else?
@fvitt - Yes, you do not need to include this in your PR
@jedwards4b - Francis notes that these changes cause restart failures. Any ideas?
From email from @jedwards4b
The ERS test is broken by this change, I believe that it's a failing of this test and maybe a couple of others that needs to be corrected, however if you are using cmeps0.13.23 or newer it should work.
I know that implementing this change was outside of Francis's assignment but providing minimal information about what component versions were used and what test(s) failed would be really helpful.
From email from @fvitt
Maybe the cmeps tag was too old. My sandbox was using cmeps0.13.13
These are the restart tests I tried: ERP_D_Vnuopc_Ln9.f09_f09_mg17.F2000climo.cheyenne_intel.cam-outfrq9s ERP_Vnuopc_Ln9.f09_f09_mg17.FC2000climo.cheyenne_intel.cam-outfrq9s
Should have been closed by #423 (but bad formatting above, oops).
This didn't actually get fixed as it was claimed. So this still needs to happen.
In case it's helpful here's how we did this in MOSART:
https://github.com/ESCOMP/MOSART/pull/124
@peverwhee @cacraigucar I'll let you decide milestone and priority and all that for this.
@ekluzek I think that if you are going to reopen an old issue like this you document the conditions required to reproduce the problem you encountered and the model version you used. This ticket was originally closed 3 years ago.
@ekluzek I think that if you are going to reopen an old issue like this you document the conditions required to reproduce the problem you encountered and the model version you used. This ticket was originally closed 3 years ago.
@jedwards4b I do realize this was closed several years ago. But, from what I could tell it shouldn't have been closed.
To describe more on how I came to that conclusion. What I did was to search on "write_restart_at_endofrun" in CAM and found it nowhere in the code. I then looked at atm_comp_nuopc to see how the restart writing is done, which is determined by the rstwr variable in ModelAdvance. That is ONLY set to .true. if the restart alarm is ringing, and there is no logic for it to be also set when the driver write_restart_at_endofrun is set. So I opened this by doing code inspection.
I also looked into the history of this a little bit and it looked like this change was intended to be included in a PR about changing the testing to NUOPC, in #423 -- but wasn't actually done. I assumed it was included as part of a larger PR and was thought to be completed, but actually wasn't. That is certainly something I've done before. I looked a little more and there are several PR's that work on the change in testing to NUOPC, and the ones I looked at didn't include changes that would address this. We also talked about this a bit at the CSEG meeting.
You are right that I didn't run a test to show this -- but I didn't feel that was something I should be responsible for. And figure that is something that someone that will actually fix this might do.
But, also to ask repository owners about how I should've handled this -- @cacraigucar what is your preference for what I should do in the future in cases like this? Should I let someone from CAM actually reopen the issue? There's several different ways this could be handled that I can think of. I'm happy to comply with guidelines that you give me...
I just ran test SMS.ne30pg3_ne30pg3_mt232.FHISTC_Ltso.derecho_intel using cam6_4_107 I set REST_OPTION=END and ran once. I see that the restart files are correctly written at the end of run. I then set CONTINUE_RUN=True and ran again. It correctly restarted and ran another 5 days. So I find that I must once again request that you tell me exactly what the problem is that you think you see @ekluzek?
I'm not sure if there is something which needs to be done or not. Francis had reported several years ago that a couple of restart tests failed, and I ran ERP_D_Vnuopc_Ln9.f09_f09_mg17.F2000climo.derecho_intel.cam-outfrq9s on cam6_4_107 and it works fine. Is this because we haven't included the code which Jim indicated we should add (listed at this top of this issue)? Is this still something which needs to be done?
Also, @ekluzek asked how this should be handled in the future, if he sees a problem in CAM. I would suggest that a new issue be opened detailing what the problem is and a link to the closed issue can be included in the description.
I just ran test SMS.ne30pg3_ne30pg3_mt232.FHISTC_Ltso.derecho_intel using cam6_4_107 I set REST_OPTION=END and ran once. I see that the restart files are correctly written at the end of run. I then set CONTINUE_RUN=True and ran again. It correctly restarted and ran another 5 days. So I find that I must once again request that you tell me exactly what the problem is that you think you see @ekluzek?
Hmmm. OK, the test you did was insufficient to show the problem -- which is if CAM listens to the driver namelist variable: write_restart_at_endofrun Which it doesn't -- because as I stated previously the code segment that is talked about in the original message is no where in CAM.
@jedwards you implicitly seem to reject that code inspection is a legitimate reason to address issues? Which if we as a group in CESM want to operate that way -- I can follow. But, in this case I argue it's a legitimate thing to do. This issue was also about adding code that you recommended every component add in -- 4 years ago, but now don't seem to want to see if that was actually done? I'm thinking you and I should setup a time to meet to hash this over a bit before continuing much here....
But, to follow the idea of showing a test that demonstrates the problem. I started with your case of SMS.ne30pg3_ne30pg3_mt232.FHISTC_Ltso.derecho_intel in cesm3.0-alphabranch (so the pre version of cesm3_0_alpha07c) which demonstrates the problem if you then add
write_restart_at_endofrun = .true.
to user_nl_cpl and run it. You'll see that it only writes out restart files for components -- other than CAM.
Here's a testcase that I did that demonstrates the problem. I setup a B case that has all active components.
/glade/derecho/scratch/erik/cesm3_0_alpha07c/cime/scripts/SMS.ne30pg3_t232_gris4.1850C_CAM70%LT_CLM60%BGC-CROP_CICE_MOM6_MOSART_CISM2%GRIS-EVOLVE_WW3.derecho_intel.20250818_204846_da54ut
and do the same thing. And as I see from code inspection all components other than: CAM, CISM, and WWW write out restart files. Since, CAM, CISM, and WWW don't have the bit of code talked about at the start of this message.
So @jedwards4b let's talk some more, but please also look at the test case above, and the CAM code and see if you don't agree with my conclusion?
Also, @ekluzek asked how this should be handled in the future, if he sees a problem in CAM. I would suggest that a new issue be opened detailing what the problem is and a link to the closed issue can be included in the description.
See my response above to see a case that fails. What I've figured out now, is that we don't have a test case that sufficiently tests the write_restart_at_endofrun namelist item. In CSEG we also had decided this isn't important enough to fix in the release -- even if it meant that ERR tests don't work. But, it turns out that they don't rely on this driver namelist item -- so they are working.
From above the reason I opened this issue again is that the code that was asked for in the start of this message -- wasn't installed into CAM. Which is why I thought it made sense to open this old issue. In the future though I'll open a new issue and reference the old one. That makes sense as a process, especially before I showed something that demonstrates the problem.
Since, I reopened this one -- another question is should we leave this open -- or should I open a new one? I'm good with either.
Thank you @ekluzek for clearly explaining the issue, that helps.