CAM icon indicating copy to clipboard operation
CAM copied to clipboard

NUOPC restart change

Open cacraigucar opened this issue 4 years ago • 15 comments

From email from Jim Edwards:

In PR https://github.com/ESCOMP/CMEPS/pull/221 I added support for the REST_OPTION=end and for writing restart files at the end of the run in any case except when REST_OPTION = none or never.

Each component model will need updates in the xxx_comp_nuopc.F90 file order to implement this change. Please consider making this change in your next component tag. Let me know if you have any questions or concerns. thanks

Here are the changes I made in the ctsm lnd_comp_nuopc.F90 file:

(Declare a module variable)

logical :: write_restart_at_endofrun = .false.

(Initialize the variable in InitializeRealize)

+    call NUOPC_CompAttributeGet(gcomp, name="write_restart_at_endofrun", value=cvalue, isPresent=isPresent, isSet=isSet, rc=rc)
+    if (ChkErr(rc,__LINE__,u_FILE_u)) return
+    if (isPresent .and. isSet) then
+       if (trim(cvalue) .eq. '.true.') write_restart_at_endofrun = .true.
+    end if

(Check the variable in ModelAdvance - Note here that this comes after the alarm_stop)

   !--------------------------------                                                                                                              
   ! Determine if time to stop                                                                                                                    
   !--------------------------------                                                                                                              
                                                                                                                                                 
   call ESMF_ClockGetAlarm(clock, alarmname='alarm_stop', alarm=alarm, rc=rc)                                                                    
   if (ChkErr(rc,__LINE__,u_FILE_u)) return                                                                                                      
                                                                                                                                                 
   if (ESMF_AlarmIsRinging(alarm, rc=rc)) then                                                                                                    
      if (ChkErr(rc,__LINE__,u_FILE_u)) return                                                                                                    
      nlend = .true.                                                                                                                              
      call ESMF_AlarmRingerOff( alarm, rc=rc )                                                                                                    
      if (ChkErr(rc,__LINE__,u_FILE_u)) return                                                                                                    
   else                                                                                                                                          
      nlend = .false.                                                                                                                            
   endif                                                                                                                                          
                                                                                                                                                 
   !--------------------------------                                                                                                              
   ! Determine if time to write restart                                                                                                          
   !--------------------------------                                                                                                              
   rstwr = .false.                                                                                                                                
   if (nlend .and. write_restart_at_endofrun) then                                                                                                
      rstwr = .true.                                                                                                                              
   else                                                                                                                                          
      call ESMF_ClockGetAlarm(clock, alarmname='alarm_restart', alarm=alarm, rc=rc)                                                              
      if (ChkErr(rc,__LINE__,u_FILE_u)) return                                                                                                    
      if (ESMF_AlarmIsCreated(alarm, rc=rc)) then                                                                                                
         if (ESMF_AlarmIsRinging(alarm, rc=rc)) then                                                                                              
            if (ChkErr(rc,__LINE__,u_FILE_u)) return                                                                                              
            rstwr = .true.                                                                                                                        
            call ESMF_AlarmRingerOff( alarm, rc=rc )                                                                                              
            if (ChkErr(rc,__LINE__,u_FILE_u)) return                                                                                              
         endif                                                                                                                                    
      endif                                                                                                                                      
   end if                                                                                                                                        

cacraigucar avatar Aug 06 '21 16:08 cacraigucar

@cacraigucar, I tried to implement these changes. However the changes cause restarts to fail when the cmeps mediator is used. I don't have time to deal with this. Can this issue be assigned to someone else?

fvitt avatar Aug 11 '21 14:08 fvitt

@fvitt - Yes, you do not need to include this in your PR

@jedwards4b - Francis notes that these changes cause restart failures. Any ideas?

cacraigucar avatar Aug 11 '21 18:08 cacraigucar

From email from @jedwards4b

The ERS test is broken by this change, I believe that it's a failing of this test and maybe a couple of others that needs to be corrected, however if you are using cmeps0.13.23 or newer it should work.

I know that implementing this change was outside of Francis's assignment but providing minimal information about what component versions were used and what test(s) failed would be really helpful.

cacraigucar avatar Aug 16 '21 22:08 cacraigucar

From email from @fvitt

Maybe the cmeps tag was too old. My sandbox was using cmeps0.13.13

These are the restart tests I tried: ERP_D_Vnuopc_Ln9.f09_f09_mg17.F2000climo.cheyenne_intel.cam-outfrq9s ERP_Vnuopc_Ln9.f09_f09_mg17.FC2000climo.cheyenne_intel.cam-outfrq9s

cacraigucar avatar Aug 16 '21 22:08 cacraigucar

Should have been closed by #423 (but bad formatting above, oops).

gold2718 avatar Apr 29 '22 18:04 gold2718

This didn't actually get fixed as it was claimed. So this still needs to happen.

ekluzek avatar Aug 06 '25 06:08 ekluzek

In case it's helpful here's how we did this in MOSART:

https://github.com/ESCOMP/MOSART/pull/124

@peverwhee @cacraigucar I'll let you decide milestone and priority and all that for this.

ekluzek avatar Aug 06 '25 06:08 ekluzek

@ekluzek I think that if you are going to reopen an old issue like this you document the conditions required to reproduce the problem you encountered and the model version you used. This ticket was originally closed 3 years ago.

jedwards4b avatar Aug 12 '25 15:08 jedwards4b

@ekluzek I think that if you are going to reopen an old issue like this you document the conditions required to reproduce the problem you encountered and the model version you used. This ticket was originally closed 3 years ago.

@jedwards4b I do realize this was closed several years ago. But, from what I could tell it shouldn't have been closed.

To describe more on how I came to that conclusion. What I did was to search on "write_restart_at_endofrun" in CAM and found it nowhere in the code. I then looked at atm_comp_nuopc to see how the restart writing is done, which is determined by the rstwr variable in ModelAdvance. That is ONLY set to .true. if the restart alarm is ringing, and there is no logic for it to be also set when the driver write_restart_at_endofrun is set. So I opened this by doing code inspection.

I also looked into the history of this a little bit and it looked like this change was intended to be included in a PR about changing the testing to NUOPC, in #423 -- but wasn't actually done. I assumed it was included as part of a larger PR and was thought to be completed, but actually wasn't. That is certainly something I've done before. I looked a little more and there are several PR's that work on the change in testing to NUOPC, and the ones I looked at didn't include changes that would address this. We also talked about this a bit at the CSEG meeting.

You are right that I didn't run a test to show this -- but I didn't feel that was something I should be responsible for. And figure that is something that someone that will actually fix this might do.

But, also to ask repository owners about how I should've handled this -- @cacraigucar what is your preference for what I should do in the future in cases like this? Should I let someone from CAM actually reopen the issue? There's several different ways this could be handled that I can think of. I'm happy to comply with guidelines that you give me...

ekluzek avatar Aug 15 '25 18:08 ekluzek

I just ran test SMS.ne30pg3_ne30pg3_mt232.FHISTC_Ltso.derecho_intel using cam6_4_107 I set REST_OPTION=END and ran once. I see that the restart files are correctly written at the end of run. I then set CONTINUE_RUN=True and ran again. It correctly restarted and ran another 5 days. So I find that I must once again request that you tell me exactly what the problem is that you think you see @ekluzek?

jedwards4b avatar Aug 18 '25 14:08 jedwards4b

I'm not sure if there is something which needs to be done or not. Francis had reported several years ago that a couple of restart tests failed, and I ran ERP_D_Vnuopc_Ln9.f09_f09_mg17.F2000climo.derecho_intel.cam-outfrq9s on cam6_4_107 and it works fine. Is this because we haven't included the code which Jim indicated we should add (listed at this top of this issue)? Is this still something which needs to be done?

cacraigucar avatar Aug 18 '25 22:08 cacraigucar

Also, @ekluzek asked how this should be handled in the future, if he sees a problem in CAM. I would suggest that a new issue be opened detailing what the problem is and a link to the closed issue can be included in the description.

cacraigucar avatar Aug 18 '25 22:08 cacraigucar

I just ran test SMS.ne30pg3_ne30pg3_mt232.FHISTC_Ltso.derecho_intel using cam6_4_107 I set REST_OPTION=END and ran once. I see that the restart files are correctly written at the end of run. I then set CONTINUE_RUN=True and ran again. It correctly restarted and ran another 5 days. So I find that I must once again request that you tell me exactly what the problem is that you think you see @ekluzek?

Hmmm. OK, the test you did was insufficient to show the problem -- which is if CAM listens to the driver namelist variable: write_restart_at_endofrun Which it doesn't -- because as I stated previously the code segment that is talked about in the original message is no where in CAM.

@jedwards you implicitly seem to reject that code inspection is a legitimate reason to address issues? Which if we as a group in CESM want to operate that way -- I can follow. But, in this case I argue it's a legitimate thing to do. This issue was also about adding code that you recommended every component add in -- 4 years ago, but now don't seem to want to see if that was actually done? I'm thinking you and I should setup a time to meet to hash this over a bit before continuing much here....

But, to follow the idea of showing a test that demonstrates the problem. I started with your case of SMS.ne30pg3_ne30pg3_mt232.FHISTC_Ltso.derecho_intel in cesm3.0-alphabranch (so the pre version of cesm3_0_alpha07c) which demonstrates the problem if you then add

write_restart_at_endofrun = .true.

to user_nl_cpl and run it. You'll see that it only writes out restart files for components -- other than CAM.

Here's a testcase that I did that demonstrates the problem. I setup a B case that has all active components.

/glade/derecho/scratch/erik/cesm3_0_alpha07c/cime/scripts/SMS.ne30pg3_t232_gris4.1850C_CAM70%LT_CLM60%BGC-CROP_CICE_MOM6_MOSART_CISM2%GRIS-EVOLVE_WW3.derecho_intel.20250818_204846_da54ut

and do the same thing. And as I see from code inspection all components other than: CAM, CISM, and WWW write out restart files. Since, CAM, CISM, and WWW don't have the bit of code talked about at the start of this message.

So @jedwards4b let's talk some more, but please also look at the test case above, and the CAM code and see if you don't agree with my conclusion?

ekluzek avatar Aug 19 '25 05:08 ekluzek

Also, @ekluzek asked how this should be handled in the future, if he sees a problem in CAM. I would suggest that a new issue be opened detailing what the problem is and a link to the closed issue can be included in the description.

See my response above to see a case that fails. What I've figured out now, is that we don't have a test case that sufficiently tests the write_restart_at_endofrun namelist item. In CSEG we also had decided this isn't important enough to fix in the release -- even if it meant that ERR tests don't work. But, it turns out that they don't rely on this driver namelist item -- so they are working.

From above the reason I opened this issue again is that the code that was asked for in the start of this message -- wasn't installed into CAM. Which is why I thought it made sense to open this old issue. In the future though I'll open a new issue and reference the old one. That makes sense as a process, especially before I showed something that demonstrates the problem.

Since, I reopened this one -- another question is should we leave this open -- or should I open a new one? I'm good with either.

ekluzek avatar Aug 19 '25 05:08 ekluzek

Thank you @ekluzek for clearly explaining the issue, that helps.

jedwards4b avatar Aug 19 '25 12:08 jedwards4b