CAM
CAM copied to clipboard
HEMCO restart tests fail
What happened?
When HEMCO emissions are used restarts are not bit-for-bit the same as continuous runs. Regression testing give this failure: FAIL ERP_Ln9_Vnuopc.f09_f09_mg17.FCSD_HCO.cheyenne_intel.cam-outfrq9s COMPARE_base_rest
Until this is fixed, HEMCO configurations will not be supported.
What are the steps to reproduce the bug?
ERP_Ln9_Vnuopc.f09_f09_mg17.FCSD_HCO.cheyenne_intel.cam-outfrq9s COMPARE_base_rest
What CAM tag were you using?
cam6_3_118
What machine were you running CAM on?
CISL machine (e.g. cheyenne)
What compiler were you using?
Intel
Path to a case directory, if applicable
No response
Will you be addressing this bug yourself?
No
Extra info
@jimmielin will need to provide hemco external updates to address this reproducibility / restart issue.
Following up about the HEMCO restart issue, I figured out the problem is vertical regridding of 3D emissions using instantaneous surface pressure. The specific problem is that the minimum read frequency for HEMCO is 1 hour on the hour, i.e. 0:00, 1:00, etc., but restarted runs which use a sub-hour timestep, e.g. 30 min, begin off the hour, i.e. 0:00 first run (okay), but then 3:30, 6:30, etc. I am able to eliminate differences by changing the timestep to 1 hour. This results in vertical regridding using identical pressures across all runs since no regridding occurs off the hour in any of the runs.
Solutions within HEMCO could include storage of raw file data within HEMCO to be regridded each timestep, or sub-hour reads. Neither are great from a performance point-of-view. Another option, which is simpler but would impact you, is having restarted runs begin on the hour rather than the first timestep after the hour. This is what we do in offline GEOS-Chem and within GEOS and hence we have never had this problem before. Note that all of my testing using CAM-chem not GEOS-Chem. Is there a reason you start the subsequent runs at the timestep after the time corresponding to the restart file (cam.r.*)?
To be more clear about how runs are currently split up in time within CAM, here is a breakdown of my own 30 min timestep runs.
Compset: FCHIST_HCO Config params: REST_N=3, STOP_N=3, STOP_OPTION=nhours, ATM_NCPL=48
Run A – seven full timesteps (in GEOS-Chem CTM, HEMCO standalone, and GEOS this would only be six full timesteps) nstep 0 - 00:00. <- HEMCO 3D read/regrid nstep 1 – 00:30 nstep 2 – 01:00. <- HEMCO 3D read/regrid nstep 3 – 01:30 nstep 4 – 02:00. <- HEMCO 3D read/regrid nstep 5 – 02:30 nstep 6 – 03:00 <- HEMCO 3D read/regrid nstep 7 – 03:30 (partial)
Run B (restarted) – six full timesteps nstep 7 – 03:30. <- HEMCO 3D read/regrid nstep 8 – 04:00. <- HEMCO 3D read/regrid nstep 9 – 04:30 nstep10 – 05:00 <- HEMCO 3D read/regrid nstep 11 – 05:30 nstep 12 – 06:00 <- HEMCO 3D read/regrid nstep 13 – 06:30 (partial)
All additional subsequent runs are also six full timesteps. Why does the initial run have seven? Using six would solve the HEMCO problem. It is also more intuitive for a 3 hour run with 30 minute time intervals to execute 6 full timesteps.
Another solution could be to store the last hour's edge pressures in the restart file and use those for the initial regrid upon restart. A cleaner solution, however, is to start runs that are restarted at integer number of hours on the hour rather than offset.
That's great news that you tracked down this issue, @lizziel.
@brian-eaton I don't remember why the model integrates for a half time-step past STOP_N. Can you refresh my memory? Thanks!
In case it helps, here is what the execution would look like to fix the problem:
Run A – six full timesteps nstep 0 - 00:00 <- HEMCO 3D read/regrid nstep 1 – 00:30 nstep 2 – 01:00 <- HEMCO 3D read/regrid nstep 3 – 01:30 nstep 4 – 02:00 <- HEMCO 3D read/regrid nstep 5 – 02:30 nstep 6 – 03:00 (partial)
Run B (restarted) – six full timesteps nstep 6 – 03:00 <- HEMCO 3D read/regrid nstep 7 – 03:30 nstep 8 – 04:00 <- HEMCO 3D read/regrid nstep 9 – 04:30 nstep10 – 05:00 <- HEMCO 3D read/regrid nstep 11 – 05:30 nstep 12 – 06:00 (partial)
etc.
The ERP_Ln9_* test does an initial run of 5 steps and then does a 4 step restart. For a model configuration with a 1/2 hour timestep, that means the model is restarting at 9000 seconds (you can see this in the timestamp of the restart file) and so HEMCO is initializing to a different state for timestep 6 than it would have if it were doing a continuous 6 timestep run. This clearly won't pass the restart test.
The fact that CAM's timeloop does some calculations (and writes a confusing message about it to the log file) after the restart file is written and before control returns to the coupler is not relevant to this restart failure. The calculations that are done after the restart file is written are redone during the initialization phase of a restart run (don't ask why:).
CAM has historically required being able to restart from an arbitrary timestep. The benefit of that is mainly for debugging purposes. It sounds like HEMCO could be made to restore its state at a non-hour boundary if it saved the edge pressures from the correct hour boundary to the restart file as suggested by Lizzie. I think this is the best solution if we want to keep the requirement that CAM be able to restart from an arbitrary timestep.
In the meantime I think we should adjust the restart tests with HEMCO to only restart on an hour boundary. I don't know the cime testing infrastructure well enough to know the best way to do this, but I assume it's not hard.
One thing keep on the radar though is that the following tests seem to randomly fail baseline comparisions, and this is not related to restarts: SMS_Ld1.f09_f09_mg17.FCHIST_GC.derecho_intel.cam-outfrq1d (Overall: DIFF) SMS_Lh12.f09_f09_mg17.FCSD_HCO.derecho_intel.cam-outfrq3h (Overall: DIFF)
It sounds like HEMCO could be made to restore its state at a non-hour boundary if it saved the edge pressures from the correct hour boundary to the restart file as suggested by Lizzie. I think this is the best solution if we want to keep the requirement that CAM be able to restart from an arbitrary timestep.
From a practical and user-friendly perspective, this seems like a requirement we should keep! We don't need the added hassle of making sure STOP_N lands on the hour for any arbitrary ATM_NCPL, especially if this becomes the default emissions driver in CAM7. Thanks for the clarification Brian.
@brian-eaton - I would suggest we NOT change the HEMCO test so that it uses a setup which passes. Since HEMCO is targeted to be a default setup in CESM, I want to make sure we remember it is broken until it is fixed. I know it is a pain to have a continuously failing regression test, but in this case, I think it is important to keep it failing until a fix is provided for HEMCO.
I would argue that having a test that fails for a known reason can mask an unanticipated failure. I think tests that pass are more informative. A way to keep track of a feature we want that doesn't exist is through an open issue.
I support saving the last hour edge pressure to use with HEMCO so that runs are allowed to start/stop off the hour.
Separate from this I am still confused about why my first run completed 7 full timesteps plus 1 partial. In my tests I did not use the ERP in the test suite but instead manually created two cases. One of the tests used STOP_N set to 3 while the other had it set to 6. I ran the one with STOP_N=3 twice, first using initial values second using CONTINUE_RUN=TRUE. Unlike the ERP test I would think this would do an even number of 30min timesteps. Is there a namelist parameter I am not setting correctly?
@lizziel, if your tests are on derecho and you point me to them I'll take a look.
@brian-eaton, thank you for taking a look. My single (6hr) and double (two 3hr) runs, both 30min and 1hr timesteps, are archived here:
/glade/derecho/scratch/elundgren/case.FCHIST_HCO_single/run_aeic2019_NO_30min_timestep
/glade/derecho/scratch/elundgren/case.FCHIST_HCO_single/run_aeic2019_NO_1hr_timestep
/glade/derecho/scratch/elundgren/case.FCHIST_HCO_double/run_aeic2019_NO_30min_timestep
/glade/derecho/scratch/elundgren/case.FCHIST_HCO_double/run_aeic2019_NO_1hr_timestep
I wrote a script to compare the cam.r.* files. Results are below. The first few lines show the files being compared.
Results of 30min timestep test:
~/scratch/case.FCHIST_HCO_double $ ./compare_files.sh aeic2019_NO_30min_timestep
../case.FCHIST_HCO_single/run_aeic2019_NO_30min_timestep/case.FCHIST_HCO_single.cam.r.2010-01-01-10800.nc
../case.FCHIST_HCO_single/run_aeic2019_NO_30min_timestep/case.FCHIST_HCO_single.cam.r.2010-01-01-21600.nc
run_aeic2019_NO_30min_timestep/case.FCHIST_HCO_double.cam.r.2010-01-01-10800.nc
run_aeic2019_NO_30min_timestep/case.FCHIST_HCO_double.cam.r.2010-01-01-21600.nc
A total number of 625 fields were compared
of which 0 had non-zero differences
and 0 had differences in fill patterns
and 0 had different dimension sizes
and 0 had different data types
A total number of 17 fields could not be analyzed
A total number of 0 fields on file 1 were not found on file 2.
A total number of 0 fields on file 2 were not found on file 1.
diff_test: the two files seem to be IDENTICAL
A total number of 625 fields were compared
of which 465 had non-zero differences
and 0 had differences in fill patterns
and 0 had different dimension sizes
and 0 had different data types
A total number of 17 fields could not be analyzed
A total number of 0 fields on file 1 were not found on file 2.
A total number of 0 fields on file 2 were not found on file 1.
diff_test: the two files seem to be DIFFERENT
Results of 1hr timestep test:
~/scratch/case.FCHIST_HCO_double $ ./compare_files.sh aeic2019_NO_1hr_timestep/
../case.FCHIST_HCO_single/run_aeic2019_NO_1hr_timestep//case.FCHIST_HCO_single.cam.r.2010-01-01-10800.nc
../case.FCHIST_HCO_single/run_aeic2019_NO_1hr_timestep//case.FCHIST_HCO_single.cam.r.2010-01-01-21600.nc
run_aeic2019_NO_1hr_timestep//case.FCHIST_HCO_double.cam.r.2010-01-01-10800.nc
run_aeic2019_NO_1hr_timestep//case.FCHIST_HCO_double.cam.r.2010-01-01-21600.nc
A total number of 625 fields were compared
of which 0 had non-zero differences
and 0 had differences in fill patterns
and 0 had different dimension sizes
and 0 had different data types
A total number of 17 fields could not be analyzed
A total number of 0 fields on file 1 were not found on file 2.
A total number of 0 fields on file 2 were not found on file 1.
diff_test: the two files seem to be IDENTICAL
A total number of 625 fields were compared
of which 0 had non-zero differences
and 0 had differences in fill patterns
and 0 had different dimension sizes
and 0 had different data types
A total number of 17 fields could not be analyzed
A total number of 0 fields on file 1 were not found on file 2.
A total number of 0 fields on file 2 were not found on file 1.
diff_test: the two files seem to be IDENTICAL
Two consecutive run logs using 30min timestep, with grep results
The log for the first 30min timestep run:
~/scratch/case.FCHIST_HCO_double/run_aeic2019_NO_30min_timestep $ grep "nstep\|Set HEMCO clock" atm.log.6611587.desched1.241113-103513
nstep, te 0 0.25981444534833994E+10 0.25981444534833994E+10 0.00000000000000000E+00 0.98517093638948078E+05 0.22552395239472389E+03
Set HEMCO clock to 2010-01-01 00:00:00
nstep, te 1 0.25978798502492533E+10 0.25979003402337537E+10 0.11356759501907610E-02 0.98516505232345444E+05 0.22552395239472389E+03
Set HEMCO clock to 2010-01-01 00:30:00
nstep, te 2 0.25978850270017157E+10 0.25978866429912839E+10 0.89567667871924043E-04 0.98516524618494484E+05 0.22552395239472389E+03
Set HEMCO clock to 2010-01-01 01:00:00
nstep, te 3 0.25978674995376434E+10 0.25978698021196389E+10 0.12762267041154914E-03 0.98516522409252182E+05 0.22552395239472389E+03
Set HEMCO clock to 2010-01-01 01:30:00
nstep, te 4 0.25978178776109223E+10 0.25978193671641116E+10 0.82559805615334378E-04 0.98516539872440655E+05 0.22552395239472389E+03
Set HEMCO clock to 2010-01-01 02:00:00
nstep, te 5 0.25977797488725424E+10 0.25977796423065767E+10 -0.59065092744659974E-05 0.98516603353383412E+05 0.22552395239472389E+03
Set HEMCO clock to 2010-01-01 02:30:00
nstep, te 6 0.25977395788309808E+10 0.25977398829856963E+10 0.16858025724902301E-04 0.98516650287634140E+05 0.22552395239472389E+03
Set HEMCO clock to 2010-01-01 03:00:00
nstep, te 7 0.25977030025441728E+10 0.25977034453752980E+10 0.24544269848086294E-04 0.98516690941469933E+05 0.22552395239472389E+03
The log for the second 30min timestep run:
~/scratch/case.FCHIST_HCO_double/run_aeic2019_NO_30min_timestep $ grep "nstep\|Set HEMCO clock" atm.log.6611639.desched1.241113-104502
nstep, te 7 0.25977030025441728E+10 0.25977034453752980E+10 0.24544269848086294E-04 0.98516690941469933E+05 0.22552395239472389E+03
Set HEMCO clock to 2010-01-01 03:30:00
nstep, te 8 0.25976659331167064E+10 0.25976665821389523E+10 0.35972566028214541E-04 0.98516723320908553E+05 0.22552395239472389E+03
Set HEMCO clock to 2010-01-01 04:00:00
nstep, te 9 0.25976323912652130E+10 0.25976328323850460E+10 0.24449402782272903E-04 0.98516760575487977E+05 0.22552395239472389E+03
Set HEMCO clock to 2010-01-01 04:30:00
nstep, te 10 0.25975979038978596E+10 0.25975985615967813E+10 0.36453453032763774E-04 0.98516789498765429E+05 0.22552395239472389E+03
Set HEMCO clock to 2010-01-01 05:00:00
nstep, te 11 0.25975666961468062E+10 0.25975673227889829E+10 0.34732099360906552E-04 0.98516819002134420E+05 0.22552395239472389E+03
Set HEMCO clock to 2010-01-01 05:30:00
nstep, te 12 0.25975351245179186E+10 0.25975358664776978E+10 0.41123650966508243E-04 0.98516844262055791E+05 0.22552395239472389E+03
Set HEMCO clock to 2010-01-01 06:00:00
nstep, te 13 0.25975072106202283E+10 0.25975077808493266E+10 0.31605345496482084E-04 0.98516874617956593E+05 0.22552395239472389E+03
A 3rd run (not saved) starts at 6:30, consistent with the second run above. But the first run includes an extra timestep, which throws subsequent runs off to start on the off-hour.
Note that I am getting the time based on HEMCO clock messages. Could the HEMCO clock be incorrect? The HEMCO clock is displayed before the read/regrid. Perhaps it is a half-hour late upon initialization? I am particularly interested to know if nstep 0 in the very first run is a sort of dummy timestep in which case nstep 1 would be 00:00 and indicate the HEMCO clock is indeed off.
I also want to note, in case anyone tries to replicate this, that I am using a modified HEMCO configuration input file. The default is to read these emissions monthly. I updated it to read hourly.
Hi @lizziel. I've looked at your runs and see that you only had a successful restart when you changed the cam timestep to 1hr. I misunderstood your previous comments to imply that if cam restarted on an hour boundary then the restart would be successful. But that does not appear to be the case. Your run in ~elundgren/scratch/case.FCHIST_HCO_double/run_aeic2019_NO_30min_timestep using a 1/2 hr timestep does a 6 step initial run followed by a 6 step restart run. Your runs aren't producing any history output since the default of monthly average output is used. But if, for example, every timestep were output (nhtfrq=1), then you'd see that the initial 6 step run outputs 6 time samples with the last one ending at hour 3.0. The restart file contains the timestamp 2010-01-01-10800 indicating that the model will restart from hour 3.0. But as you've noticed the restart file does not contain the exact model state corresponding to hour 3.0, but rather contains state that is
part way through the timestep that starts at hour 3.0. In the restart run, the restart file is read and the step from hour 3.0 to hour 3.5 is finished during the initialization phase of that run.
The lines of output in the cam.log file beginning with 'nstep, te' are total energy diagnostics. This output is not issued at the end of a timestep. It is issued after the dycore runs (from the call to check_energy_gmean).
CAM's internal clock is initialized to the initial time for the run and nstep is set to 0. nstep remains 0 during the initialization phase which includes some convective adjustments to the initial state. The first time ModelAdvance is called, nstep is still 0, and the call to wshist will write the state that is stored in the history buffers from any outfld calls made while nstep=0. This output was on by default until fairly recently (cam6_4_141), but is now only done if requested since it is not the result of a model timestep. Rather it is just what convective adjustments have done to the initial conditions. After this CAM's internal clock is updated (by the call to advance_timestep) to the time at the end of the first timestep, and nstep is set to 1. It looks from the log output posted that the HEMCO clock is in synch with CAM's, although I can't tell when it is updated relative to CAM's update.
I think the bottom line here is that since HEMCO's state depends on CAM's state, the way to restart exactly will be to write either HEMCO's state, or the CAM state needed by HEMCO to recompute its state, to the restart file. This is a common thing for CAM's physics parameterizations to do. For example mozart chemistry uses the routines chem_init_restart, chem_write_restart, and chem_read_restart for this purpose.
I hope this helps. I'm happy to clarify further as needed.
I've looked at your runs and see that you only had a successful restart when you changed the cam timestep to 1hr. I misunderstood your previous comments to imply that if cam restarted on an hour boundary then the restart would be successful. But that does not appear to be the case.
Correct. However, I believe that if cam is restarted on an hour boundary it should be successful, since the emissions would be read and regridded in the restarted run in the same way as a continuous run without restart. I am trying to determine why this is not the case to ensure there is not a bug in HEMCO.
HEMCO treats clock time as start of timestep until the timestep is complete. From your response it sounds like CAM time is set to end of timestep when the first timestep (nstep 1) begins, which would conflict with HEMCO's interpretation of time as start of timestep. This would explain why the start time of HEMCO clock in the restart run is 3:30 rather than 3:00. Could you confirm?
Yes, CAM sets the "current" time to be at the end of the timestep interval. What's confusing is that it doesn't advance the clock at the beginning of a new timestep. Rather, it advances the clock after a completed timestep has been (or could have been) written to the history file. I'm not sure whether or not that distinction is important for this discussion.
During the final call to ModelAdvance in the initial 6 step run, the restart file is written with state representing a partially done step 7, and advance_timestep is called which sets nstep=7 and the current time to
the end of step 7, i.e., 3:30.
During the initialization phase of the restart run, CAM's clock is set to match its state at the end of the previous run. In this example the clock is set to step 7 and the current time is set to 3:30. This is shown in the * Time Manager Configuration * output in the log file. If the HEMCO clock is set to the time at the beginning of the timestep, then it should be set to 3:00 at the beginning of the restart run. Does that make sense?
Yes, this makes sense. I will work on the HEMCO clock to fix this since it is indeed a bug.
Setting the HEMCO clock to the previous time indeed fixes the issue in my 3hr tests when using 30min timesteps. I do not expect the 9s ERP test to pass but would expect a 3hr one to pass in the test suite. Until we have a comprehensive fix would you be okay if we add such a test to verify one passes but the other fails? I will do it locally and can add it to my PR.
I will prepare a PR with my updates. In the meantime, could someone copy the following to inputdata? I do not have permissions to put files there.
Copy: /glade/u/home/elundgren/config/HEMCO_Config.CC.CEDS_AEIC19.NEx.c241122.rc
To: /glade/campaign/cesm/cesmdata/cseg/inputdata/atm/cam/geoschem/emis
I agree that we should add a restart test that works.
I copied the HEMCO_Config file to the inputdata location.
@brian-eaton, could you point me to where the first run of the ERP/ERS tests is set to 5 timesteps? I tried changing the total from 9 timesteps to 8 in the testdefs files but the result is still 5 timesteps for the first leg. Is it somewhere in cime?
I don't know. I've looked at the cime testing documentation and don't see a way to control this using a standard test modifier. @jedwards4b, do you know, or can you point us in the right direction?
I think you might have to define your own test type (not unheard of for CAM) because I think the issue is in
<cime>/CIME/SystemTests/system_tests_common.py in the function, _set_restart_interval in these calculations:
stop_n = int(stop_n * factor // coupling_secs)
rest_n = math.ceil((stop_n // 2 + 1) * coupling_secs / factor)
Does a 10 step test result in a 6 step run + a 4 step restart?
Thanks @gold2718! I think we are using different cime versions since mine does not have what you show. However, I found the equivalent code which sets REST_N in CIME/data/config/config_tests.xml.
CIME/data/config/config_tests.xml:543: <REST_N>$STOP_N / 2 + 1</REST_N>
Each of the different test types has the definition of REST_N there. I am doing a test by changing the existing ERS test REST_N. For a solution to submit I will create new test, ERS3 (ERS2 is taken). Do I also need to make an ers3.py file within CIME/SystemTests? I assume I then reference it in testdefs with something like 'ERS3_Ln9'.
@brian-eaton, I initially tested the equivalent with an 8 step test, with hopes it would be 4+4. It was still 5 I believe because REST_N is set as STOP_N / 2 + 1, as @gold2718 found and I did also in a separate location.
Aha, @brian-eaton, I see now what you are saying. 10 / 2 + 1 is 6. I'll try that. Thanks!
Doing a 10 step test indeed worked.
~/scratch $ cd ERS_Ln10.f09_f09_mg17.FCSD_HCO.derecho_intel.cam-outfrq10s.20241204_083007_0fm9b0/
~/scratch/ERS_Ln10.f09_f09_mg17.FCSD_HCO.derecho_intel.cam-outfrq10s.20241204_083007_0fm9b0 $ cat TestStatus
PASS ERS_Ln10.f09_f09_mg17.FCSD_HCO.derecho_intel.cam-outfrq10s CREATE_NEWCASE
PASS ERS_Ln10.f09_f09_mg17.FCSD_HCO.derecho_intel.cam-outfrq10s XML
PASS ERS_Ln10.f09_f09_mg17.FCSD_HCO.derecho_intel.cam-outfrq10s SETUP
PASS ERS_Ln10.f09_f09_mg17.FCSD_HCO.derecho_intel.cam-outfrq10s SHAREDLIB_BUILD time=248
PASS ERS_Ln10.f09_f09_mg17.FCSD_HCO.derecho_intel.cam-outfrq10s MODEL_BUILD time=409
PASS ERS_Ln10.f09_f09_mg17.FCSD_HCO.derecho_intel.cam-outfrq10s SUBMIT
PASS ERS_Ln10.f09_f09_mg17.FCSD_HCO.derecho_intel.cam-outfrq10s RUN time=614
PASS ERS_Ln10.f09_f09_mg17.FCSD_HCO.derecho_intel.cam-outfrq10s COMPARE_base_rest
PASS ERS_Ln10.f09_f09_mg17.FCSD_HCO.derecho_intel.cam-outfrq10s MEMLEAK insufficient data for memleak test
PASS ERS_Ln10.f09_f09_mg17.FCSD_HCO.derecho_intel.cam-outfrq10s SHORT_TERM_ARCHIVER
You can expect a PR from me with the fix soon.
@lizziel Just returning from vacation and trying to catch up on the status of this issue - if I understand correctly the test will pass only if you specify a specific number of timesteps? (10 in this case) - that doesn't seem like an acceptable outcome to me - it should be able to pass regardless of the number of steps.
Hi Jim. The issue is that HEMCO currently only restarts exactly from an hour boundary. Changing the number of steps to 10 allowed that to happen. We still have the 9 step test that fails to keep it on our radar.
Got it. Thanks for the clarification.
Please note the fix I came up with solves the issue with ERS tests. ERP still fails which needs further investigation. That test is the same as ERS but changes the tasks/threading counts on restart.