
CPL: Robustly recover from failures during restart-file writing.

Open ambrad opened this issue 1 year ago • 1 comments

A crash during restart writing can leave the rpointer files inconsistent or incomplete. While we can't salvage the restart files in general, we can at least provide a consistent set of rpointer files -- namely, the previous ones -- for the next restart.

The rpointer manager provides this capability by copying all rpointer.X files to rpointer.X.prev before components write restart files, then removing rpointer.X.prev files when all components are done.

If a crash occurs midway through, the next restart uses the consistent rpointer.X.prev files.
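
For readers who want the sequence spelled out, here is a minimal Python sketch of the backup/cleanup/recovery cycle described above. It is not the code in this PR: the function names (prepare_restart, finish_restart, recover_on_startup) are hypothetical, and restoring by renaming rpointer.X.prev over rpointer.X is an assumption about how the .prev files get "used" on restart.

```python
import os
import shutil
from pathlib import Path


def prepare_restart(run_dir):
    """Copy every rpointer.X to rpointer.X.prev before components write restarts."""
    for ptr in sorted(Path(run_dir).glob("rpointer.*")):
        if ptr.suffix == ".prev":
            continue  # skip backups left over from an earlier cycle
        shutil.copy2(ptr, ptr.with_name(ptr.name + ".prev"))


def finish_restart(run_dir):
    """Remove the backups once all components have finished writing."""
    for prev in sorted(Path(run_dir).glob("rpointer.*.prev")):
        prev.unlink()


def recover_on_startup(run_dir):
    """Leftover .prev files mean a write was interrupted; restore them.

    Assumption: recovery renames rpointer.X.prev over rpointer.X.
    Returns the names of the rpointer files that were restored.
    """
    restored = []
    for prev in sorted(Path(run_dir).glob("rpointer.*.prev")):
        target = prev.with_name(prev.name[: -len(".prev")])
        os.replace(prev, target)  # atomic rename on POSIX filesystems
        restored.append(target.name)
    return restored
```

In this sketch a normal cycle is prepare_restart, then the component writes, then finish_restart. After a crash between the first and last step, recover_on_startup puts the previous, mutually consistent pointers back in place.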

[BFB]

ambrad avatar Feb 20 '24 19:02 ambrad

I've tested this feature for efficacy in both E3SM and SCREAM. e3sm_integration passes (except for two tests that are also failing on the dashboard) on Chrysalis.

ambrad avatar Feb 20 '24 19:02 ambrad

I tested with a 30-day ultra-low-res coupled run. I set REST_N to 5 days and saw the rpointer.mod.prev files appear and disappear around the time restarts were being written. The rpointer.mod.prev file is a little smaller because it does a better job of trimming whitespace.

[jacob@chrlogin1 run]$ ll rpoint*
-rw-r--r-- 1 jacob cels 344 Mar 26 14:45 rpointer.atm
-rw-r--r-- 1 jacob cels 328 Mar 26 14:46 rpointer.atm.prev
-rw-r--r-- 1 jacob cels 257 Mar 26 14:45 rpointer.drv
-rw-r--r-- 1 jacob cels  38 Mar 26 14:46 rpointer.drv.prev
-rw-r--r-- 1 jacob cels  21 Mar 26 14:45 rpointer.ice
-rw-r--r-- 1 jacob cels  21 Mar 26 14:46 rpointer.ice.prev
-rw-r--r-- 1 jacob cels 257 Mar 26 14:45 rpointer.lnd
-rw-r--r-- 1 jacob cels  40 Mar 26 14:46 rpointer.lnd.prev
-rw-r--r-- 1 jacob cels  21 Mar 26 14:46 rpointer.ocn
-rw-r--r-- 1 jacob cels  21 Mar 26 14:46 rpointer.ocn.prev
-rw-r--r-- 1 jacob cels 257 Mar 26 14:46 rpointer.rof
-rw-r--r-- 1 jacob cels  43 Mar 26 14:46 rpointer.rof.prev

rljacob avatar Mar 26 '24 19:03 rljacob

Thanks @ambrad. Merge whenever you're ready.

rljacob avatar Mar 27 '24 01:03 rljacob