
Checkpoint restart memory in case of crash

Open: nichannah opened this issue 9 years ago • 1 comment

@mjharriso had the idea of implementing a way to checkpoint the restart memory regularly. Then, when the model crashes, an exception handler can access the saved memory and write out a restart.

@Hallberg-NOAA outlined a way that this could be done. Make a second instance of the restart_CS which, instead of containing pointers to model field arrays, contains pointers to separately allocated memory. The checkpointing routine would copy all the latest data pointed to by the regular restart_CS into that allocated memory.

The checkpoint would be written out by calling into the regular MOM_restart interface using the checkpoint instance of the restart_CS.
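The scheme above can be sketched in a few lines. This is only an illustration in Python, not MOM6 code: the `RestartCS` class, `make_checkpoint_cs`, `update_checkpoint` and `write_restart` are hypothetical stand-ins for the restart_CS type and the MOM_restart interface, and JSON stands in for the real restart file format.

```python
import io
import json


class RestartCS:
    """Hypothetical stand-in for restart_CS: a registry of named fields."""

    def __init__(self):
        self.fields = {}

    def register(self, name, array):
        # Store a reference, not a copy, so the registry tracks the array it is given.
        self.fields[name] = array


def make_checkpoint_cs(live_cs):
    """Build a second registry whose entries point at freshly allocated storage."""
    ckpt_cs = RestartCS()
    for name, array in live_cs.fields.items():
        ckpt_cs.register(name, [0.0] * len(array))
    return ckpt_cs


def update_checkpoint(live_cs, ckpt_cs):
    """Copy the latest model data into the checkpoint's own storage, in place."""
    for name, array in live_cs.fields.items():
        ckpt_cs.fields[name][:] = array


def write_restart(cs, stream):
    """One writer, usable with either the live or the checkpoint registry."""
    json.dump(cs.fields, stream, sort_keys=True)


# Usage: checkpoint a good state, then let the live state go bad.
temperature = [10.0, 11.0, 12.0]
live = RestartCS()
live.register("temp", temperature)

ckpt = make_checkpoint_cs(live)
update_checkpoint(live, ckpt)

temperature[0] = float("nan")  # simulated corruption after the checkpoint

buf = io.StringIO()
write_restart(ckpt, buf)       # same interface, checkpoint instance
restart_text = buf.getvalue()
```

The point of the design is that the writer never needs to know which instance it was handed, so no second output path has to be maintained.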

What's not clear is how, or whether, the exception handler can get access to the checkpoint restart_CS and the other program memory needed to dump a restart.
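One possible shape for this, sketched in Python rather than Fortran, is to park a reference to the checkpoint in module-level state that the handler can reach without being passed any arguments. Everything here is an assumption for illustration: `_checkpoint`, `set_checkpoint` and `crash_handler` are hypothetical names, and whether a comparable module variable is safely reachable from inside an MPI error handler is exactly the open question.

```python
import json
import os
import sys
import tempfile

# Module-level slot (hypothetical): the handler receives no user arguments,
# so it locates the checkpoint through this global.
_checkpoint = {"fields": None, "path": None}


def set_checkpoint(fields, path):
    """Record the latest checkpoint data and where to dump it on a crash."""
    _checkpoint["fields"] = fields
    _checkpoint["path"] = path


def crash_handler(exc_type, exc, tb):
    """Has the sys.excepthook signature: write the checkpoint, then report."""
    if _checkpoint["fields"] is not None:
        with open(_checkpoint["path"], "w") as f:
            json.dump(_checkpoint["fields"], f)
    # Fall through to the default reporting so the original error is not lost.
    sys.__excepthook__(exc_type, exc, tb)


# In a real run the handler would be installed once at start-up:
#     sys.excepthook = crash_handler
# Here it is called directly on a simulated failure.
dump_path = os.path.join(tempfile.mkdtemp(), "crash_restart.json")
set_checkpoint({"temp": [10.0, 11.0, 12.0]}, dump_path)
try:
    raise FloatingPointError("simulated model crash")
except FloatingPointError:
    crash_handler(*sys.exc_info())
```

This only covers failures that surface as ordinary exceptions; a hard abort that never returns control to the program would bypass it.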

If we are going to write MPI exception handlers, it would also be worth adding something to dump a stack trace. For example, the Intel compilers provide tracebackqq().
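As a rough analogue of what such a dump looks like from inside a handler, here is a minimal Python sketch using the standard `traceback` module. It is not tracebackqq() and says nothing about how the Fortran side would be wired up; `stack_trace_text` and `failing_step` are hypothetical names.

```python
import traceback


def stack_trace_text():
    """Return the current call stack as text, innermost frame last."""
    return "".join(traceback.format_stack())


def failing_step():
    # Stand-in for a model routine that detects a problem and asks for a dump.
    return stack_trace_text()


trace = failing_step()
```

The returned text can be written to the same place as the crash restart, so the two end up side by side.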

nichannah, May 21 '15 16:05

Perhaps another good thing to do within an MPI exception handler would be to dump the FP exception register.

nichannah, May 21 '15 16:05