
Add option to write out variable names before history write to catch problems in the write

Open ekluzek opened this issue 4 years ago • 2 comments

We sometimes have problems like #1545 where the model fails while writing history variables. It would be easier to identify which field fails if the model wrote a line to the log file, just before each history write, naming the field it is about to write. This would be an option that you turn on with a namelist flag, with a name something like

hist_debug_write

An important part of this would be that it does a "sys_flush" after the log write, and that it does so from every processor.
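To illustrate the idea above, here is a minimal sketch in Python rather than CTSM's Fortran. It is not CTSM code: `write_fields`, `write_one`, the `debug_write` argument (standing in for the proposed `hist_debug_write` flag), and the field names are all hypothetical. It only shows why the flush matters: the field name reaches the log before the write that may fail.

```python
import io
import sys


def write_fields(fields, write_one, log=sys.stderr, debug_write=False, rank=0):
    """Write each (name, data) pair, optionally logging the name first.

    `write_one` is a hypothetical stand-in for the real history write.
    """
    for name, data in fields:
        if debug_write:
            # Every rank logs the field it is about to write ...
            log.write(f"(rank {rank}) hist write: about to write field {name}\n")
            # ... and flushes immediately, so the line is not lost in a
            # buffer if the write below kills the process.
            log.flush()
        write_one(name, data)


def demo():
    """Simulate a failure on the second field; return the last logged line."""
    log = io.StringIO()

    def failing_write(name, data):
        if name == "FIELD_B":  # hypothetical field that triggers the failure
            raise FloatingPointError("bad value in " + name)

    fields = [("FIELD_A", 1.0), ("FIELD_B", float("nan")), ("FIELD_C", 3.0)]
    try:
        write_fields(fields, failing_write, log=log, debug_write=True, rank=3)
    except FloatingPointError:
        pass
    return log.getvalue().splitlines()[-1]
```

In this sketch the last line in the log names the field whose write failed, which is the information the proposed option is meant to provide.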

ekluzek avatar Nov 11 '21 16:11 ekluzek

I like the idea of being able to see this information more easily. I'm wondering, though, if there's a more general mechanism that we could use for this situation and others (e.g., writing to restart files), rather than introducing a number of separate but similar mechanisms.

One thought is: I notice that we're inconsistent in our use of pio_seterrorhandling in ncdio_pio: in some situations we use that so that PIO returns an error status that we can explicitly check and then print a meaningful error message, but it looks like we don't do that for the relevant write. Would calling pio_seterrorhandling (setting it to PIO_BCAST_ERROR) allow us to catch the error and then print out information about the field being written? Or would the floating-point error still cause the model to abort in this situation, without PIO getting the chance to return to the caller? If this worked well, then this seems like a nice solution in that we'd always get the field name written out when there is an error, without needing to turn on a flag and rerun.
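The pattern I have in mind can be sketched roughly as follows. This is a Python illustration, not the PIO or ncdio_pio API: `checked_write` and `write_one` are hypothetical, and `write_one` is assumed to return a status code (0 for success) instead of aborting internally.

```python
def checked_write(write_one, name, data):
    """Call a status-returning write and report the field name on failure.

    `write_one` is a hypothetical stand-in for a library write that has
    been told to return an error status rather than abort.
    """
    status = write_one(name, data)
    if status != 0:
        # The caller knows which field it was writing, so it can say so.
        raise RuntimeError(
            f"history write failed for field {name} (status {status})"
        )
    return status
```

This only helps if the library actually gets to return. If the process is killed by a signal during the write, the status check is never reached.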

The other thing I'm wondering is if you can already get this information by setting PIO_DEBUG_LEVEL appropriately high. If not, should we request that this be folded into that option so that this one flag can be used for all purposes like this?

billsacks avatar Nov 11 '21 18:11 billsacks

That's a good point about the error handling. I think you are right that we should structure it that way. In this case I'm not sure it would help, though, since it's a floating-point exception, so it's dying with a signal abort that's probably not going to allow the normal PIO error handling to happen. But I'm not completely sure either.

That's also a good point about handling this with PIO_DEBUG_LEVEL; that would avoid adding another flag.

ekluzek avatar Nov 11 '21 19:11 ekluzek