payu

Saving error logs from a hanging PBS job

Open marshallward opened this issue 6 years ago • 4 comments

While we appear to be saving error logs for crashed jobs into error_logs in the archive, I seem to be losing information from hanging jobs, which run indefinitely until the scheduler eventually kills them.

This is presumably because PBS kills the Python process before the model has exited (via SIGTERM or otherwise), so the log-saving code never gets to run.

We should probably investigate this a little more and also monitor PBS state, if possible. It may not actually be possible to call any code at the Python level after exceeding job time.
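
One thing that might be worth testing: PBS usually sends SIGTERM to the job before following up with SIGKILL after a short delay, so a Python-level SIGTERM handler may get a brief window to copy the logs. Below is a minimal sketch of that idea, not existing payu code. The names save_error_logs and make_sigterm_handler, the log paths, and the archive layout are all hypothetical, and whether the handler has time to finish depends on the site's kill delay. Nothing can run on SIGKILL.

```python
import os
import shutil
import signal
import sys


def save_error_logs(log_paths, archive_dir):
    """Copy whichever of the given log files exist into archive_dir/error_logs.

    Hypothetical helper: the paths and directory layout are assumptions.
    Returns the list of destination paths that were written.
    """
    dest = os.path.join(archive_dir, "error_logs")
    os.makedirs(dest, exist_ok=True)
    saved = []
    for path in log_paths:
        if os.path.isfile(path):
            saved.append(shutil.copy2(path, dest))
    return saved


def make_sigterm_handler(log_paths, archive_dir):
    """Build a signal handler that saves the logs and then exits."""
    def handler(signum, frame):
        save_error_logs(log_paths, archive_dir)
        # Conventional exit status for "terminated by signal N".
        sys.exit(128 + signum)
    return handler


def install_sigterm_handler(log_paths, archive_dir):
    """Register the handler; must be called from the main thread."""
    signal.signal(signal.SIGTERM, make_sigterm_handler(log_paths, archive_dir))
```

The handler would need to be installed before the model is launched, for example with install_sigterm_handler(["job.out", "job.err"], "archive"). If the scheduler goes straight to SIGKILL, or the delay is too short for the copy to complete, this will not help, which is the open question in this issue.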

marshallward · Jan 22 '18 00:01