perilog: consider raising an exception when an epilog fails

Open grondo opened this issue 4 months ago • 0 comments

Problem: The perilog plugin currently does not raise an exception when the epilog fails, as documented in this comment:

https://github.com/flux-framework/flux-core/blob/d4cdf62a1ddc1ea636afe4918e6c34e118fabf23/src/modules/job-manager/plugins/perilog.c#L422-L429

This made sense when job-manager epilogs were mostly meant for administrative cleanup (in some situations you may not even want to notify users of epilog failures), but now that housekeeping handles most administrative tasks, this may need to be revisited.

As an example, a job-manager epilog may be used to flush or move data from the compute nodes, so that the job does not emit the clean event and become inactive while data has not yet been moved. In this case the user should indeed be notified of an exception during this epilog.

Perhaps a non-fatal exception should be raised if the epilog fails or times out, so it appears in the job eventlog, is emitted by flux job attach, etc.
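To make the proposal concrete, here is a minimal Python sketch of the decision logic, not the plugin's actual C code. It assumes the eventlog exception layout from RFC 21 (a `context` with `type`, `severity`, and `note`, where severity 0 is fatal). The helper name `epilog_exception_event`, the exception type `"epilog"`, and the severity value 1 are all hypothetical choices for illustration.

```python
import time

# Assumption: per RFC 21, severity 0 is fatal to the job and any higher
# value is non-fatal. The value 1 is picked only for illustration.
NONFATAL_SEVERITY = 1


def epilog_exception_event(exit_code, timed_out, timestamp=None):
    """Build an eventlog-style exception entry for a failed epilog.

    Hypothetical helper: returns None when the epilog succeeded, otherwise
    a dict shaped like a job eventlog entry.
    """
    if not timed_out and exit_code == 0:
        return None  # epilog succeeded, nothing to report

    if timed_out:
        note = "epilog timed out"
    else:
        note = f"epilog failed with exit code {exit_code}"

    return {
        "timestamp": time.time() if timestamp is None else timestamp,
        "name": "exception",
        "context": {
            "type": "epilog",  # assumed exception type name
            "severity": NONFATAL_SEVERITY,
            "note": note,
        },
    }
```

Because the severity is nonzero, an entry like this would be recorded in the eventlog and shown to the user without changing the job's result, which is the behavior suggested above.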

grondo, Oct 04 '24 23:10