flux-core
perilog: consider raising an exception when an epilog fails
Problem: The perilog plugin currently does not raise an exception when the epilog fails, as documented in this comment:
https://github.com/flux-framework/flux-core/blob/d4cdf62a1ddc1ea636afe4918e6c34e118fabf23/src/modules/job-manager/plugins/perilog.c#L422-L429
This made sense when job-manager epilogs were mostly meant for administrative cleanup (in fact, in some situations you may not want to notify users of epilog failures), but now that housekeeping handles most of the administrative work, this behavior may need to be revisited.
As an example, a job-manager epilog may be used to flush or move data from the compute nodes, so that the job does not emit the clean event and become inactive while data has not yet been moved. In this case the user should indeed be notified of an exception during this epilog.
Perhaps a non-fatal exception should be raised if the epilog fails or times out, so it appears in the job eventlog, is emitted by flux job attach, etc.
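To make the proposal concrete, here is a minimal sketch of the decision logic only. It is not perilog code: `EpilogResult`, `JobException`, and `epilog_exception` are hypothetical names invented for illustration, and the exception type string "epilog" is an assumption. The one fact it relies on is that Flux treats severity 0 as fatal, so any non-zero severity is non-fatal. In the plugin itself, the raise would presumably go through the same jobtap exception call that perilog already uses for prolog failures.

```python
from dataclasses import dataclass
from typing import Optional

# Severity 0 is fatal in Flux; any non-zero severity is non-fatal.
NONFATAL_SEVERITY = 1


@dataclass
class EpilogResult:
    """Hypothetical summary of a finished epilog (not a flux-core type)."""
    exit_code: int
    timed_out: bool = False


@dataclass
class JobException:
    """Hypothetical stand-in for an exception event in the job eventlog."""
    type: str
    severity: int
    note: str


def epilog_exception(result: EpilogResult) -> Optional[JobException]:
    """Return the non-fatal exception to raise, or None if the epilog succeeded."""
    if result.timed_out:
        return JobException("epilog", NONFATAL_SEVERITY, "epilog timed out")
    if result.exit_code != 0:
        return JobException(
            "epilog",
            NONFATAL_SEVERITY,
            f"epilog failed with exit code {result.exit_code}",
        )
    return None
```

Because the severity is non-zero, the exception would inform the user without changing how the job is otherwise handled.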