flux-core
Tolerate some node-specific failures in job prolog
In a meeting today, several members of the team discussed how Flux could tolerate individual nodes failing some sort of prolog check while still allowing the job to proceed, ideally without those nodes. This would be useful on systems where prologs have a significant chance of failing, especially on systems with long queue times. Then if a user required 50 nodes, they could submit a request for 60 nodes and somehow indicate that they would be satisfied with 10 of those nodes failing the prolog.
One proposed approach was for the job to proceed with its whole set of nodes regardless of prolog failures, but with some flag set so that, if the job happens to be a Flux instance, those nodes would be drained or placed in a separate queue. Perhaps the job should simply fail if it is not a Flux instance?
There would need to be a mechanism for users to specify how many failures they would tolerate, and also a mechanism for prologs to indicate which specific nodes failed their checks.
A use-case is described here: https://github.com/flux-framework/flux-coral2/issues/364
Thanks @jameshcorbett!
One idea proposed by @garlick was to have the prolog exit with a special nonzero exit code. The perilog.so plugin could then collect the ranks on which this code occurred and raise a nonfatal job exception, somehow indicating the set of job ranks or hostnames that failed. The job could then act on this exception. If it is an instance of Flux, it could start with those nodes drained.
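A minimal sketch of that convention, purely for illustration: the reserved exit code value (75 here), the helper names, and the `{rank: exit_code}` collection step are all invented for this example, and nothing in perilog currently interprets prolog exit codes this way.

```python
import sys

# Hypothetical reserved exit code meaning "node-specific, potentially
# nonfatal prolog failure" -- the value is arbitrary for this sketch.
NONFATAL_PROLOG_RC = 75

def prolog_check_failed():
    """Prolog side: bail out with the reserved exit code."""
    sys.exit(NONFATAL_PROLOG_RC)

def failed_ranks(exit_codes):
    """Plugin side: given {rank: exit_code} collected from the per-rank
    prolog processes, return the ranks that reported the nonfatal code."""
    return {rank for rank, rc in exit_codes.items() if rc == NONFATAL_PROLOG_RC}

if __name__ == "__main__":
    print(failed_ranks({0: 0, 3: NONFATAL_PROLOG_RC, 7: 1}))  # {3}
```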
After typing that up, I realized one issue with a special exit code is that exiting early from the prolog due to a potentially nonfatal error is probably not a good idea. This could skip other parts of the prolog that are necessary for preparing a node for use by the user. The node may be drained, but the user could still log in, or choose to undrain the node.
In general, allowing a user to ignore any generic prolog failure and still get access to the resources is probably a security issue. Therefore, we may want to either fall back on the idea of removing the failed resources from R with a resource-update event, or come up with a way to have specific parts of the prolog be allowed to fail while still allowing the job to proceed.
The perilog plugin does receive output from the prolog script(s), so perhaps some sort of optional protocol or formatted control messages on stdout could be used to notify the plugin of these potentially recoverable errors. This would have the added benefit of allowing a specific error message to be added to the job exception.
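For illustration only, here is a rough sketch of what such a stdout protocol might look like. The `FLUX-PROLOG-NONFATAL:` marker and both helper functions are made up for this example; no such protocol exists in perilog today.

```python
import re
import sys

# Hypothetical marker a prolog could print on stdout to report a
# recoverable, node-specific failure (this protocol does not exist yet).
MARKER = "FLUX-PROLOG-NONFATAL:"

def report_nonfatal(message):
    """Prolog side: emit a formatted control message on stdout."""
    print(f"{MARKER} {message}", flush=True)

def scan_prolog_output(lines):
    """Plugin side: split prolog stdout into nonfatal error messages and
    ordinary output, so the messages can be attached to a job exception
    while normal output is still available for logging."""
    pattern = re.compile(rf"^{re.escape(MARKER)}\s*(.*)$")
    nonfatal, other = [], []
    for line in lines:
        match = pattern.match(line)
        if match:
            nonfatal.append(match.group(1))
        else:
            other.append(line)
    return nonfatal, other

if __name__ == "__main__":
    sample = [
        f"{MARKER} rabbit file system not mounted on this node",
        "normal prolog output",
    ]
    errors, _ = scan_prolog_output(sample)
    print("nonfatal errors:", errors, file=sys.stderr)
```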
> After typing that up, I realized one issue with a special exit code is that exiting early from the prolog due to a potentially nonfatal error is probably not a good idea. This could skip other parts of the prolog that are necessary for preparing a node for use by the user. The node may be drained, but the user could still log in, or choose to undrain the node.
That makes sense. Skipping other parts of the prolog would be bad. However, the use-case I was thinking of is not allowing an administrative prolog script to fail, but rather allowing a jobtap prolog action to identify problematic nodes (e.g. nodes that did not mount their rabbit file system) using an external service. In this case the problematic nodes would not cause any kind of security issue, and the actual administrative prolog scripts should be able to run to completion normally on those nodes.
> In general, allowing a user to ignore any generic prolog failure and still get access to the resources is probably a security issue. Therefore, we may want to either fall back on the idea of removing the failed resources from R with a resource-update event, or come up with a way to have specific parts of the prolog be allowed to fail while still allowing the job to proceed.
I agree that allowing generic prolog failures to be ignored would be bad. I was thinking all prolog failures would be treated as fatal (as they currently are) unless they specifically marked themselves as "optional" or something like that.
D'oh, I misunderstood the use case, sorry! :facepalm:
This will be slightly less complicated to support -- the determination of which nodes "failed" could be contained in the coral2 jobtap plugin, but we'll still need some way to communicate to the subinstance which ranks/hosts failed and to drain them.
There is a similar request in #6799, though that one requires a way to force nodes down while a job is already running. Maybe the non-fatal job exception approach suggested by @garlick can work in both cases. I think we'll need some way to add structured data to an exception; a job or subinstance could then look for this specific type of exception and take action.
I don't think job exceptions currently have any place to put extra data, so maybe some RFC work would be required first?
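In the meantime, just to illustrate what "structured data" could look like, one stopgap (not an existing convention, and the `node-failure` type and note layout below are invented) would be to pack JSON into the exception note and parse it back out from the eventlog `exception` event on the other side:

```python
import json

# Hypothetical exception type and note layout -- neither is defined in
# any RFC today; this only illustrates what structured data might look like.
EXCEPTION_TYPE = "node-failure"

def make_exception_note(failed_hosts):
    """Serialize the failed host list into a JSON exception note."""
    return json.dumps({"failed_hosts": failed_hosts})

def parse_exception_event(event):
    """Given a decoded eventlog 'exception' event, return the failed hosts
    if it carries our hypothetical structured note, else None."""
    ctx = event.get("context", {})
    if event.get("name") != "exception" or ctx.get("type") != EXCEPTION_TYPE:
        return None
    try:
        return json.loads(ctx.get("note", ""))["failed_hosts"]
    except (ValueError, KeyError):
        return None

if __name__ == "__main__":
    event = {
        "timestamp": 0.0,
        "name": "exception",
        "context": {
            "type": EXCEPTION_TYPE,
            "severity": 7,  # nonfatal severity, value chosen for the example
            "note": make_exception_note(["node12", "node13"]),
        },
    }
    print(parse_exception_event(event))  # ['node12', 'node13']
```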
While we determine a way to do this generally in Flux, perhaps a kind of proof of concept or stopgap solution fully implemented in flux-coral2 could work for https://github.com/flux-framework/flux-coral2/issues/364:
- User would specify (somehow) the number of nodes on which the flux-coral2 prolog is allowed to fail, with the default being 0.
- The flux-coral2 jobtap plugin or whatever manages the prolog action would not mark the prolog as failed if the number of failures is less than or equal to the specified count.
- If some ranks failed, the flux-coral2 plugin would post an event to the job's eventlog with the set of failed nodes.
- flux-coral2 provides an rc1 init script installed to `rc1.d` that checks for the event above and drains the affected nodes if found (a rough sketch follows below). Since the initial program doesn't launch until rc1 is complete, this would ensure affected nodes are drained before the user workload is started.
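To make that rc1 idea a bit more concrete, here's a sketch (in Python rather than a shell rc script, just for readability). The `coral2-prolog-nonfatal` event name and its `hosts` context key are made up for this example, and it assumes the subinstance's `jobid` broker attribute, `flux --parent`, `flux job eventlog --format=json`, and `flux resource drain` behave as expected:

```python
#!/usr/bin/env python3
"""Sketch of an rc1 task that drains nodes flagged by a (hypothetical)
prolog-failure event in the enclosing job's eventlog.

The "coral2-prolog-nonfatal" event and its "hosts" context key (assumed
to be a hostlist string) do not exist today; they are placeholders for
whatever flux-coral2 would post.
"""
import json
import subprocess
import sys

EVENT_NAME = "coral2-prolog-nonfatal"  # hypothetical event name

def flux(*args):
    """Run a flux(1) subcommand and return its stdout."""
    return subprocess.run(
        ["flux", *args], check=True, capture_output=True, text=True
    ).stdout

def main():
    # The jobid broker attribute is set when this instance runs as a job.
    jobid = flux("getattr", "jobid").strip()

    # Read this job's eventlog from the enclosing (parent) instance.
    eventlog = flux("--parent", "job", "eventlog", "--format=json", jobid)

    for line in eventlog.splitlines():
        event = json.loads(line)
        if event.get("name") != EVENT_NAME:
            continue
        hosts = event.get("context", {}).get("hosts")
        if hosts:
            # Drain the affected nodes in this subinstance; rc1 has not
            # completed yet, so the initial program hasn't started.
            flux("resource", "drain", hosts,
                 "prolog failure reported by enclosing instance")
            print(f"drained {hosts} due to {EVENT_NAME}", file=sys.stderr)

if __name__ == "__main__":
    main()
```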
The more I think about a generic solution, the trickier it seems, since there could be multiple subsystems where a failure could occur, and perhaps different levels of resiliency for each, so maybe a temporary solution for now will let us ponder how this could work more holistically.
It would also be interesting to eventually support actual node failures during a job launch. AFAIK, this isn't possible now because PMI requires every node to be up before it can complete.