flux-core

idea: allow compute nodes to be released by a job before the `clean` event

Open grondo opened this issue 1 year ago • 2 comments

I couldn't find the exact discussion I was thinking of in a meeting today regarding early release of resources during epilog actions.

Perhaps it was related to #5774 but that idea was a bit out there.

The specific use case is data movement tasks that do not actually utilize compute resources, but use an epilog-start event to prevent the clean event so that the job does not become inactive until data movement has finished. This approach has the side effect of holding back all compute resources, since resources are not handed over to housekeeping until all epilog actions have completed.

In short, an epilog-start event takes a reference on all compute resources for the job, released by the corresponding epilog-finish event.

Perhaps a simple idea would be to enhance the epilog-start event so that it can optionally take a reference on a subset of resources. For now, an optional ranks key in the event context could indicate that this epilog action only takes a reference on the included ranks instead of "all". For an epilog that runs on off-node resources (like the rabbits perhaps), an empty idset, e.g. ranks="", could indicate that all compute resources can be handed off to housekeeping, while an epilog that only requires rank 0 of the job could specify ranks="0".
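To make the proposed semantics concrete, here is a minimal sketch in plain Python. It is a toy model only, not flux-core code: the `EpilogRefs` class, its method names, and the simplified `parse_idset` helper are all hypothetical and exist purely for illustration. It assumes a missing ranks key means "all ranks of the job" (the current behavior described above) and that ranks are plain integers.

```python
from collections import Counter


def parse_idset(s):
    """Parse a simplified idset string such as "0-2,5" into a set of ints.

    Hypothetical helper: it only handles comma-separated ids and ranges.
    """
    ranks = set()
    for part in s.split(","):
        part = part.strip()
        if not part:
            continue  # an empty string yields the empty set
        if "-" in part:
            lo, hi = part.split("-")
            ranks.update(range(int(lo), int(hi) + 1))
        else:
            ranks.add(int(part))
    return ranks


class EpilogRefs:
    """Toy model of per-rank references taken by epilog actions."""

    def __init__(self, job_ranks):
        self.job_ranks = set(job_ranks)
        self.refs = Counter()  # rank -> number of epilog actions holding it
        self.held = {}         # epilog name -> ranks that epilog holds

    def epilog_start(self, name, context=None):
        context = context or {}
        if "ranks" in context:
            # Proposed: take a reference only on the listed ranks.
            ranks = parse_idset(context["ranks"]) & self.job_ranks
        else:
            # Assumed default: no ranks key means hold every rank of the job.
            ranks = set(self.job_ranks)
        self.held[name] = ranks
        for r in ranks:
            self.refs[r] += 1

    def epilog_finish(self, name):
        # Drop the references taken by the matching epilog_start.
        for r in self.held.pop(name):
            self.refs[r] -= 1

    def releasable(self):
        """Ranks with no outstanding references, eligible for housekeeping."""
        return {r for r in self.job_ranks if self.refs[r] == 0}


job = EpilogRefs(range(4))
job.epilog_start("data-movement", {"ranks": ""})  # off-node epilog holds nothing
job.epilog_start("node-cleanup", {"ranks": "0"})  # only rank 0 is held back
print(sorted(job.releasable()))
```

In this model, ranks 1 through 3 become eligible for housekeeping immediately, while rank 0 is held until the "node-cleanup" action finishes.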

grondo • Aug 21 '24 22:08

The idea of releasing some execution targets back to the job-manager while an epilog action is still in progress was brought up again in today's meeting.

In addition to allowing an epilog-start event to take a reference on a subset of ranks, we may also need to allow an epilog to release a subset of ranks.
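As a rough illustration of the second variant, the sketch below models a single epilog action that starts out holding every rank and gives some back before it finishes. This is a toy model under stated assumptions: the `EpilogHold` class and its `release` method are hypothetical, and nothing in this thread defines an event or API for partial release.

```python
class EpilogHold:
    """Toy model: one epilog action holding a shrinking set of ranks."""

    def __init__(self, job_ranks):
        # Models epilog-start with no ranks key: hold all ranks of the job.
        self.held = set(job_ranks)

    def release(self, ranks):
        """Hypothetical partial release: free some ranks while still running.

        Returns the ranks that were actually freed by this call.
        """
        freed = self.held & set(ranks)
        self.held -= freed
        return freed

    def finish(self):
        """Models epilog-finish: free whatever is still held."""
        freed, self.held = self.held, set()
        return freed


hold = EpilogHold(range(4))
early = hold.release({1, 2, 3})  # compute ranks no longer needed
late = hold.finish()             # the remaining rank is freed at the end
print(sorted(early), sorted(late))
```

The difference from taking a reference on a subset up front is that the epilog does not need to know at start time which ranks it can give up.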

@jameshcorbett: Is the required feature for the dws-epilog to be able to release a subset of ranks, or just take a reference on a subset of ranks, or a mix of both? I.e. how do you envision this working?

grondo • Mar 12 '25 23:03

I think we should circle back to this issue after the resolution of https://github.com/flux-framework/flux-coral2/issues/321, since it's possible the resolution of that issue will be enough.

But FWIW I think the required feature is that the dws-epilog be able to release a subset of ranks, not to take a reference on a subset of ranks.

jameshcorbett • Mar 13 '25 00:03