
write job statistics to guest namespace on instance exit

Open garlick opened this issue 1 year ago • 2 comments

Problem: we don't have much information about what happens in Flux instances launched by the system instance, unlike Slurm, which tracks job steps centrally.

Following up on a discussion in today's meeting about finding out what sort of "workflows" people are running: Could we have a flux instance write a summary object of some sort to the guest KVS directory in its enclosing instance?

A tool could summarize those objects at the system instance level.

For extra credit they could be made hierarchical, e.g. the object would include a summary of the objects written by sub-instances in their KVS.
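A rough sketch of what such a hierarchical summary could look like, in plain Python. Everything here is an assumption for illustration: the function name `summarize_instance`, the field names (`njobs`, `ntasks`, `by_state`, `depth`), and the input job records are hypothetical, and no Flux API is called. It only shows how an instance might roll the summaries of its sub-instances into its own before writing the result to the enclosing instance.

```python
import json


def summarize_instance(jobs, child_summaries=()):
    """Build a summary object for one instance (hypothetical schema).

    jobs: iterable of dicts with assumed keys "state" and "ntasks".
    child_summaries: summary objects already produced by sub-instances.
    """
    summary = {"version": 1, "njobs": 0, "ntasks": 0, "by_state": {}, "depth": 0}

    # Count the jobs run directly in this instance.
    for job in jobs:
        summary["njobs"] += 1
        summary["ntasks"] += job.get("ntasks", 0)
        state = job.get("state", "unknown")
        summary["by_state"][state] = summary["by_state"].get(state, 0) + 1

    # Roll up the totals reported by each sub-instance.
    for child in child_summaries:
        summary["njobs"] += child["njobs"]
        summary["ntasks"] += child["ntasks"]
        for state, count in child["by_state"].items():
            summary["by_state"][state] = summary["by_state"].get(state, 0) + count
        # depth records how many levels of nesting exist below this instance.
        summary["depth"] = max(summary["depth"], child["depth"] + 1)

    return summary


# A sub-instance that ran two jobs, nested inside a batch job that ran one.
leaf = summarize_instance(
    [{"state": "COMPLETED", "ntasks": 4}, {"state": "FAILED", "ntasks": 2}]
)
top = summarize_instance([{"state": "COMPLETED", "ntasks": 1}], [leaf])

# The serialized object is what would be stored in the guest KVS directory.
print(json.dumps(top, sort_keys=True))
```

Because each level only needs the summaries of its direct children, a system-level tool could read one object per job and still see totals for the whole tree.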

FWIW, RFC 16 governs the content of the job KVS directory.

garlick, May 09 '24 02:05

This is going to be quite important on systems where most or all jobs run at the system instance level are batch or alloc jobs, because most of the detail about the jobs the user actually ran is currently lost when the batch job exits, unless the user thought to use --dump on the command line.

For example, it would be nice to be able to use flux job taskmap on a parallel job run within a batch job, or even to access job eventlogs. Perhaps we can persist some of this information in the KVS in such a way that some of these commands can work naturally.

grondo, Jun 07 '24 20:06