flux-core
flux-top: display CPU% utilization
A use case for flux top brought up by the AHA Moles team was monitoring job ensembles for CPU utilization, as an aid to tuning machine learning jobs.
This could be collected at the shell plugin level, with the rank 0 shell periodically posting an aggregate number as a job memo, which could then be accessed by flux jobs and flux top. The sample interval could default to some long period like a minute, and be tunable by shell option.
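To make the arithmetic concrete, here is a rough sketch of how a per-shell CPU% could be derived from two counter samples and then combined into the single number that rank 0 posts. Everything in it is an assumption for illustration: `CpuSample`, `cpu_percent`, and `aggregate` are hypothetical names, the counters are assumed to be cumulative busy/total ticks, and the aggregate is assumed to be a plain mean. None of this is existing flux-core API.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class CpuSample:
    """Cumulative CPU tick counters for one shell (hypothetical layout)."""
    busy: int   # ticks spent doing work since some fixed start
    total: int  # all ticks elapsed since the same start (busy + idle)


def cpu_percent(prev: CpuSample, curr: CpuSample) -> float:
    """Utilization over the interval between two samples, in percent."""
    elapsed = curr.total - prev.total
    if elapsed <= 0:
        # No time passed (or counters reset): report 0 rather than divide by zero.
        return 0.0
    return 100.0 * (curr.busy - prev.busy) / elapsed


def aggregate(per_shell: Iterable[float]) -> float:
    """Mean of the per-shell percentages (assumed aggregation; could be max, etc.)."""
    values = list(per_shell)
    return sum(values) / len(values) if values else 0.0
```

With a long sample interval such as a minute, each shell would only need to keep its previous sample around, and rank 0 would call something like `aggregate()` once per interval before posting the memo.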
One challenge for flux top as it is currently implemented is that it only queries job-list after job state change events are published. Maybe we could have flux top watch for certain kinds of activity in the job manager journal instead? Or maybe job-list could provide a specialized streaming RPC for job monitoring tools.
> Maybe we could have flux top watch for certain kinds of activity in the job manager journal instead?
Is the journal accessible by guests?
I thought you had an idea for a multi-response RPC for job-list which would only reply on updates. That might be a bit challenging to implement, though.
> One challenge for flux top as it is currently implemented is that it only queries job-list after job state change events are published.
Would it be so bad to just query job-list every N seconds for now until a better solution is implemented?
Good point about journal permission!
I edited my description to include the job-list idea concurrently with your comment. Sorry about that.
> Would it be so bad to just query job-list every N seconds for now until a better solution is implemented?
Yeah that would probably be fine for a first cut.
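A minimal sketch of what that first cut might look like, assuming a plain timer-driven poll that only triggers a redraw when the result changes. `poll_changes` and `fetch` are hypothetical: `fetch` stands in for whatever query flux top makes to job-list, and `max_polls` and `sleep` exist only so the loop can be exercised without a running instance.

```python
import time
from typing import Any, Callable, Iterator, Optional


def poll_changes(
    fetch: Callable[[], Any],
    interval: float,
    max_polls: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[Any]:
    """Call fetch() every `interval` seconds, yielding only when the result changes."""
    previous = None
    first = True
    polls = 0
    while max_polls is None or polls < max_polls:
        if polls > 0:
            # Wait between polls, but not before the very first one.
            sleep(interval)
        current = fetch()
        polls += 1
        if first or current != previous:
            # First result, or something changed (e.g. an updated memo): report it.
            yield current
            previous = current
            first = False
```

The comparison against the previous result keeps the display from redrawing on every tick, which matters if N ends up being small. A streaming RPC from job-list could later replace the loop without changing what the consumer sees.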