flux-core

more detailed task exit status reporting

Open grondo opened this issue 8 months ago • 0 comments

Users have reported that flux-job attach does not give enough detail when a job fails. Currently, we report only:

flux-job: task(s) exited with exit code 1

However, this message does not say how many tasks failed or which ones, so the case where only one task exited with a nonzero status is not distinguishable. Likewise, if tasks dump core or segfault, the affected tasks are not identified. In both cases, it would be useful to also include the affected hostnames if possible, so that users and admins can quickly identify bad hosts.

This may require stashing a compact aggregate representation of the exit status of all tasks in the KVS or the eventlog (though this may be too large for the eventlog). In combination with the job taskmap and assigned hostlist, this could allow users and the flux job attach command to create a detailed summary of how every task in a job exited.
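To make the idea concrete, here is a rough Python sketch of how such a summary might be assembled. It is not based on any existing flux-core code or API: `compress_ranks`, `describe`, and `summarize` are hypothetical helpers, the per-task input is assumed to be a list of raw POSIX wait statuses indexed by task rank, and the taskmap is simplified to a per-node task count with ranks assigned in contiguous blocks. The real taskmap and hostlist encodings would need to be substituted.

```python
import os
import signal
from collections import defaultdict


def compress_ranks(ranks):
    """Encode integer ranks as a compact range string such as '0-2,5'."""
    ranks = sorted(ranks)
    parts = []
    i = 0
    while i < len(ranks):
        j = i
        # Extend the current run while the next rank is consecutive.
        while j + 1 < len(ranks) and ranks[j + 1] == ranks[j] + 1:
            j += 1
        parts.append(str(ranks[i]) if i == j else f"{ranks[i]}-{ranks[j]}")
        i = j + 1
    return ",".join(parts)


def describe(status):
    """Turn a POSIX wait status into a short human-readable label."""
    if os.WIFSIGNALED(status):
        name = signal.Signals(os.WTERMSIG(status)).name
        core = " (core dumped)" if os.WCOREDUMP(status) else ""
        return f"killed by {name}{core}"
    if os.WIFEXITED(status):
        return f"exited with exit code {os.WEXITSTATUS(status)}"
    return f"unknown wait status {status}"


def summarize(statuses, hosts, tasks_per_node):
    """Group failed tasks by wait status and report ranks and hosts.

    Assumption: ranks are laid out in contiguous blocks, so node i runs
    tasks_per_node[i] consecutive ranks. This stands in for a real taskmap.
    """
    rank_host = [h for h, n in zip(hosts, tasks_per_node) for _ in range(n)]
    groups = defaultdict(list)  # wait status -> list of task ranks
    for rank, status in enumerate(statuses):
        if status != 0:
            groups[status].append(rank)
    lines = []
    for status, ranks in sorted(groups.items()):
        affected = sorted({rank_host[r] for r in ranks})
        lines.append(
            f"task(s) {compress_ranks(ranks)} on {','.join(affected)}: "
            f"{describe(status)}"
        )
    return lines


if __name__ == "__main__":
    # Hypothetical 4-task job on 2 nodes: rank 1 exits with code 1,
    # rank 3 is killed by SIGSEGV with a core dump.
    statuses = [0, 1 << 8, 0, signal.SIGSEGV | 0x80]
    for line in summarize(statuses, ["node0", "node1"], [2, 2]):
        print(line)
```

Grouping by status keeps the stored aggregate small when most tasks exit the same way, since each distinct status needs only one compressed rank set rather than one entry per task.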

grondo, Jun 24 '24 18:06