flux-core
more detailed task exit status reporting
Users have reported that flux-job attach does not provide enough detail when a job fails. Currently we report only:
flux-job: task(s) exited with exit code 1
However, this message does not say how many tasks failed or which ones. If only one task exited with a nonzero status, that is not reported, and if tasks dump core or segfault, the affected tasks are not identified. In both cases it would be useful to include the affected hostnames where possible, so that users and admins can quickly identify bad hosts.
This may require stashing a compact aggregate representation of the exit status of all tasks in the KVS or the eventlog (though it may be too large for the eventlog). Combined with the job taskmap and assigned hostlist, this could allow users and the flux job attach command to produce a detailed summary of how every task in a job exited.
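To make the idea concrete, here is a rough sketch in plain Python of what such an aggregate summary might look like. It is not flux-core code and uses no Flux APIs: the helper names (compact_ranges, describe, summarize), the input shapes (one wait status and one hostname per task rank), and the output wording are all hypothetical. It also assumes the traditional Linux wait status encoding (low 7 bits = signal, 0x80 = core dump flag, next byte = exit code). In practice Flux's own idset and hostlist encodings would presumably replace the ad hoc range strings used here.

```python
from collections import defaultdict


def compact_ranges(ids):
    """Encode integers as a compact range string, e.g. [0, 1, 2, 5] -> "0-2,5"."""
    parts = []
    start = prev = None
    for i in sorted(ids):
        if start is None:
            start = prev = i
        elif i == prev + 1:
            prev = i
        else:
            parts.append(str(start) if start == prev else f"{start}-{prev}")
            start = prev = i
    if start is not None:
        parts.append(str(start) if start == prev else f"{start}-{prev}")
    return ",".join(parts)


def describe(status):
    """Describe a wait status (assumes the traditional Linux encoding)."""
    sig = status & 0x7F
    if sig:
        core = " (core dumped)" if status & 0x80 else ""
        return f"killed by signal {sig}{core}"
    return f"exit code {(status >> 8) & 0xFF}"


def summarize(statuses, task_hosts):
    """Group task ranks by outcome and report failed tasks with their hosts.

    statuses:   wait status per task rank (hypothetical input shape)
    task_hosts: hostname per task rank, as could be derived from the
                taskmap and assigned hostlist (hypothetical input shape)
    """
    groups = defaultdict(list)
    for rank, status in enumerate(statuses):
        groups[describe(status)].append(rank)

    lines = []
    for outcome, ranks in groups.items():
        if outcome == "exit code 0":
            continue  # only report abnormal exits
        hosts = sorted({task_hosts[r] for r in ranks})
        lines.append(
            f"task(s) {compact_ranges(ranks)} on {','.join(hosts)}: {outcome}"
        )
    return lines


if __name__ == "__main__":
    # Four tasks on two hosts: task 1 exits 1, task 3 segfaults with a core dump.
    statuses = [0, 1 << 8, 0, 11 | 0x80]
    task_hosts = ["node0", "node0", "node1", "node1"]
    for line in summarize(statuses, task_hosts):
        print(line)
```

Grouping by outcome first keeps the stored representation small when most tasks exit the same way, since each distinct outcome needs only one compact set of task ranks.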