
differentiate between FAILED and NODE_FAIL

Open ryanday36 opened this issue 11 months ago • 1 comments

The 'result' field in the flux jobs output lists four possible results: COMPLETED, FAILED, CANCELLED, TIMEOUT. HPE folks would like to distinguish job failures caused by user code issues from job failures caused by nodes failing. #6021 and #6032 talk about improving error messages when brokers lose contact. Is there a way to get at that in the flux jobs output?

ryanday36 avatar Feb 20 '25 00:02 ryanday36

Details of a fatal job exception are available in the exception.* fields described in flux-jobs(1). Relevant bits are pasted here:

exception.occurred

    True or False if job had an exception, empty string otherwise
exception.severity

    If exception.occurred True, the highest severity, empty string otherwise
exception.type

    If exception.occurred True, the highest severity exception type, empty string otherwise
exception.note

    If exception.occurred True, the highest severity exception note, empty string otherwise

e.g.:

$ flux jobs -Af failed -no '{id.f58:<12} {exception.type:<25} {exception.note}' | grep node-failure
fdFmCpYwQZ5  node-failure              node failure on elcap8767 (shell rank 1127)
fcod73yYa6b  node-failure              node failure on elcap8659 (shell rank 83)

Some caveats to consider: Currently, I think job-list captures only one exception (the most recent and severe?). Some node failures do not cause a fatal job exception, e.g. if the failure occurs on a non-critical rank, and a job could get several of these. All exceptions, regardless of severity, appear in the job eventlog and would therefore be captured by job manager journal consumers.
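
Since the eventlog holds every exception, one way to see them all is to scan it directly. Below is a minimal sketch, assuming the eventlog is available as one JSON object per line with `name` and `context` keys (the shape I believe `flux job eventlog --format=json JOBID` emits). The function name and the sample events are hypothetical and made up for illustration, not taken from a real job.

```python
import json


def exceptions_from_eventlog(lines):
    """Return every exception event in a job eventlog, in order of appearance.

    Assumes each non-empty line is a JSON object with "timestamp", "name",
    and an optional "context" dict (an assumption about the eventlog format).
    """
    found = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        if event.get("name") != "exception":
            continue
        ctx = event.get("context", {})
        found.append(
            {
                "timestamp": event.get("timestamp"),
                "type": ctx.get("type"),
                "severity": ctx.get("severity"),
                "note": ctx.get("note", ""),
            }
        )
    return found


# Hypothetical eventlog: a non-fatal node failure followed by a fatal
# exception of a different type. All values here are invented.
SAMPLE_EVENTLOG = [
    '{"timestamp": 100.0, "name": "submit", "context": {"userid": 1000}}',
    '{"timestamp": 130.0, "name": "exception", "context": '
    '{"type": "node-failure", "severity": 2, "note": "hypothetical node12 lost"}}',
    '{"timestamp": 160.0, "name": "exception", "context": '
    '{"type": "exec", "severity": 0, "note": "hypothetical task failure"}}',
]

if __name__ == "__main__":
    for exc in exceptions_from_eventlog(SAMPLE_EVENTLOG):
        print(exc["timestamp"], exc["type"], exc["severity"], exc["note"])
```

Unlike the exception.* fields from job-list, this keeps the lower-severity exceptions too, so a tool consuming the eventlog could report node failures that did not end the job.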

We could consider adding a new job result like Slurm's NODE_FAIL. However, one or more nodes could be lost while the user job continues running and then fails for another reason, so there may not be a clean way to do it in Flux. That is, just because a node failed during a job does not imply that the job failed because of it.
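
To make that distinction concrete, here is a rough sketch of how a site tool might derive a NODE_FAIL-style label on its own. It is a heuristic under stated assumptions, not anything Flux provides: `derived_result` and the `NODE_FAIL` label are hypothetical, and it assumes severity 0 marks a fatal exception and that node failures carry the `node-failure` type seen in the output above.

```python
def derived_result(result, exceptions):
    """Map a job result plus its exceptions to a site-local label.

    result:     the job result string, e.g. "FAILED"
    exceptions: list of dicts with "type" and "severity" keys
    Returns "NODE_FAIL" (a hypothetical, non-Flux label) only when a fatal
    exception (assumed to be severity 0) has type "node-failure".
    """
    if result != "FAILED":
        return result
    for exc in exceptions:
        if exc.get("severity") == 0 and exc.get("type") == "node-failure":
            return "NODE_FAIL"
    # A non-fatal node failure alone does not show the job failed because of it.
    return "FAILED"


if __name__ == "__main__":
    fatal_node = [{"type": "node-failure", "severity": 0}]
    survived_node = [
        {"type": "node-failure", "severity": 2},
        {"type": "exec", "severity": 0},
    ]
    print(derived_result("FAILED", fatal_node))
    print(derived_result("FAILED", survived_node))
```

The second case is the awkward one: the job lost a node, kept running, and then failed for another reason, so the sketch leaves it as FAILED.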

grondo avatar Feb 20 '25 15:02 grondo