idea: support a node failure reason for jobs that fail due to a node crash

Open grondo opened this issue 7 months ago • 5 comments

Problem: When a job is terminated due to a node crash, the user is notified with a generic "node failure on X" exception. Typically a sysadmin, or perhaps some scripts, will come along later and determine a root cause. Even if the node was drained, the sysadmin may or may not record the ultimate cause of the crash in the drain/undrain reason; there is no standard for recording that reason, and it can't easily be tied back to the affected job.

It would be nice if there were a method or tool that could give extra detail to the user/admins when a job gets a node-failure exception. I'm not sure exactly what this would look like. Perhaps a utility similar to undrain which could optionally set a reason when bringing a node back online, plus a tool to associate node-failure exceptions in jobs with those reasons.
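
As a rough illustration only (none of this tooling exists today), the association step might boil down to matching return-to-service records against a job's nodelist and failure time. The record format, field names, and matching window below are all invented for the sketch:

```python
from dataclasses import dataclass

# Hypothetical record a sysadmin (or automation) might create when returning
# a node to service; nothing in flux-core stores this today.
@dataclass
class ReturnToServiceRecord:
    hostname: str
    crash_time: float   # when the node was lost (epoch seconds)
    return_time: float  # when the node was undrained / returned to service
    reason: str         # root cause determined by the sysadmin or a script


def reasons_for_job(records, job_nodelist, job_failure_time, window=3600.0):
    """Return records that plausibly explain a node-failure exception:
    same host as the job, crash time near the job's failure time."""
    return [
        r
        for r in records
        if r.hostname in job_nodelist
        and abs(r.crash_time - job_failure_time) < window
    ]
```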

Just wanted to open an issue since the idea seemed interesting.

grondo avatar Apr 24 '25 17:04 grondo

One idea might be to expand the resource journal to include an annotation, when the sysadmin brings a node back, with the jobid of the last job run on that node. I'm not as familiar with the resource journal as with the job manager journal, so apologies if it already does that. There might be a gotcha there if the nodelist spans multiple jobs (and others I'm probably not thinking of).
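
To make the idea concrete, such an annotation could look something like an RFC 18 style eventlog entry. The event shape and the "reason"/"jobid" context fields here are purely hypothetical; the resource eventlog does not record them today:

```python
import json
import time

# Hypothetical "undrain" annotation as an RFC 18 style eventlog entry.
# The "reason" and "jobid" context fields are invented for illustration.
event = {
    "timestamp": time.time(),
    "name": "undrain",
    "context": {
        "idset": "17",                   # rank of the returned node
        "nodelist": "node17",
        "reason": "bad DIMM, replaced",  # root cause from the sysadmin
        "jobid": "f3Kx7pPmaqf",          # last job run on the node
    },
}
print(json.dumps(event))
```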

> Perhaps a utility similar to undrain which could optionally set a reason when bringing a node back online,

Were you thinking that Flux would have tooling for determining the node crash, or the sysadmins would record it manually?

wihobbs avatar Apr 24 '25 17:04 wihobbs

I think we'd need tooling. Sysadmins could either record the reason manually, or it could be done from a script, since the return of nodes to service is often automated. I honestly haven't thought too much about what this would look like, but indeed I was imagining use of the resource eventlog.

It would perhaps be most effective to allow a reason to be posted to the job eventlog, which could then be used by job-list etc, but the job eventlog is read-only after a clean event, so this would not work.

grondo avatar Apr 24 '25 17:04 grondo

> It would perhaps be most effective to allow a reason to be posted to the job eventlog, which could then be used by job-list etc, but the job eventlog is read-only after a clean event, so this would not work.

If the goal is notifying users that their job failure was due to a node crash (i.e. "it's not your fault your job failed"), then leaning on the job eventlog does feel right, although as you've said that won't work. Here's a possibly dumb idea, but what if we allowed events to be written directly to the job manager journal of events (bypassing the job's eventlog)? Things like the notification service and our Elastic logging infrastructure would then have direct, easy access to the node crash reason, and could pass this information on to the user.
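
On the consuming side, a rough sketch of what such a listener might look like, assuming the JournalConsumer class in recent flux-core Python bindings (the exact API may differ); the "node-crash-reason" event name is purely hypothetical, since nothing posts such an event today:

```python
import flux
from flux.job import JournalConsumer

# Sketch of a journal consumer (e.g. a notification service) that picks out
# node-failure related events as they stream past.
consumer = JournalConsumer(flux.Flux()).start()
while True:
    event = consumer.poll()
    if event.name == "exception" and "node failure" in event.context.get("note", ""):
        print(f"job {event.jobid}: node failure exception")
    elif event.name == "node-crash-reason":  # hypothetical bypass event
        print(f"job {event.jobid}: root cause: {event.context.get('reason')}")
```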

wihobbs avatar Apr 24 '25 18:04 wihobbs

I think I'm taking this conversation off track. @grondo, rereading your original comment, I think you wanted this to be a discussion about tooling for determining node crashes/returns to service, and I've made it about the journal.

The notification service could watch the resource eventlog for specific events relating to node crashes (maybe it could even associate nodes with jobs) and pass the info up to the user, and that wouldn't require any changes to the job manager journal of events.
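
For example, a one-shot scan (rather than a watch) of the resource eventlog can already be done with existing commands; the drain/undrain event names and context fields below reflect my reading of the current eventlog format and may need adjustment:

```python
import json
import subprocess

# Read the raw resource eventlog (RFC 18: one JSON object per line) and
# print drain/undrain activity.  `flux kvs get` is an existing command.
out = subprocess.run(
    ["flux", "kvs", "get", "resource.eventlog"],
    capture_output=True, text=True, check=True,
).stdout

for line in out.splitlines():
    event = json.loads(line)
    if event["name"] in ("drain", "undrain"):
        context = event.get("context", {})
        print(event["timestamp"], event["name"],
              context.get("idset"), context.get("reason", ""))
```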

wihobbs avatar Apr 24 '25 18:04 wihobbs

Yeah, I wasn't thinking of the notification service specifically here, but it's a good thought that a notification service could be extended in this manner. Root cause determination could occur days after a job completes, so there is that aspect to consider. I was thinking more of a use case such as:

A user sees that their job terminated due to a node failure. There should be a "tell me more" command. If the node was returned to service with a root cause determination and a "reason" logged somewhere, then perhaps this would be feasible. Currently, there's no reason associated with a node that crashes without being drained, and there's no command to find the correct reason, if it exists, that corresponds to the event that affected the job.
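
A very rough sketch of the front half of such a "tell me more" command, just to anchor the idea: reading the job eventlog uses existing commands, while the reason lookup at the end is the missing piece this issue is about (no such store or query exists today):

```python
import json
import subprocess
import sys

# Usage: tell-me-more.py JOBID
# Find node-failure exceptions in the job's eventlog, then (hypothetically)
# look up a root-cause reason recorded when the node was returned to service.
jobid = sys.argv[1]
out = subprocess.run(
    ["flux", "job", "eventlog", "--format=json", jobid],
    capture_output=True, text=True, check=True,
).stdout

for line in out.splitlines():
    event = json.loads(line)
    if event["name"] != "exception":
        continue
    note = event.get("context", {}).get("note", "")
    if "node failure" in note:
        print(f"{jobid}: {note}")
        # Hypothetical next step: query a (not yet existing) reason store
        # for the root cause associated with this node and timestamp, e.g.
        #   print(lookup_reason(note, event["timestamp"]))
```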

grondo avatar Apr 24 '25 22:04 grondo