flux-core

Feature request: ability to modify or overwrite undrain messages

Open kkier opened this issue 2 months ago • 4 comments

TLDR: I'd like to be able to change or update the undrain messages for nodes.

We have a reporting workflow right now that uses drain and undrain messages to categorize failures. Briefly, I create a composite event that starts at the time a node is drained and ends at the time it is undrained, with a "description" based on the last drain or undrain message. The idea is that as nodes are triaged, we update the drain message, and then eventually undrain with a message indicating what was done to resolve the issue.
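To make the workflow concrete, here is a rough Python sketch of how such composite events could be built. The event shape (dicts with `timestamp`, `name`, `ranks`, `reason`) and the `Composite` and `composite_events` names are made up for illustration; this is not the actual resource eventlog format or the real reporting code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Composite:
    rank: int
    start: float            # time of the drain that opened the event
    end: Optional[float]    # time of the undrain, None while still drained
    description: str        # last drain or undrain message seen


def composite_events(events):
    """Build one composite event per drain/undrain cycle of each rank.

    `events` is an iterable of dicts in a simplified, hypothetical shape:
    {"timestamp": float, "name": "drain" | "undrain",
     "ranks": [int, ...], "reason": str or None}
    """
    open_events = {}   # rank -> Composite that is still open
    closed = []
    for ev in sorted(events, key=lambda e: e["timestamp"]):
        for rank in ev["ranks"]:
            cur = open_events.get(rank)
            if ev["name"] == "drain":
                if cur is None:
                    # First drain opens a new composite event.
                    open_events[rank] = Composite(
                        rank, ev["timestamp"], None, ev.get("reason") or "")
                elif ev.get("reason"):
                    # A later drain on a drained rank is a triage update.
                    cur.description = ev["reason"]
            elif ev["name"] == "undrain" and cur is not None:
                cur.end = ev["timestamp"]
                if ev.get("reason"):
                    cur.description = ev["reason"]
                closed.append(open_events.pop(rank))
    # Ranks that are still drained are returned with end=None.
    return closed + list(open_events.values())


events = [
    {"timestamp": 1, "name": "drain", "ranks": [0, 1], "reason": "unknown"},
    {"timestamp": 2, "name": "drain", "ranks": [0], "reason": "bad dimm"},
    {"timestamp": 3, "name": "undrain", "ranks": [0], "reason": "replaced dimm"},
]
result = composite_events(events)
```

Note that in this sketch an undrain event on a rank with no open composite event is silently dropped, which is exactly where a retroactive correction gets lost.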

This falls apart when nodes get accidentally undrained, when underlying causes are found afterwards, you name it. Thus, it'd be handy to be able to retroactively modify an undrain message to keep the information accurate.

Naturally, half of me types this and cringes because we have a (nominally?) immutable record and I'm talking about, err, muting it. A potential approach (suggested by @grondo earlier in slack) would be to have some kind of "update" event linked to the undrain event that allows for adding another message but doesn't change the original. I could even do the implementation in a wrapper for my reporting with a separate database of override messages, but the overall hack-ness of that idea makes me cry just thinking about it.

kkier avatar Oct 24 '25 17:10 kkier

Well, I don't actually see anything in RFC 44 that says an undrain event can only be posted for ranks that were previously drained, so one idea would be to add a new "mode" to the undrain RPC like "update" which simply posts an extra undrain event to the eventlog with the new reason. A tool would then have to process the entire resource eventlog to be sure the most updated undrain reason event for any given rank has been processed (instead of assuming the first undrain event after the corresponding drain event has the correct reason).
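As a sketch of what that tool-side processing might look like (Python, with a simplified event shape of `timestamp`, `name`, `ranks`, `reason` that is an assumption for the example and not the RFC 44 format; `final_undrain_reasons` is a hypothetical name):

```python
def final_undrain_reasons(events):
    """Map (rank, drain timestamp) -> most recent undrain reason.

    Scans the whole log so that an extra undrain event posted for an
    already-undrained rank (the proposed "update" mode) replaces the
    reason recorded by the first undrain of that drain cycle.
    """
    drained = set()     # ranks currently drained
    last_drain = {}     # rank -> timestamp of the drain that began the cycle
    reasons = {}        # (rank, drain timestamp) -> latest undrain reason
    for ev in sorted(events, key=lambda e: e["timestamp"]):
        for rank in ev["ranks"]:
            if ev["name"] == "drain":
                if rank not in drained:
                    # Only a drain of an undrained rank starts a new cycle.
                    drained.add(rank)
                    last_drain[rank] = ev["timestamp"]
            elif ev["name"] == "undrain":
                drained.discard(rank)
                if rank not in last_drain:
                    continue  # undrain with no earlier drain in the log
                key = (rank, last_drain[rank])
                reason = ev.get("reason")
                # A later undrain with a reason overrides the earlier one.
                if reason is not None or key not in reasons:
                    reasons[key] = reason
    return reasons


log = [
    {"timestamp": 1, "name": "drain", "ranks": [0], "reason": "node down"},
    {"timestamp": 2, "name": "undrain", "ranks": [0], "reason": "oops, accidental"},
    {"timestamp": 5, "name": "undrain", "ranks": [0], "reason": "replaced PSU"},
    {"timestamp": 6, "name": "drain", "ranks": [0], "reason": "flaky nic"},
    {"timestamp": 7, "name": "undrain", "ranks": [0], "reason": None},
]
reasons = final_undrain_reasons(log)
```

The trade-off is the one described above: the tool can no longer stop at the first undrain after a drain, it has to read to the end of the log before trusting any reason.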

There may also be work required in the resource module to ignore undrain events for ranks that are already undrained.

Part of me does wonder if an eventlog is the right tool for this job, but the above would be fairly straightforward 🤷

I'd like to get @garlick's feedback when he's back from vacation.

grondo avatar Oct 24 '25 19:10 grondo

Short term, yeah, I think a new undrain mode seems OK.

Longer term it does feel like we have a design issue here. See also

  • #7135
  • #6624
  • #6477
  • #6132

etc

garlick avatar Nov 03 '25 16:11 garlick

Looking at @grondo's short term idea, it would appear that we just want flux resource undrain --force to work like flux resource drain --force, which is:

"If any of targets are already drained, do not fail. Overwrite the original drain reason. When --force is specified twice, the original drain timestamp is also overwritten."

Does that sound right?
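To spell out the semantics being proposed, here is a toy Python model. It is not flux-core code: the `DrainState` class and its methods are hypothetical, and it assumes that undrain without force rejects targets that are not drained and that a doubled force on undrain would overwrite the timestamp the same way it does for drain.

```python
class DrainState:
    """Toy model of per-rank drain/undrain records with --force semantics."""

    def __init__(self):
        self.drained = {}     # rank -> (timestamp, reason)
        self.undrained = {}   # rank -> (timestamp, reason) of last undrain

    def drain(self, rank, reason, now, force=0):
        if rank in self.drained:
            if force == 0:
                raise ValueError(f"rank {rank} already drained")
            # force=1 keeps the original timestamp, force>=2 overwrites it.
            ts, _ = self.drained[rank]
            self.drained[rank] = (now if force >= 2 else ts, reason)
        else:
            self.drained[rank] = (now, reason)

    def undrain(self, rank, reason, now, force=0):
        if rank in self.drained:
            # Normal case: the rank leaves the drained set.
            del self.drained[rank]
            self.undrained[rank] = (now, reason)
            return
        if force == 0:
            raise ValueError(f"rank {rank} not drained")
        # Proposed behavior: overwrite the reason of the previous undrain.
        if rank in self.undrained and force < 2:
            ts, _ = self.undrained[rank]
        else:
            ts = now
        self.undrained[rank] = (ts, reason)


state = DrainState()
state.drain(0, "bad fan", now=10)
state.undrain(0, "accidental", now=20)
state.undrain(0, "replaced fan", now=30, force=1)
after_single_force = state.undrained[0]   # timestamp kept, reason replaced
state.undrain(0, "replaced fan and psu", now=40, force=2)
after_double_force = state.undrained[0]   # timestamp and reason replaced
```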

garlick avatar Nov 03 '25 22:11 garlick

Should this have been closed by #7187? We may want to set up a tracking issue or similar for the long term issues @garlick mentioned above.

grondo avatar Nov 12 '25 16:11 grondo