flux-core

design job notification system

Open grondo opened this issue 3 years ago • 3 comments

Currently there is no design for this feature. Possibilities include:

  • a simple job shell plugin for email notification. This may not be able to send a notification in all cases, such as job failure or cancellation
  • jobtap plugin which optionally enables email notification
  • a Python program that enables configurable notification for jobs. This program can watch all submitted jobs, check for a notification specification in the jobspec, and then use existing Python modules to implement email and other notifications. If the program registers a service within the instance, it could also let users configure notification separately instead of requiring per-job options. (I had thought there was an open issue on this one, but can't find it at the moment.)
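For the third option, a minimal sketch of the jobspec check might look like the following. It assumes a hypothetical `notify` key under `attributes.user` in the jobspec (this is not an existing Flux option), and `notification_spec` and `build_email` are made-up helper names. The event names are illustrative, and the message is only built here, not sent.

```python
from email.message import EmailMessage


# Assumption: notification settings live under attributes.user.notify.
# This key is hypothetical and used only for illustration.
def notification_spec(jobspec):
    """Return the notify dict from a jobspec, or None if absent."""
    attributes = jobspec.get("attributes") or {}
    user = attributes.get("user") or {}
    spec = user.get("notify")
    return spec if isinstance(spec, dict) else None


def build_email(jobid, event, spec, sender="flux-notify@localhost"):
    """Build (but do not send) an email if the spec asks for this event."""
    if event not in spec.get("events", []):
        return None
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = spec["email"]
    msg["Subject"] = f"job {jobid}: {event}"
    msg.set_content(f"Job {jobid} reached event '{event}'.")
    return msg


# Example jobspec fragment carrying the hypothetical notify settings.
jobspec = {
    "attributes": {
        "user": {
            "notify": {
                "email": "user@example.com",
                "events": ["finish", "exception"],
            }
        }
    }
}
spec = notification_spec(jobspec)
message = build_email("f123", "finish", spec) if spec else None
```

The resulting `EmailMessage` could then be handed to `smtplib.SMTP(...).send_message(message)`. Other backends (chat bots, etc.) would only need a different builder keyed off the same spec.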

grondo, Jul 26 '22 18:07

> a Python program that enables configurable notification for jobs.

This may be a good candidate for an independent framework project?

Also, it would be nice if it could be started on any node, since in some cluster environments, some nodes have better external network connectivity than others.

garlick, Jul 27 '22 19:07

I like option 3 the best (a Python program watching jobs) because it could be configured to notify via other services as well, like a Slack/Teams bot, or to send text messages (although you could also send a text message by just specifying a shorter format and sending it as an email, if the user specified their cell provider upon registration).

My concern is that having a background service that's constantly polling all available jobs, or polling at specified intervals, is too heavy for something that only a few users would be using. Would option 2 be lighter? Or am I thinking about this the wrong way?

I found this project that's a Slurm extension that I'm going to play around with to see if it's similar to the approach we want.

wihobbs, Sep 07 '23 17:09

> My concern is that having a background service that's constantly polling all available jobs, or polling at specified intervals, is too heavy for something that only a few users would be using. Would option 2 be lighter? Or am I thinking about this the wrong way?

My feeling is that one centralized program that is doing this monitoring is not going to be high impact. Some tricky aspects of an external service might be:

  • ensuring notifications are not duplicated when the service is restarted
  • polling in an efficient manner, i.e. just asking for what has been updated since the last poll interval
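Both points could be handled with a small amount of persistent state. A rough sketch, assuming a hypothetical `fetch_updates(since)` callable that yields `(jobid, event, timestamp)` tuples newer than `since` and a hypothetical `notify(jobid, event)` callable; neither is an existing Flux API, and `NotificationLog` and `poll_once` are illustrative names:

```python
import sqlite3


class NotificationLog:
    """Persistent record of sent notifications plus the last poll position."""

    def __init__(self, path):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS sent ("
            "jobid TEXT, event TEXT, PRIMARY KEY (jobid, event))"
        )
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS poll_state ("
            "id INTEGER PRIMARY KEY CHECK (id = 1), last_seen REAL)"
        )
        # Single-row table; the row is created once and kept across restarts.
        self.db.execute("INSERT OR IGNORE INTO poll_state VALUES (1, 0.0)")
        self.db.commit()

    def last_seen(self):
        row = self.db.execute("SELECT last_seen FROM poll_state").fetchone()
        return row[0]

    def claim(self, jobid, event):
        """Return True only the first time (jobid, event) is recorded."""
        cur = self.db.execute(
            "INSERT OR IGNORE INTO sent VALUES (?, ?)", (jobid, event)
        )
        self.db.commit()
        return cur.rowcount == 1

    def advance(self, timestamp):
        """Move the poll position forward, never backward."""
        self.db.execute(
            "UPDATE poll_state SET last_seen = MAX(last_seen, ?)", (timestamp,)
        )
        self.db.commit()


def poll_once(log, fetch_updates, notify):
    """Ask only for updates newer than the stored position, skipping repeats."""
    for jobid, event, timestamp in fetch_updates(log.last_seen()):
        if log.claim(jobid, event):
            notify(jobid, event)
        log.advance(timestamp)
```

Because the event is recorded before `notify` is called, a crash between the two steps would drop that notification rather than duplicate it. Recording after sending would flip the trade-off to at-least-once delivery.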

Note that we have at least two examples of external services that are started under systemd for the system instance (the flux-accounting database service and the flux-coral2 dws service). However, in this case a user may also want notifications for jobs running inside a batch job, so that's something else to consider.

grondo, Sep 07 '23 18:09