flux-core
support flux "up vs down" module so whatsup can work with flux
I suddenly remember that we discussed this like 6 or 7 years ago. It would be nice if whatsup (https://github.com/chaos/whatsup) could determine node up vs. down status via flux, since flux will be sending around a heartbeat very similarly to other mechanisms that whatsup works with (ganglia, cerebro). On systems with TOSS this may allow us to remove the need for cerebro.
It might even be possible for users to use whatsup in sub-instances too if they'd like.
Off the top of my head:
A) overlay health might be close to having everything whatsup needs, although I'm not sure it's the right mechanism to use long term, since people may set up multiple brokers per node. Also, if a broker goes down and takes its child brokers down with it, whatsup won't have the right information.
B) what might be best is to write a module that simply listens for flux heartbeats then supplies that info to whatsup along w/ most recent heartbeat timestamp. Could also write a streaming rpc to supply up/down info so whatsup can support the --monitor option with flux. Not sure if module would need to map rank to hostname or if whatsup could do that outside of the module via other flux calls, TBD later.
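For what it's worth, the streaming piece for --monitor could boil down to something like the sketch below. It's plain Python, not flux API: it assumes the module can somehow produce successive snapshots of the set of up ranks, and the name `transitions` is made up for illustration. It only shows the "emit an event when a rank changes state" logic.

```python
from typing import Iterable, Iterator, Set, Tuple


def transitions(snapshots: Iterable[Set[int]]) -> Iterator[Tuple[int, str]]:
    """Yield (rank, state) each time a rank changes state between snapshots.

    `snapshots` is a hypothetical stream of "set of up ranks" observations;
    how the module would obtain them is left open.
    """
    previous = None
    for current in snapshots:
        if previous is not None:
            # Ranks that appeared since the last snapshot came up.
            for rank in sorted(current - previous):
                yield rank, "up"
            # Ranks that disappeared since the last snapshot went down.
            for rank in sorted(previous - current):
                yield rank, "down"
        previous = set(current)


if __name__ == "__main__":
    for rank, state in transitions([{0, 1, 2}, {0, 2}, {0, 1, 2, 3}]):
        print(f"rank {rank} is {state}")
```

The first snapshot only sets the baseline, so a client would still need one initial "everything that is up right now" response before the stream of changes starts.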
Maybe it would be useful to add a whatsup(1) compatible subcommand to flux-overlay(1)?
An idset and/or hostlist representing up and down ranks can be easily generated from groups info tracked at rank 0.
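To illustrate the kind of output this would give, here's a small sketch in plain Python. It is not the flux idset library: `encode_idset` and `up_down` are hypothetical helpers, and the only input assumed is the instance size plus the set of ranks known to be down. Ranks are collapsed into range notation like 0-3,5.

```python
from typing import Iterable, Tuple


def encode_idset(ranks: Iterable[int]) -> str:
    """Collapse integer ranks into a compact range string, e.g. "0-3,5"."""
    ids = sorted(set(ranks))
    parts = []
    i = 0
    while i < len(ids):
        # Extend j to the end of the run of consecutive ids starting at i.
        j = i
        while j + 1 < len(ids) and ids[j + 1] == ids[j] + 1:
            j += 1
        parts.append(str(ids[i]) if i == j else f"{ids[i]}-{ids[j]}")
        i = j + 1
    return ",".join(parts)


def up_down(size: int, down: Iterable[int]) -> Tuple[str, str]:
    """Return (up, down) range strings for an instance of `size` ranks."""
    down_set = set(down)
    up_set = set(range(size)) - down_set
    return encode_idset(up_set), encode_idset(down_set)


if __name__ == "__main__":
    up, down = up_down(8, [4, 6])
    print(f"up: {up}")
    print(f"down: {down}")
```

Mapping the ranks to hostnames for a hostlist would be a separate lookup step that this sketch leaves out.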
However, besides the issues with using flux that you mention, I think whatsup(1) and cerebro's simplicity are their main advantage. Flux might be part of the problem that whatsup(1) helps diagnose.
That's a good point, having a separate mechanism for determining up vs down can be good for diagnosing other problems. But there has been a desire to move on from cerebro since it's quite heavyweight for its now limited usage.
I like the idea of flux overlay whatsup as another way to view cluster health, if they would find that helpful given the caveats.
BTW, this statement is not quite accurate:
since flux will be sending around a heartbeat very similarly to other mechanisms that whatsup works with (ganglia, cerebro).
The heartbeat is sent by rank 0 to the other ranks over the tree, and it's for synchronization not monitoring. It's not like cerebro where each rank sends out a multicast beacon.
I like the idea of flux overlay whatsup as another way to view cluster health, if they would find that helpful given the caveats.
I'd actually suggest flux whatsup in this case if there isn't a strong reason to put it under the flux overlay command
The heartbeat is sent by rank 0 to the other ranks over the tree, and it's for synchronization not monitoring. It's not like cerebro where each rank sends out a multicast beacon.
Oh yeah, you're right. Now that I think about it, I think the original discussion was back when there was more discussion of uhhhh "synchronized" (possibly wrong word) messages being sent out on a timer. I think flux cron was the eventual result, but that's not quite what we'd want to use for a regular heartbeat.
I suppose on the extreme end we could just write a new service module that sends out "up/down" heartbeats if we really wanted to implement something that resembles ganglia/cerebro. It'd be quite easy and ridiculously simpler with all of flux's infrastructure in place. Perhaps that's a hack-a-thon project someday or when folks say "we really really really want to retire cerebro".
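As a rough idea of how little bookkeeping such a module would need, here's a sketch in plain Python. It is not a flux module and uses no flux API: it assumes every rank periodically sends its own beacon (cerebro style), and `LivenessTracker`, `record`, `status`, and the timeout value are all hypothetical names and choices.

```python
from typing import Dict, List, Tuple


class LivenessTracker:
    """Track the most recent beacon per rank and classify ranks as up or down."""

    def __init__(self, size: int, timeout: float) -> None:
        self.size = size          # number of ranks in the instance
        self.timeout = timeout    # seconds without a beacon before a rank is "down"
        self.last_seen: Dict[int, float] = {}

    def record(self, rank: int, timestamp: float) -> None:
        """Remember the latest beacon timestamp received from `rank`."""
        self.last_seen[rank] = timestamp

    def status(self, now: float) -> Tuple[List[int], List[int]]:
        """Return (up, down) rank lists as of time `now`."""
        up = [
            rank
            for rank in range(self.size)
            if rank in self.last_seen
            and now - self.last_seen[rank] <= self.timeout
        ]
        # Ranks never heard from, or heard from too long ago, count as down.
        down = [rank for rank in range(self.size) if rank not in up]
        return up, down


if __name__ == "__main__":
    tracker = LivenessTracker(size=4, timeout=10.0)
    tracker.record(0, 100.0)
    tracker.record(1, 95.0)
    tracker.record(2, 80.0)
    print(tracker.status(now=105.0))
```

The `last_seen` table is also where the "most recent heartbeat timestamp" for each rank would come from if whatsup wanted to report it.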
I'll write up an issue for the flux overlay whatsup though, since I'd think that'd be separate from this.
I'd actually suggest flux whatsup in this case if there isn't a strong reason to put it under the flux overlay command
We used to have the flux up command a long time ago too, which we could revive.