idea: health check at startup
Problem: as noted in #7220, when a broker restarts, there is no node "health check" that would detect trouble with a node (after an OOM for example) before allowing jobs to run there.
The simple workaround proposed in #7220 was to drop a site-provided script into rc1.d. If the script fails, startup aborts. However, this means the cause of the node being offline is not immediately obvious: the sysadmin would have to check the node's journal to see any errors logged by the script.
This seems like a general problem. A related issue is:
- #6590
Just wanted to get this open to see if we can come up with a generalized solution for this class of problems.
This is a great idea, and it would be really powerful if designed as a plugin framework (so it is easy to write and register custom health checks). Another question is whether checks might vary based on the node in the cluster, or even the user or time of day. Examples why:
- My nodes may have different kinds of GPU (or more generally, hardware)
- A user job was doing something specific that leads to well-known kinds of errors (e.g., Usernetes or a container, which could have something still running)
Things to think about:
- Whether health checks should have triggers based on metrics (e.g., if a node does X this many times, we need to respond with action Y)
- Whether I could run health checks on demand (e.g., "Check that this node's networking is still OK in this way.")
- What is the distinction between a health check and collection of some metric - a binary outcome (good / bad, ok / not ok)?
- To follow up on the above - if the check fails and the script aborts, that implies a binary outcome (and means turning metrics into thresholds)
- How would checks be put in logical groups for execution (e.g., "Run all the NVIDIA GPU checks")
- Can health checks be added by users too?
- Do any health checks require more privilege?
- Do results of health checks get saved over time?
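To make a couple of these questions concrete, here is a minimal Python sketch of one way a metric could be turned into a binary outcome with a threshold, and of how checks could be tagged with a group so that a whole group runs together. Everything here is hypothetical and for illustration only: the `Check` class, `run_group`, the group names, and the sample metric values are assumptions, not an existing Flux interface.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Check:
    """A hypothetical health check: a metric collector plus a threshold."""
    name: str
    group: str                     # logical group, e.g. "nvidia"
    collect: Callable[[], float]   # returns the raw metric value
    threshold: float               # metric above this value means "not ok"

    def run(self) -> bool:
        # The metric becomes a binary outcome only at this comparison.
        return self.collect() <= self.threshold


def run_group(checks: List[Check], group: str) -> Dict[str, bool]:
    """Run every check in one group and map check name to ok / not ok."""
    return {c.name: c.run() for c in checks if c.group == group}


# Sample data (made up): lambdas stand in for real metric collectors.
checks = [
    Check("gpu-temp-celsius", "nvidia", lambda: 70.0, threshold=85.0),
    Check("gpu-ecc-errors", "nvidia", lambda: 3.0, threshold=0.0),
    Check("link-flaps", "network", lambda: 0.0, threshold=2.0),
]

nvidia_results = run_group(checks, "nvidia")
```

In this framing the raw value returned by `collect` is the metric, and the health check is the metric plus a policy (the threshold), which is one possible answer to the metric vs. check question.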
Good thoughts @vsoch !
I think our admins already have a workable node health check system which they run at startup and may be running from housekeeping as well. I suspect they would prefer a mechanism to hold resources offline until a site-provided script returns successfully.
Where to hook that in is a question. We've discussed resource module (the core one) plugins in the past. One approach might be to add plugins with an initial healthcheck callback that is passed the R fragment for the local node. A simple plugin that just runs a (local) script could be provided. The resource module could hold the local node offline until all plugin healthcheck callbacks return success.
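A rough sketch of that gating logic, in Python for brevity. This is not the resource module's API: `LocalNodeGate`, `register`, `evaluate`, and the shape of the `r_fragment` dict are all assumptions made up for illustration (the real R format is not modeled here). The idea shown is only that every registered callback receives the local R fragment and the node stays offline until all of them return success.

```python
from typing import Callable, Dict, List

# Hypothetical callback type: takes the local R fragment, returns True on success.
HealthCheck = Callable[[dict], bool]


class LocalNodeGate:
    """Holds the local node offline until all healthcheck callbacks pass."""

    def __init__(self, r_fragment: dict):
        self.r_fragment = r_fragment
        self.callbacks: Dict[str, HealthCheck] = {}
        self.online = False  # the node starts out held offline

    def register(self, name: str, callback: HealthCheck) -> None:
        self.callbacks[name] = callback

    def evaluate(self) -> List[str]:
        """Run every callback and return the names of those that failed."""
        failed = []
        for name, callback in self.callbacks.items():
            try:
                ok = callback(self.r_fragment)
            except Exception:
                ok = False  # a crashing plugin counts as a failed check
            if not ok:
                failed.append(name)
        # The node comes online only if no callback failed.
        self.online = not failed
        return failed


# Made-up R fragment and plugins, for illustration only.
gate = LocalNodeGate({"hostname": "node1", "gpus": 4})
gate.register("has-gpus", lambda r: r.get("gpus", 0) > 0)
gate.register("memory-ok", lambda r: False)  # simulates a failing check
failed = gate.evaluate()
```

Returning the names of the failed checks, rather than a bare pass/fail, would give the resource module something to report as the offline reason, so an admin would not need to dig through the node's journal.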
@vsoch's :brain: :cloud_with_lightning: list above is pretty helpful when thinking about such a new capability!
Another healthcheck use-case suggestion from SNL:
- Ability to configure a synchronous healthcheck. As @garlick mentioned, a simple plugin that just runs a local script is fine. Any action to be taken must be explicitly performed by the script. The script is expected to complete within a configurable time limit.
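As a sketch of what a synchronous, time-limited script runner could look like, here is a small Python example built on the standard library's `subprocess.run` with its `timeout` argument. The function name `run_healthcheck_script` and the returned `(ok, reason)` tuple are assumptions for illustration, not an existing Flux interface. The stand-in "scripts" below are just Python one-liners so the example is self-contained.

```python
import subprocess
import sys
from typing import List, Tuple


def run_healthcheck_script(argv: List[str], timeout_s: float) -> Tuple[bool, str]:
    """Run a site-provided script synchronously; fail on nonzero exit or timeout."""
    try:
        result = subprocess.run(
            argv, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising TimeoutExpired.
        return False, f"timed out after {timeout_s}s"
    except OSError as exc:
        # For example, the script is missing or not executable.
        return False, f"could not run script: {exc}"
    if result.returncode != 0:
        return False, f"exit code {result.returncode}: {result.stderr.strip()}"
    return True, "ok"


# Stand-in scripts: one that succeeds and one that exceeds the time limit.
passing = run_healthcheck_script([sys.executable, "-c", "pass"], timeout_s=30)
too_slow = run_healthcheck_script(
    [sys.executable, "-c", "import time; time.sleep(5)"], timeout_s=0.5
)
```

The runner only reports success or failure with a reason. Consistent with the suggestion above, any remediation is left to the script itself.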