Better monitoring of Fatman's condition
Let's improve how the Dashboard reports a Fatman's current status by extending it with information such as:
- memory usage status, e.g. a warning when the Fatman is running out of memory.
- how many times the container has crashed, been restarted, or been OOM-killed.
This will give better insight into what's happening with the workload.
Some deployers other than Kubernetes won't have access to this extra data. Therefore it should be achieved in a general way, taking advantage of the plugin system.
The plugin should just report a status (Green, Yellow or Red) with an explanation field describing the reason for the malfunction.
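To make the idea concrete, here is a rough sketch of what such a report could look like. Everything in it is an assumption for illustration: the names `Health`, `StatusReport` and `report_from_metrics`, the input fields, and the 90% memory threshold are all hypothetical and not part of any existing API.

```python
from dataclasses import dataclass
from enum import Enum


class Health(Enum):
    GREEN = "green"
    YELLOW = "yellow"
    RED = "red"


@dataclass
class StatusReport:
    health: Health
    explanation: str = ""  # left empty when nothing is wrong


def report_from_metrics(
    memory_used: int, memory_limit: int, restarts: int, oom_killed: bool
) -> StatusReport:
    """Map raw container metrics to a status report.

    The rules and the 90% threshold are placeholders, not agreed values.
    """
    if oom_killed:
        return StatusReport(Health.RED, "container was OOM-killed")
    if restarts > 0:
        return StatusReport(Health.YELLOW, f"container restarted {restarts} time(s)")
    if memory_limit > 0 and memory_used / memory_limit >= 0.9:
        return StatusReport(Health.YELLOW, "memory usage above 90% of the limit")
    return StatusReport(Health.GREEN)
```

A deployer that cannot collect these metrics would simply not produce such a report, which is why the reporting itself belongs in the plugin.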
Good scope. I suspect it'll be a bit finicky to design right, because we can't predict what kind of deployment targets will be implemented. So maybe a common framework is the way to do it?
Agree on the red/yellow/green. Maybe to begin with only red and green, since yellow becomes a matter of definition and can be subjective?
Maybe it's something like this:
- RT supports displaying the fatman status
- The job type is responsible for giving RT the functionality to know whether a given fatman is red or green. So for example, if we deployed a k8s fatman, then the k8s job type plugin has a method, e.g. check_fatman_status(fatman_id), which RT can execute, and it maybe returns red/green along with a hint?
This way it's RT that has the responsibility and the job type plugin which has the implementation.
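Roughly, as a sketch of that split (not a proposal for the final interface): only `check_fatman_status(fatman_id)` comes from the suggestion above. The helper `fatman_status`, the class `ExampleKubernetesPlugin`, the tuple return shape, and the extra "unknown" fallback for plugins without the method are all assumptions made for illustration.

```python
from typing import Any, Dict, Optional, Tuple

Status = Tuple[str, Optional[str]]  # (colour, hint)


def fatman_status(plugin: Any, fatman_id: str) -> Status:
    """RT side: ask the plugin for the status, if it can provide one."""
    check = getattr(plugin, "check_fatman_status", None)
    if check is None:
        # Assumed fallback: this deployment target has no status hook.
        return ("unknown", "plugin does not report status")
    return check(fatman_id)


class ExampleKubernetesPlugin:
    """Hypothetical plugin; a real one would query the cluster instead."""

    def __init__(self, restart_counts: Dict[str, int]) -> None:
        self._restart_counts = restart_counts

    def check_fatman_status(self, fatman_id: str) -> Status:
        restarts = self._restart_counts.get(fatman_id, 0)
        if restarts > 0:
            return ("red", f"container restarted {restarts} time(s)")
        return ("green", None)
```

RT only depends on the method name and return shape, so a new deployment target can opt in by implementing the one method, and targets that can't provide the data are still displayed without breaking anything.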