Better monitoring of Fatman's condition

Open iszulcdeepsense opened this issue 3 years ago • 1 comments

Let's improve how the Dashboard reports a Fatman's current status by extending it with information such as:

memory usage status, e.g. a warning when the container is running out of memory.
how many times the container has crashed, been restarted, or been OOM-killed.

This will give better insight into what's happening with the workload.

Some deployers other than Kubernetes won't have access to this extra data. Therefore it should be done in a general way, taking advantage of the plugin system. The plugin should just report a status (Green, Yellow or Red) with an explanation field describing the reason for the malfunction.
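A minimal sketch of what such a plugin-reported status could look like. All names here (`HealthLevel`, `HealthReport`, `assess_container`) and the 90% memory threshold are hypothetical and only illustrate the idea; they are not an existing API.

```python
from dataclasses import dataclass
from enum import Enum


class HealthLevel(Enum):
    GREEN = 'green'
    YELLOW = 'yellow'
    RED = 'red'


@dataclass
class HealthReport:
    """Status plus a human-readable reason, as proposed in this issue."""
    level: HealthLevel
    explanation: str = ''


def assess_container(memory_used: int, memory_limit: int,
                     restarts: int, oom_killed: bool) -> HealthReport:
    """Map raw container metrics to a status.

    Hypothetical helper: the rules and thresholds below are assumptions
    chosen for illustration, not agreed-upon definitions.
    """
    if oom_killed:
        return HealthReport(HealthLevel.RED, 'container was OOM-killed')
    if restarts > 0:
        return HealthReport(HealthLevel.YELLOW,
                            f'container restarted {restarts} time(s)')
    if memory_limit > 0 and memory_used / memory_limit >= 0.9:
        return HealthReport(HealthLevel.YELLOW,
                            'memory usage is at or above 90% of the limit')
    return HealthReport(HealthLevel.GREEN)
```

A deployer that has no access to these metrics could simply return Green with an empty explanation, so the Dashboard only needs to understand `HealthReport`.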

Nov 04 '22 15:11 iszulcdeepsense

Good scope. I suspect it'll be a bit finicky to design right, because we can't predict what kind of deployment targets will be implemented. So maybe a common framework is the way to do it?

Agree on the red/yellow/green. Maybe to begin with only red and green, since yellow becomes a matter of definition and can be subjective?

Maybe it's something like this:

RT supports displaying the fatman status
The job type is responsible for giving RT the means to know whether a given fatman is red or green. So for example, if we deploy a k8s fatman, then the k8s job type plugin has a method, e.g. check_fatman_status(fatman_id), which RT can call, and it maybe returns red/green along with a hint?

This way it's RT that has the responsibility and the job type plugin which has the implementation.
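A rough sketch of that split, assuming a hypothetical plugin registry keyed by job type. Only `check_fatman_status(fatman_id)` comes from the proposal above; the class names, the registry, and the `'unknown'` fallback are made up for illustration.

```python
from typing import Dict, Protocol, Tuple


class JobTypePlugin(Protocol):
    """Interface a job type plugin would implement (hypothetical)."""

    def check_fatman_status(self, fatman_id: str) -> Tuple[str, str]:
        """Return (status, hint), where status is 'red' or 'green'."""
        ...


class FakeK8sJobTypePlugin:
    """Stand-in for a k8s plugin; a real one would query the cluster."""

    def __init__(self, crashed_ids):
        self._crashed = set(crashed_ids)

    def check_fatman_status(self, fatman_id: str) -> Tuple[str, str]:
        if fatman_id in self._crashed:
            return 'red', 'container is crash-looping'
        return 'green', ''


def status_for_dashboard(plugins: Dict[str, JobTypePlugin],
                         job_type: str, fatman_id: str) -> Tuple[str, str]:
    """RT side: owns the feature and delegates the check to the plugin."""
    plugin = plugins.get(job_type)
    if plugin is None:
        # Assumption: job types without a status check are shown as unknown.
        return 'unknown', f'no status check available for job type {job_type}'
    return plugin.check_fatman_status(fatman_id)


plugins = {'k8s': FakeK8sJobTypePlugin(crashed_ids=['adder-v1'])}
print(status_for_dashboard(plugins, 'k8s', 'adder-v1'))
```

RT never needs to know how the k8s plugin decides; it only relies on the method existing and returning a status with a hint.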

Nov 07 '22 05:11 JosefAssadERST