Better monitoring of Fatman's condition

Open iszulcdeepsense opened this issue 3 years ago • 1 comments

Let's improve how a Fatman's current status is reported on the Dashboard by extending it with information such as:

  • memory usage status, e.g. a warning when the Fatman is running out of memory.
  • how many times the container has crashed, been restarted, or been OOM-killed.

This will give better insight into what's happening with the workload.

Some deployers other than Kubernetes won't have access to this extra data. Therefore it should be achieved in a general way, taking advantage of the plugin system. The plugin should just report a status (Green, Yellow or Red) with an explanation field describing the reason for the malfunction.
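To make the idea concrete, here is a rough sketch of what such a report could look like. Everything in it is an assumption for illustration only: the names `HealthColor`, `HealthReport` and `assess_health`, the 90% memory threshold, and the rule that any restart means Yellow are not part of Racetrack.

```python
from dataclasses import dataclass
from enum import Enum


class HealthColor(Enum):
    GREEN = "green"
    YELLOW = "yellow"
    RED = "red"


@dataclass(frozen=True)
class HealthReport:
    """Hypothetical payload a deployer plugin could hand back to the Dashboard."""
    color: HealthColor
    explanation: str = ""


# Assumed threshold: warn once 90% of the memory limit is in use.
MEMORY_WARNING_RATIO = 0.9


def assess_health(
    memory_used_bytes: int,
    memory_limit_bytes: int,
    restart_count: int,
    oom_killed: bool,
) -> HealthReport:
    """Map raw container metrics to a color plus a human-readable reason."""
    if oom_killed:
        return HealthReport(HealthColor.RED, "container was OOM-killed")
    # A limit of 0 is treated as "no limit set", so the ratio check is skipped.
    if memory_limit_bytes > 0:
        ratio = memory_used_bytes / memory_limit_bytes
        if ratio >= MEMORY_WARNING_RATIO:
            return HealthReport(
                HealthColor.YELLOW,
                f"memory usage at {ratio:.0%} of the limit",
            )
    if restart_count > 0:
        return HealthReport(
            HealthColor.YELLOW,
            f"container restarted {restart_count} time(s)",
        )
    return HealthReport(HealthColor.GREEN)
```

A deployer that cannot collect these metrics could simply return Green with an empty explanation, or skip reporting altogether.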

iszulcdeepsense avatar Nov 04 '22 15:11 iszulcdeepsense

Good scope. I suspect it'll be a bit finicky to design right, because we can't predict what kind of deployment targets will be implemented. So maybe a common framework is the way to do it?

Agree on the red/yellow/green. Maybe to begin with only red and green, since yellow becomes a matter of definition and can be subjective?

Maybe it's something like this:

  • RT supports displaying the fatman status
  • The job type is responsible for giving RT the functionality to know whether a given fatman is red or green. So for example, if we deployed a k8s fatman, then the k8s job type plugin has a method e.g. `check_fatman_status(fatman_id)` which RT can execute, and it maybe returns red/green along with a hint?

This way RT holds the responsibility and the job type plugin provides the implementation.
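Roughly, that split could look like the sketch below. It is only an illustration under assumptions: `FatmanStatus`, `ExampleJobTypePlugin`, `PluginWithoutStatus` and `get_fatman_status` are made-up names, and the plugin here reads from an in-memory dict instead of asking a real cluster. Only `check_fatman_status(fatman_id)` comes from the proposal above.

```python
from typing import Dict, NamedTuple, Optional


class FatmanStatus(NamedTuple):
    color: str  # "green" or "red", per the two-color starting point
    hint: str   # short explanation, empty when everything is fine


class ExampleJobTypePlugin:
    """Hypothetical plugin; a real one would query its deployment target."""

    def __init__(self, failures: Dict[str, str]) -> None:
        # Maps fatman_id -> failure reason, standing in for real cluster state.
        self._failures = failures

    def check_fatman_status(self, fatman_id: str) -> FatmanStatus:
        reason = self._failures.get(fatman_id)
        if reason is not None:
            return FatmanStatus("red", reason)
        return FatmanStatus("green", "")


class PluginWithoutStatus:
    """A plugin for a target that cannot report status at all."""


def get_fatman_status(plugin: object, fatman_id: str) -> Optional[FatmanStatus]:
    """RT side: ask the plugin if it can answer, otherwise report nothing."""
    check = getattr(plugin, "check_fatman_status", None)
    if not callable(check):
        return None  # status unavailable for this deployment target
    return check(fatman_id)


plugin = ExampleJobTypePlugin({"fatman-a": "container OOM-killed"})
status_a = get_fatman_status(plugin, "fatman-a")
status_b = get_fatman_status(plugin, "fatman-b")
status_none = get_fatman_status(PluginWithoutStatus(), "fatman-a")
```

Returning `None` when the method is missing keeps RT working with plugins for targets that have no way of checking status.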

JosefAssadERST avatar Nov 07 '22 05:11 JosefAssadERST