
Automated Pod debugging

Open timstoop opened this issue 5 years ago • 0 comments

Is your feature request related to a problem? Please describe. When a Pod restarts, figuring out why still requires some manual steps. Those steps could be automated, since almost all of the relevant information is part of the Pod status. Instead of just notifying you that a Pod has restarted (which Prometheus already does as well), it would be nice if botkube ran a few automated checks to find out more about the restart.

Describe the solution you'd like A very simple solution to start out with would be:

Pod restarts -> check the status of the container that caused the restart:
- Is it OOMKilled? Report that.
- Is it Terminated with a non-zero exit status? Report that, and maybe show the last 100 lines of the previous log for the Pod.

Afterwards, this could be extended with a form of knowledge base. For instance, if the exit code is 137 (128 + 9, meaning the process was killed with SIGKILL), it could explain that this often happens when the container runs out of memory and is killed. This would make it much easier for users, especially those new to Kubernetes, to start debugging their apps.
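Such a knowledge base could start as a simple lookup table. The Go sketch below is only an illustration: `exitCodeHints` and `explainExitCode` are hypothetical names, the hint texts are examples rather than an authoritative list, and the fallback relies on the common convention that an exit code above 128 means the process was terminated by signal number (code - 128).

```go
package main

import "fmt"

// exitCodeHints is an illustrative, deliberately small knowledge base.
var exitCodeHints = map[int]string{
	1:   "generic application error; check the previous container log",
	137: "killed by SIGKILL (128 + 9); often the container ran out of memory",
	139: "killed by SIGSEGV (128 + 11); the process crashed with a segmentation fault",
	143: "terminated by SIGTERM (128 + 15); often a regular shutdown request",
}

// explainExitCode returns a hint for a known exit code and otherwise falls
// back to the 128 + signal convention.
func explainExitCode(code int) string {
	if hint, ok := exitCodeHints[code]; ok {
		return fmt.Sprintf("exit code %d: %s", code, hint)
	}
	if code > 128 && code <= 128+64 {
		return fmt.Sprintf("exit code %d: probably terminated by signal %d", code, code-128)
	}
	return fmt.Sprintf("exit code %d: no hint available", code)
}

func main() {
	for _, code := range []int{137, 130, 42} {
		fmt.Println(explainExitCode(code))
	}
}
```

A table like this is easy to extend as users report new cases, which fits the goal of helping people who are new to Kubernetes.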

Describe alternatives you've considered We've currently documented these steps for our users. Having them automated would be preferred.

timstoop, Dec 19 '19 07:12