Add liveness health check support
When MailRoom is run in Kubernetes, we have found occasions where
MailRoom appears to have attempted to stop running, but Net::IMAP is
stuck waiting for threads (https://github.com/ruby/net-imap/issues/14).
This commit adds an HTTP liveness checker to enable detection of a terminated MailRoom pod.
Could this be accomplished if we added something that responded to SIGINFO or something like that?
I think SIGINFO is a BSD construct, so this would only be supported in macOS or FreeBSD. Using signals in general isn't the most cross-platform friendly way of doing monitoring.
It doesn't need to be SIGINFO specifically. Linux supports signals generally. And I think there are a lot of other issues with mailroom that would prevent running on windows anyway.
In general, I'm apprehensive about adding webrick and a web service into the mix.
I'd rather have another repo/project that could provide a web interface of some sort, that was able to query mailroom through some other means 🤔
Would it be preferable instead to have mail_room report out, like a heartbeat? I feel like that would require less code, generally be safer.
In this case, I think that is more complex because now you need to have a separate process that determines whether the process is alive. HTTP liveness probes are a common practice in Kubernetes: https://www.magalix.com/blog/kubernetes-and-containers-best-practices-health-probes
As for push vs pull for metrics, Prometheus has written extensively why they prefer a pull model for monitoring, particular for detecting a downed service:
- https://prometheus.io/docs/introduction/faq/#why-do-you-pull-rather-than-push
- https://prometheus.io/blog/2016/07/23/pull-does-not-scale-or-does-it/?utm_source=thenewstack&utm_medium=website&utm_campaign=platform.
Okay. Re-reviewing this.
If we're going to do it, I'd like to change a few things.
- I'd like to be more explicit that this is an HTTP health check, so that if we add other kinds of health checks down the line, this won't have to move in the configuration or naming
- I'd like the default not to be
nil, but aNoopHealthCheck, been trying to keep from using nil configuration and the&.pattern
I can leave more specific comments in the code, if that is helpful. Sorry I didn't get back to this for months and months (new baby).
@tpitale Congrats on your new arrival! I've updated this pull request; let me know what you think.