mail_room icon indicating copy to clipboard operation
mail_room copied to clipboard

Add liveness health check support

Open stanhu opened this issue 4 years ago • 8 comments

When MailRoom is run in Kubernetes, we have found occasions where MailRoom appears to have attempted to stop running, but Net::IMAP is stuck waiting for threads (https://github.com/ruby/net-imap/issues/14).

This commit adds an HTTP liveness checker to enable detection of a terminated MailRoom pod.

stanhu avatar Mar 08 '21 17:03 stanhu

Could this be accomplished if we added something that responded to SIGINFO or something like that?

tpitale avatar Mar 08 '21 17:03 tpitale

I think SIGINFO is a BSD construct, so this would only be supported in macOS or FreeBSD. Using signals in general isn't the most cross-platform friendly way of doing monitoring.

stanhu avatar Mar 08 '21 17:03 stanhu

It doesn't need to be SIGINFO specifically. Linux supports signals generally. And I think there are a lot of other issues with mailroom that would prevent running on windows anyway.

tpitale avatar Mar 08 '21 18:03 tpitale

In general, I'm apprehensive about adding webrick and a web service into the mix.

I'd rather have another repo/project that could provide a web interface of some sort, that was able to query mailroom through some other means 🤔

tpitale avatar Mar 08 '21 18:03 tpitale

Would it be preferable instead to have mail_room report out, like a heartbeat? I feel like that would require less code, generally be safer.

tpitale avatar Mar 08 '21 18:03 tpitale

In this case, I think that is more complex because now you need to have a separate process that determines whether the process is alive. HTTP liveness probes are a common practice in Kubernetes: https://www.magalix.com/blog/kubernetes-and-containers-best-practices-health-probes

As for push vs pull for metrics, Prometheus has written extensively why they prefer a pull model for monitoring, particular for detecting a downed service:

  1. https://prometheus.io/docs/introduction/faq/#why-do-you-pull-rather-than-push
  2. https://prometheus.io/blog/2016/07/23/pull-does-not-scale-or-does-it/?utm_source=thenewstack&utm_medium=website&utm_campaign=platform.

stanhu avatar Mar 08 '21 18:03 stanhu

Okay. Re-reviewing this.

If we're going to do it, I'd like to change a few things.

  1. I'd like to be more explicit that this is an HTTP health check, so that if we add other kinds of health checks down the line, this won't have to move in the configuration or naming
  2. I'd like the default not to be nil, but a NoopHealthCheck, been trying to keep from using nil configuration and the &. pattern

I can leave more specific comments in the code, if that is helpful. Sorry I didn't get back to this for months and months (new baby).

tpitale avatar Dec 14 '21 15:12 tpitale

@tpitale Congrats on your new arrival! I've updated this pull request; let me know what you think.

stanhu avatar Dec 15 '21 06:12 stanhu