sergeant Feature: add http health probe

Hey, I suggest adding an HTTP server to the supervisor so that the health of the worker pool can be monitored. An HTTP request would return the status of the currently running workers, and perhaps some supervisor metadata. Query string params could define what counts as a “healthy” supervisor, e.g. no more than N workers silent for more than T seconds.

Dec 31 '20 23:12 dhkron

We need to think about when a Supervisor is considered healthy. At the most basic level, a responding Supervisor is healthy, yet that implies nothing about the status of its sub-workers. How would you tell if a sub-worker is healthy? It has no communication with the Supervisor until it has finished its pile of tasks.

About the implementation suggestion, I know k8s supports HTTP health checks, but I'm not sure this is a good way to go here. Implementing an HTTP server just for health checks sounds like overkill to me. I'd explore a TCP asyncio server first. An HTTP server brings more overhead than I'm willing to pay here, and both achieve the same result. We can also consider using a file-based probe.
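For reference, the TCP asyncio option could look roughly like the sketch below. The names `handle_probe` and `serve_probe`, the port 8765, and the `OK` payload are all made up for illustration; none of this is existing sergeant code.

```python
import asyncio


async def handle_probe(reader: asyncio.StreamReader, writer: asyncio.StreamWriter) -> None:
    # Being able to accept a connection already shows that the supervisor's
    # event loop is responsive. The payload is only a convenience for anyone
    # poking the port manually.
    writer.write(b"OK\n")
    await writer.drain()
    writer.close()
    await writer.wait_closed()


async def serve_probe(host: str = "127.0.0.1", port: int = 8765) -> None:
    # Hypothetical entry point: the port number is an arbitrary assumption.
    server = await asyncio.start_server(handle_probe, host, port)
    async with server:
        await server.serve_forever()
```

Running `asyncio.run(serve_probe())` would start the probe. A k8s `tcpSocket` probe only needs the connection to succeed, so this says "the supervisor is alive" and nothing about the workers.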

Anyway, we must define "Healthy" before we proceed with either of these.

Jan 03 '21 09:01 wavenator

For the "healthy" definition, I'd consider "X out of Y workers have polled messages in the last T seconds". T differs per worker, since some workers take a few seconds per task and others take a few minutes. The number of silent workers, X, can also be customized.
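As a rough sketch, that rule could be a single function. It reads the threshold as "at most this many workers may be silent", and it assumes the supervisor can see a mapping of worker name to last poll timestamp, which does not exist today. `is_pool_healthy` and its parameter names are hypothetical.

```python
import time
from typing import Dict, Optional


def is_pool_healthy(
    last_poll_times: Dict[str, float],
    max_silent_workers: int,
    silence_threshold_seconds: float,
    now: Optional[float] = None,
) -> bool:
    # `now` is injectable so the rule can be checked deterministically.
    now = time.time() if now is None else now
    # A worker counts as silent if its last poll is older than T seconds.
    silent = [
        worker
        for worker, last_poll in last_poll_times.items()
        if now - last_poll > silence_threshold_seconds
    ]
    return len(silent) <= max_silent_workers
```

With last polls at 100, 95 and 10 seconds, `now=100` and T=30, exactly one worker is silent, so the pool is healthy when one silent worker is tolerated and unhealthy when none are.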

As for TCP/file-based probes - the upside of an HTTP health probe is that it can take parameters in a simple manner, as simple as k8s's httpGet health probe yaml block. That way, the HTTP server does not need to decide on its own whether its workers are healthy - it gets X & T as parameters and returns the healthiness result based on those criteria. Doing this with a file-based health probe is impossible; with TCP it is possible, yet trickier. However, since the Supervisor serves as a middle layer, X & T can be part of its own config, and then TCP & file probes can be used.

Jan 03 '21 09:01 dhkron

I think a health check response should return a boolean result: whether the service is healthy or not. The parameters defining healthiness should be part of the worker's configuration. I'm not sure about the feasibility of such a solution, though. The interaction between workers and their supervisors is much more primitive than you would imagine. Their only interaction happens when the worker reaches its max_tasks_per_run. We should also think about a starving worker, one that is not consuming tasks because there are no tasks to consume rather than because of a healthiness problem. We should think about the edge cases here. I think implementing a health check that merely indicates the supervisor is alive and responding is enough. Indicating the workers' healthiness should be discussed thoroughly to develop a lightweight and precise solution that is not prone to false positives.

Jan 03 '21 09:01 wavenator

I think a good-enough criterion for worker healthiness is whether or not the workers poll the queue. Making the thresholds part of the request allows a single & simple server to respond accordingly, and the workers remain simple - they only log their last poll time. In addition, a boolean result is great - it could be status code 200 or 500 - but you could also add statistics to the response, if they are available. All this approach needs is for the workers to log their last poll time in memory shared with the supervisor, a relatively small change.
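To make this concrete, here is a minimal sketch using only the standard library. It assumes `last_polls` is a shared sequence of timestamps with one slot per worker (for instance a `multiprocessing.Array("d", worker_count)` that each worker updates when it polls). The `max_silent` and `timeout` query params, their defaults, and `make_handler` are names invented for this example.

```python
import json
import time
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlparse


def make_handler(last_polls):
    """Build a handler class bound to the shared last-poll timestamps."""

    class ProbeHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            query = parse_qs(urlparse(self.path).query)
            try:
                # Thresholds come from the request; defaults are arbitrary.
                max_silent = int(query.get("max_silent", ["0"])[0])
                timeout = float(query.get("timeout", ["60"])[0])
            except ValueError:
                self.send_error(400, "max_silent and timeout must be numeric")
                return
            now = time.time()
            snapshot = list(last_polls[:])  # copy once, then count
            silent = sum(1 for last_poll in snapshot if now - last_poll > timeout)
            healthy = silent <= max_silent
            body = json.dumps(
                {"healthy": healthy, "silent_workers": silent, "total_workers": len(snapshot)}
            ).encode()
            # Boolean result via status code, statistics in the body.
            self.send_response(200 if healthy else 500)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, format, *args):
            pass  # keep probe traffic out of stderr

    return ProbeHandler
```

The supervisor would then run something like `HTTPServer(("0.0.0.0", port), make_handler(last_polls)).serve_forever()` in a background thread, and the probe would request `/health?max_silent=1&timeout=60`. This does not handle the starving-worker case raised earlier in the thread: a worker only looks healthy here if it keeps polling even when the queue is empty.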

Jan 03 '21 14:01 dhkron
