sergeant
Feature: add HTTP health probe
Hey, I suggest adding an HTTP server to the supervisor so that the health of the worker pool can be monitored. An HTTP request would return the status of the currently running workers, and perhaps supervisor metadata. Query string params could define what is considered a “healthy” supervisor, i.e. no more than N workers have been silent for more than T seconds.
We need to think about when a Supervisor is considered healthy. At a minimum, a responding Supervisor is healthy, yet that does not imply anything about the status of its sub-workers. How would you tell if a sub-worker is healthy? It has no communication with the Supervisor until it has finished its pile of tasks.
About the implementation suggestion: I know k8s supports HTTP health checks, but I'm not sure this is a good way to go here. Implementing an HTTP server just for health checks sounds like overkill to me. I'd explore a TCP asyncio server first. An HTTP server brings more overhead than I'm willing to pay here, and both achieve the same effect. We can also consider using a file-based probe.
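To make the TCP asyncio option concrete, here is a minimal sketch of a liveness-only probe. It only shows that the supervisor's event loop is responding and says nothing about the workers. The names `handle_probe`, `start_probe_server`, and `probe`, and the `OK` reply, are assumptions for illustration and not part of sergeant.

```python
import asyncio


async def handle_probe(reader, writer):
    # Any client that connects gets a short liveness reply, then is disconnected.
    writer.write(b"OK\n")
    await writer.drain()
    writer.close()
    await writer.wait_closed()


async def start_probe_server(host="127.0.0.1", port=0):
    # port=0 asks the OS for a free port; a real deployment would pin one.
    return await asyncio.start_server(handle_probe, host, port)


async def probe(host, port):
    # Client side of the check: connect and read the one-line reply.
    reader, writer = await asyncio.open_connection(host, port)
    data = await reader.readline()
    writer.close()
    await writer.wait_closed()
    return data


async def demo():
    server = await start_probe_server()
    port = server.sockets[0].getsockname()[1]
    async with server:
        return await probe("127.0.0.1", port)
```

On the k8s side a plain TCP probe only needs the port to accept connections, so the reply body is optional.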
Anyway, we must define "Healthy" before we proceed with either of these.
For the "healthy" definition, I'd consider "X out of Y workers have polled messages in the last T seconds". T is different for each worker, since some workers take a few seconds and some take a few minutes. The number of silent workers, X, can also be customized.
As for TCP/file-based: the upside of an HTTP health probe is that it can take parameters in a simple manner, as simple as k8s's httpGet health probe YAML block.
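A rough sketch of that definition as a pure function, counting workers whose last poll is older than T and comparing the count against a maximum. The function and parameter names are hypothetical, and a single T is used for all workers for brevity; per-worker thresholds would need a mapping instead.

```python
import time


def is_healthy(last_poll_times, max_silent_workers, silence_threshold, now=None):
    """Return True if at most `max_silent_workers` workers have been silent
    (no queue poll) for more than `silence_threshold` seconds."""
    now = time.time() if now is None else now
    silent = sum(1 for t in last_poll_times if now - t > silence_threshold)
    return silent <= max_silent_workers
```

For example, with polls at `[100, 95, 10]`, `now=100`, and a 30 second threshold, one worker is silent, so the pool is healthy when one silent worker is tolerated and unhealthy when none are.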
That way, the HTTP server does not need to decide whether its workers are healthy or not: it gets X & T as parameters and can then give the health result based on the X & T criteria.
Doing this with a file-based health probe is impossible; with TCP it is possible, yet trickier.
However, since the Supervisor serves as a middle layer, X & T can be part of its own config, and then TCP & file can be used.
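A minimal sketch of the parameterized variant using only the standard library. The query parameter names `max_silent` and `threshold`, their defaults, and the `LAST_POLL_TIMES` table are assumptions for illustration; how the supervisor would actually obtain the poll times is not settled in this thread.

```python
import json
import time
from http.server import BaseHTTPRequestHandler
from urllib.parse import parse_qs, urlparse

# Hypothetical table: worker name -> timestamp of its last queue poll.
LAST_POLL_TIMES = {}


def evaluate(path, last_poll_times, now):
    # Read X (max_silent) and T (threshold) from the query string.
    query = parse_qs(urlparse(path).query)
    max_silent = int(query.get("max_silent", ["0"])[0])
    threshold = float(query.get("threshold", ["60"])[0])
    silent = sorted(
        name for name, t in last_poll_times.items() if now - t > threshold
    )
    status = 200 if len(silent) <= max_silent else 500
    return status, {"healthy": status == 200, "silent_workers": silent}


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        status, body = evaluate(self.path, LAST_POLL_TIMES, time.time())
        payload = json.dumps(body).encode()
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)
```

A k8s httpGet probe could then carry the thresholds in its path, e.g. `/health?max_silent=1&threshold=30`, and `HealthHandler` could be served with `http.server.HTTPServer`.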
I think a health check response should return a boolean result: whether the service is healthy or not. The parameters defining healthiness should be part of the worker's configuration. I'm not sure about the availability of such a solution though. The interaction between workers and their supervisors is much more primitive than you would imagine. Their only interaction happens when the worker reaches its max_tasks_per_run. We should also think about a starving worker, which is not consuming tasks because there are no tasks to consume, not because of a health problem. We should think about the edge cases here. I think implementing a health check that merely indicates the supervisor is alive and responding is enough. Indicating the workers' healthiness should be discussed thoroughly to develop a lightweight and precise solution that is not prone to false positives.
I think a good-enough criterion for worker healthiness is whether or not they look in the queue. Making the thresholds part of the request will allow a single & simple server to respond accordingly, and the workers will remain simple, only logging their last poll time. In addition, a boolean result is great (it could be status code 200 & 500), but you could also add statistics to the response if they are available. All this approach needs is for the workers to log their last poll time in memory shared with the supervisor, a relatively small change.
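One way the shared last-poll table might look, sketched with `multiprocessing.Array`. This assumes workers are child processes of the supervisor and that each worker is given a fixed slot index; the helper names are hypothetical and nothing here reflects how sergeant currently launches workers.

```python
import multiprocessing
import time


def make_poll_table(n_workers):
    # Shared array of doubles, zero-initialised; one slot per worker.
    return multiprocessing.Array("d", n_workers)


def record_poll(table, slot, now=None):
    # Called by a worker right before it polls the queue. Polling an empty
    # queue still counts, so a starving worker is not reported as silent.
    table[slot] = time.time() if now is None else now


def snapshot(table):
    # Called by the supervisor when answering a health request.
    return list(table)


def worker_loop_once(table, slot):
    # Stand-in for one iteration of a worker's loop.
    record_poll(table, slot)


def spawn_workers(table, n_workers):
    # The table is handed to each child at process creation.
    procs = [
        multiprocessing.Process(target=worker_loop_once, args=(table, i))
        for i in range(n_workers)
    ]
    for p in procs:
        p.start()
    return procs
```

The supervisor would pass `snapshot(table)` to whatever health predicate is agreed on; a slot still at `0.0` means that worker has never polled.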