pyctuator
Add optional support for k8s liveness and readiness probes
See https://docs.spring.io/spring-boot/docs/current/reference/html/production-ready-features.html#production-ready-kubernetes-probes for how it's done in Spring Boot's actuator.
At a high level, pyctuator should support telling k8s when an application/service is ready to serve requests (the readiness probe) and whether the application is alive (the liveness probe). This is different from the health status returned by /pyctuator/health for two reasons:
- It should return fast, so it shouldn't include health checks of external resources.
- It should reflect the k8s lifecycle as described in actuator's documentation (and in the k8s docs, of course).
While actuator allows configuring additional checks for these probes, I suggest that initially we provide a default implementation that users can choose to use.
Since Kubernetes tries to fix a broken liveness state by restarting the application instance, the liveness probe should not depend on health checks of external systems. Otherwise, if an external system fails (e.g. a database, a web API, an external cache) and causes the liveness probe to fail, Kubernetes might restart all application instances and create cascading failures.
Also, since Kubernetes does not route traffic to an instance whose readiness state is unready, checking external systems in the readiness probe must be done carefully by the application developers.
The probes should return 200 if the server is alive/ready and >=400 otherwise (or fail to respond entirely).
For more details on these probes, see https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/.
Therefore I suggest:
- Add two new endpoints under `/pyctuator/health/probes`, one for `readiness` and another for `liveness`.
- By default, all they do is return `200 OK`.
- Allow registering custom "readiness" checks, which are functions that throw an exception to indicate not-ready.
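A minimal sketch of what this could look like. Note that the function names and the check-registration API here are hypothetical, not pyctuator's actual API:

```python
from typing import Callable, List

# A readiness check raises an exception to signal "not ready" (hypothetical API).
ReadinessCheck = Callable[[], None]

_readiness_checks: List[ReadinessCheck] = []

def register_readiness_check(check: ReadinessCheck) -> None:
    # Register a user-supplied readiness check.
    _readiness_checks.append(check)

def liveness_probe() -> int:
    # Liveness must not depend on external systems; by default just return 200.
    return 200

def readiness_probe() -> int:
    # Any exception from a registered check means "not ready" (HTTP 503).
    for check in _readiness_checks:
        try:
            check()
        except Exception:
            return 503
    return 200
```

With no checks registered, both probes return 200, matching the default behavior suggested above; a user could then register, say, a cache-warmup check, and k8s would keep the pod out of the service endpoints until it stops raising.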
WDYT?
There's another problem: when the registration with the admin server fails, pyctuator does not serve actuator endpoints at all. This behavior leads to cascading restarts, since the readiness probe fails with "connection refused".
I see. Can you provide steps to reproduce? Thanks
- Point registration_url to a URL that does not resolve (e.g. a service in a k8s cluster with no running pod)
- Create an instance of Pyctuator (with Flask)
- Flask only becomes ready after a timeout of more than two minutes (Failed registering with boot-admin, [Errno 110] Connection timed out)

Expected behavior: an exception is thrown much earlier.
Thanks, will look into this.
Actually, assuming you run your Flask application in a k8s pod too, I'm wondering whether this might be a behavior of k8s. Can you try to curl the same URL from the pod running the Flask application? I wonder if you'll get an immediate answer.
Curl times out after two minutes...
Yes, I just tested it myself. I'm not sure I can make pyctuator time out earlier in this case.
How about connecting in the background? The service may be healthy without admin registration....
I agree that you don't want to block an application from starting because the monitoring system is down.
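One way to avoid blocking startup would be to retry the registration in a daemon thread. A sketch, assuming a hypothetical `register()` callable that performs one registration attempt:

```python
import threading
import time

def _register_with_retry(register, interval_s: float) -> None:
    # Keep retrying in the background; registration failures must never
    # prevent the application itself from serving traffic.
    while True:
        try:
            register()
            return
        except OSError:
            time.sleep(interval_s)

def start_background_registration(register, interval_s: float = 10.0) -> threading.Thread:
    # Daemon thread: the process can still exit even if registration
    # never succeeds (e.g. the admin server is down for good).
    thread = threading.Thread(
        target=_register_with_retry, args=(register, interval_s), daemon=True
    )
    thread.start()
    return thread
```

With this, the Flask app would become ready immediately, while registration with boot-admin happens (and retries) in the background.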
Pyctuator is using the builtin `http.client.HTTPConnection` so it doesn't force you to use any HTTP library - whatever the solution turns out to be, we need to maintain this.
Looking at `http.client.HTTPConnection`, I see there's an optional `timeout` that we may use, see https://docs.python.org/3/library/http.client.html
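For example, the `timeout` argument caps the connect/read time instead of waiting for the OS-level TCP timeout (~2 minutes, as observed above). The `try_register` helper and the 10-second default here are illustrative, not pyctuator's actual code:

```python
import http.client

def try_register(host: str, port: int, timeout_s: float = 10.0) -> bool:
    # timeout= applies to connecting and to blocking socket reads, so a
    # non-resolving or non-answering admin server fails fast.
    conn = http.client.HTTPConnection(host, port, timeout=timeout_s)
    try:
        conn.request("POST", "/instances")
        return conn.getresponse().status < 400
    except OSError:
        # Covers refused connections, DNS failures, and socket timeouts;
        # registration failure should not stop the application.
        return False
    finally:
        conn.close()
```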
I can make this a config parameter with a default that's much lower, maybe 10s.
That'd be great, thanks.
Oh, just noticed we are discussing this hang within the issue asking for k8s probes. Moving the discussion to #51