postgres-operator icon indicating copy to clipboard operation
postgres-operator copied to clipboard

feat: Add configurable liveness probe

Open davidspek opened this issue 3 years ago • 7 comments

Closes: https://github.com/zalando/postgres-operator/issues/58, https://github.com/zalando/postgres-operator/issues/1703

This PR adds a new field to the Operator configuration parameters and the Postgres spec allowing users too specify a liveness probe for the Spilo container. This is necessary too resolve errors with Postgres going down if communication with the K8s API temporarily fails, as described in this issue.

~~The Postgres CRD managed by the operator doesn't seem to have the new field added, trying to fix that still. Also, it doesn't seem like the statefulset is updated when the liveness probe is changed which still needs fixing.~~

The CRD managed by the operator now includes the new field and changes to the liveness and readiness probes now properly update the statefulset.

davidspek avatar Jul 25 '22 13:07 davidspek

LivenessProbes are dangerous. :fire: :fire: :fire: I'm fine with a toggle so that anybody, who wants to risk it, can go with them. But, I'd prefer just one toggle and the operator creating a probe with a setup we think works for most cases. Like I proposed it for the ReadinessProbe here #2004 . I don't want to have a full spec option in the CRD manifests. Chances of misuse would be even higher then.

FxKu avatar Aug 23 '22 16:08 FxKu

Imho the point of the article is not that Liveness probes are dangerous per se. You just have to use them right.

I even would go so far to say that a deployment which can't properly crash and restart automatically because of a missing liveness probe is just broken, because that's one of the essential things using a container in kubernetes.

Therefore it should be configurable at least even if not used by default.

Currently we have to use an ugly workaround to get our postgres instances running reliably: https://github.com/monotek/patroni-selfheal

monotek avatar Aug 24 '22 10:08 monotek

I would still favor one pre-configured liveness probe that works for the operator and have only one toggle for users to switch.

FxKu avatar Aug 29 '22 17:08 FxKu

We actually developed this as we didn’t want to use patroni-selfheal, and have been using it in prod on multiple cluster for 2 months now without issues and solving the problem we had. The main problem with a single liveness probe toggle is that it will likely not work in many scenarios, similarly to the readiness probe. One example is if a replica needs to clone a very large DB from the master, the probes would need to be configurable for this use case specifically. We were actually planning on making the readiness probe configurable as well because of this.

In terms of errors configuring the probes, they have the same type and validation as Kubernetes so that should take care of part of it. We can also add docs, examples, etc to inform people how to configure the probe and give a good starting point they can use.

davidspek avatar Sep 19 '22 21:09 davidspek

@FxKu friendly bump since I believe this is a pretty essential feature for the operator.

davidspek avatar Nov 01 '22 16:11 davidspek

@FxKu Friendly bump since I've just rebased on the latest release.

davidspek avatar May 16 '23 09:05 davidspek

Hi, having this PR and feature merged seems a very good improvement. Any luck that it will happen?

aleskandro avatar Nov 23 '23 04:11 aleskandro