cert-manager
Adding probes to the cert-manager pods
**Is your feature request related to a problem? Please describe.**
As part of Kubernetes best practices, I'd like to set readiness and liveness probes on all the containers deployed on my infrastructure. As of now, the cert-manager charts lack this capability.

**Describe the solution you'd like**
An easy solution would be hard-coding the probes into the cert-manager templates (cert-manager and cert-manager-cainjector). A more complete solution would be making those probes customizable via values.yaml.

**Describe alternatives you've considered**
At the moment, the only alternative for setting up probes is manually patching the deployments after helm install, which isn't best practice devops-wise and is infeasible for automated rollouts across a wide array of clusters.
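The values.yaml-based customization could look roughly like this (a sketch of the proposed interface; these keys are hypothetical and not current chart options):

```yaml
# Hypothetical values.yaml keys for configurable probes -- illustrative
# only, not options the chart supports today.
livenessProbe:
  enabled: true
  httpGet:
    path: /healthz   # assumes a health endpoint would be added
    port: 9402
  initialDelaySeconds: 10
  periodSeconds: 10
readinessProbe:
  enabled: true
  tcpSocket:
    port: 9402       # metrics port, as a stand-in readiness signal
```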
/kind feature
This is an interesting thing to think about. How would we define the readiness and liveness of a Kubernetes operator? Since the controller and cainjector don't serve any traffic, we cannot test a port for readiness. How would we handle the state where the controller is up but is not the leader? How would we actually verify that the controller is up and working? We don't want to add probes that give a false sense of security. We do have these on the webhook, where they make sense: it is an actual HTTPS endpoint serving traffic, with services relying on it reporting its state so endpoints can be controlled.
I took a look at how other Kubernetes operators do this, and all the ones I use lack these probes as well. Feedback welcome!
We actually ran into this too just now and were surprised that there is no health check.
There is a metrics endpoint available (https://github.com/jetstack/cert-manager/blob/master/deploy/charts/cert-manager/templates/deployment.yaml#L106), so one possibility would be to probe that endpoint to check whether the container is up and running. Quite a few tools do it that way, AFAIK.
In the case we just hit, the metrics endpoint stopped serving traffic when the controller went into some kind of busy loop. We noticed some time later and killed/restarted the pod, which resolved the issue. It would be nice if this were handled by probes instead.
Maybe it's not possible to distinguish between ready and alive, but maybe that's not needed either.
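Probing the metrics endpoint from the deployment spec could be sketched as follows (assuming cert-manager's default metrics port 9402; an illustration, not a chart-supported option):

```yaml
# Sketch: liveness probe against the controller's metrics endpoint.
livenessProbe:
  httpGet:
    path: /metrics
    port: 9402        # default metrics port (assumption)
  initialDelaySeconds: 15
  periodSeconds: 20
  failureThreshold: 3
```

A probe like this would only detect that the HTTP server responds, not that reconciliation is actually making progress.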
Interesting idea. I've never had cert-manager lock up before, so I couldn't probe it. I do wonder, with the recent refactor of our metrics, whether it would still behave the same. Will try to look into it.
/area deploy
/priority important-longterm
This issue is also relevant if you want to use cert-manager in clusters that must be SOC 2 compliant, because those are audited against OWASP, and missing liveness and readiness probes qualifies as a security misconfiguration. So it would be really nice if cert-manager just had some kind of /status endpoint that could be used for those probes.
- https://github.com/jetstack/cert-manager/pull/4133
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to jetstack.
/lifecycle stale
/remove-lifecycle stale
I just had cert-manager stop reconciling for no apparent reason, for the second time. I had to delete and restart the pod; a liveness probe would have caught this incident.
With kubebuilder, we use a function that validates the cache is in sync and register it as a liveness probe in our daemons when starting the managers.
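A manager set up that way in controller-runtime typically serves /healthz and /readyz on its health-probe port (8081 by default), which the pod spec can then point at. A sketch, assuming those defaults:

```yaml
# Sketch: probes against a controller-runtime manager's health endpoints,
# assuming the default health-probe bind address :8081.
livenessProbe:
  httpGet:
    path: /healthz
    port: 8081
readinessProbe:
  httpGet:
    path: /readyz
    port: 8081
```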
/remove-lifecycle stale
Note that probes would likely not replace running something like cmctl check api in CI before applying resources, as you'd still need to verify that the whole system (controller, webhook, and cainjector) functions before resources can be applied.
I can see that probes could be useful if there is sometimes a need to restart the pod. It would be good to understand more about the cases in which this happens, and perhaps to look at how other projects do it, if there are open-source examples.
So this also caught my attention, since this is the last Helm chart in my cluster NOT having probes. Is there anything the community can help with? A simple first example hard-coded into the Helm chart would be very easy to contribute; replacing the cmctl check api functionality, though, is a contribution of a whole different size. Any advice from the maintainers' side on this topic?
Bump.
We are also running into this. The cert-manager controller itself has the metrics port, which can be checked via a TCP probe.
The cainjector, however, does not expose any ports and does not ship any tools usable for checks via an exec probe.
The best approach, in my opinion, would be to add health-check endpoints to all components and probe them.
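For the controller, the TCP probe mentioned above could be sketched like this (assuming the default metrics port 9402; as noted, nothing comparable exists for cainjector today):

```yaml
# Sketch: a TCP readiness probe on the controller's metrics port.
# Only verifies the socket is open, not that reconciliation works.
readinessProbe:
  tcpSocket:
    port: 9402   # default metrics port (assumption)
  periodSeconds: 10
```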
Any news on this request? Liveness and readiness probes are very important for reliability.
There's been further discussion in Slack:
- https://kubernetes.slack.com/archives/CDEQJ0Q8M/p1665603954227279
And in:
- https://github.com/cert-manager/cert-manager/pull/5670#pullrequestreview-1235725002
We are reaching the conclusion that there may be a useful liveness probe that we can add, based on the leader election library:
https://github.com/kubernetes/client-go/blob/v0.26.3/tools/leaderelection/healthzadaptor.go#L25-L36
Nope. This is still desired functionality.
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
Send feedback to jetstack.
/lifecycle rotten
/remove-lifecycle stale
Nope. This one stays open. :)
/remove-lifecycle rotten
~~Not having the /healthz endpoint tied to the leader election means that in case of a cert-manager upgrade, the Prometheus scraper may hit the wrong /metrics endpoint until the upgrade has finished. I don't think the /metrics endpoint should work when the process hasn't been elected.~~
I was wrong: as detailed in the page Use Liveness Probes, cert-manager properly exits if the leader election fails. So I shouldn't worry about /metrics.