cert-manager icon indicating copy to clipboard operation
cert-manager copied to clipboard

Adding probes to the cert-manager pods

Open blame19 opened this issue 5 years ago • 33 comments

Is your feature request related to a problem? Please describe. As part of Kubernetes' best practices, I'd like to set Readiness and Liveness probes on all the containers deployed on my infrastructure. As of now, the charts in cert-manager lack this capability.

Describe the solution you'd like An easy solution would be hard coding the probes to the cert-manager templates (cert-manager and cert-manager-cainjector). A more complete solution would be making those probes customizable via values.yaml.

Describe alternatives you've considered At the moment the only alternative to set up probes would be manually patching the deployments after helm install, which isn't the best practice devops-wise, and unfeasible for automated solutions on a wide array of clusters.

/kind feature

blame19 avatar Jul 17 '20 09:07 blame19

This is an interesting thing to think about. How would we define the readiness and lifeness of a Kubernetes operator? Since the controller and cainjector are not serving any traffic we can not test a port to be ready. Also how would we handle the state if the controller is up but is not the leader. How would we actually verify that the controller is up and working? We don't want to add probes that give a false sense of security. We have these on the webhook where it makes sense as it is an actual HTTPS endpoint serving traffic, which has services relying on it reporting it's state so endpoints can be controlled.

I took a look at how other Kubernetes operators do this and all ones I use also don't have these probes. Feedback welcome!

meyskens avatar Jul 20 '20 08:07 meyskens

We actually ran into this too just now and were surprised that there is no health check.

There is a metrics endpoint available (https://github.com/jetstack/cert-manager/blob/master/deploy/charts/cert-manager/templates/deployment.yaml#L106), so a possibility would be to scrape that endpoint to check if the container is up & running. There's quite a few tools that do it like that afaik.

In the case we just had, the metrics-endpoint stopped serving traffic when the controller went into some kind of busy-loop. We noticed sometime later and killed / restarted the pod, which resolved the issue. It would be nice if this was handled by the probes instead.

Maybe it's not possible to distinct between ready & alive, but maybe that's also not needed.

tommyknows avatar Jul 21 '20 07:07 tommyknows

interesting idea, i never had CM lock up before so couldn't probe it. I do wonder with the recent refactor of our metrics if it would still be the same. Will try to look into it

/area deploy

meyskens avatar Jul 22 '20 15:07 meyskens

/priority important-longterm

hzhou97 avatar Aug 04 '20 13:08 hzhou97

This issue is also valid if you want to use Cert manager in clusters that should be SOC 2 compliant because they're checked according to OWASP and missing liveness and readiness probes qualifies as Security misconfiguration. So it would be really nice if Cert manager just had some kind /status endpoint that could be used for those probes.

zxpower avatar Apr 08 '21 10:04 zxpower

  • https://github.com/jetstack/cert-manager/pull/4133

wallrj avatar Jun 24 '21 11:06 wallrj

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. If this issue is safe to close now please do so with /close. Send feedback to jetstack. /lifecycle stale

jetstack-bot avatar Sep 22 '21 11:09 jetstack-bot

/remove-lifecycle stale

mindw avatar Sep 22 '21 12:09 mindw

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. If this issue is safe to close now please do so with /close. Send feedback to jetstack. /lifecycle stale

jetstack-bot avatar Dec 21 '21 12:12 jetstack-bot

I just had the second cert-manager not reconciling for some reason any more. I had to delete and restart the pod. A liveness probe would have catched this incident.

With kube-builder, we're using a function that validates the cache is in sync and register this as liveness probe in our daemons when starting the managers.

/remove-lifecycle stale

JensErat avatar Jan 18 '22 16:01 JensErat

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. If this issue is safe to close now please do so with /close. Send feedback to jetstack. /lifecycle stale

jetstack-bot avatar Apr 18 '22 16:04 jetstack-bot

/remove-lifecycle stale

mindw avatar Apr 18 '22 17:04 mindw

Note that probes would likely not be a replacement for running something like cmctl check api in CI before applying resources as you'd still need to verify that the whole system (controller, webhook and cainjector) functions before resources can be applied.

I can see that probes could be useful if there sometimes is a need to restart the pod- would be good to understand more in what cases this happens and perhaps look at how other projects do it if there are open source examples.

irbekrm avatar Jun 28 '22 18:06 irbekrm

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. If this issue is safe to close now please do so with /close. Send feedback to jetstack. /lifecycle stale

jetstack-bot avatar Sep 26 '22 19:09 jetstack-bot

/remove-lifecycle stale

mindw avatar Sep 26 '22 20:09 mindw

So this also caught my attention since this is the last Helm Chart in my cluster NOT having probes. Is there anything the community can help with? A simple first example that is coded into the HelmChart is very easy to contribute. Replacing the cmctl check api functionality with this is a whole different size of a contribution though. Any advice from maintainer side to this topic?

SchoolGuy avatar Sep 27 '22 14:09 SchoolGuy

Bump.

We are also encountering this. The cert manager itself has the metrics port that can be checked via tcp probes.

The cainjector however does not expose any ports and does not have usable tools installed to perform checks via an exec probe.

The best approach in my opinion would be to just add health check endpoints to all services and probe them.

Useurmind avatar Dec 12 '22 13:12 Useurmind

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. If this issue is safe to close now please do so with /close. Send feedback to jetstack. /lifecycle stale

jetstack-bot avatar Mar 12 '23 14:03 jetstack-bot

/remove-lifecycle stale

SchoolGuy avatar Mar 12 '23 16:03 SchoolGuy

any news on this request? Liveness and readiness probes are very important for reliability ...

nlamirault avatar Mar 31 '23 16:03 nlamirault

There's been further discussion in Slack:

  • https://kubernetes.slack.com/archives/CDEQJ0Q8M/p1665603954227279

And in:

  • https://github.com/cert-manager/cert-manager/pull/5670#pullrequestreview-1235725002

We are reaching the conclusion that there may be a useful liveness probe that we can add, based on the leader election library:

https://github.com/kubernetes/client-go/blob/v0.26.3/tools/leaderelection/healthzadaptor.go#L25-L36

wallrj avatar Mar 31 '23 16:03 wallrj

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. If this issue is safe to close now please do so with /close. Send feedback to jetstack. /lifecycle stale

jetstack-bot avatar Jun 29 '23 17:06 jetstack-bot

Nope. This is still desired functionality.

SchoolGuy avatar Jun 29 '23 17:06 SchoolGuy

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to jetstack. /lifecycle rotten /remove-lifecycle stale

jetstack-bot avatar Jul 29 '23 17:07 jetstack-bot

Nope. This one stays open. :)

SchoolGuy avatar Jul 30 '23 06:07 SchoolGuy

/remove-lifecycle rotten

mindw avatar Jul 30 '23 08:07 mindw

~Not having the /healthz endpoint tied to the leader election means that in case of a cert-manager upgrade, the Prometheus scraper may hit the wrong /metrics endpoint until the upgrade has finished. I don't think the /metrics endpoint should work when the process hasn't been elected.~

I was wrong: as detailed in the page Use Liveness Probes, cert-manager properly exits if the leader election fails. So I shouldn't worry about /metrics.

maelvls avatar Sep 25 '23 16:09 maelvls

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. If this issue is safe to close now please do so with /close. Send feedback to jetstack. /lifecycle stale

jetstack-bot avatar Dec 25 '23 18:12 jetstack-bot

/remove-lifecycle stale

mindw avatar Dec 26 '23 11:12 mindw

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. If this issue is safe to close now please do so with /close. /lifecycle stale

jetstack-bot avatar Mar 25 '24 11:03 jetstack-bot