FR: Prometheus metrics endpoint

Open ohaiwalt opened this issue 7 years ago • 6 comments

Hi,

I just came across an instance where a DNS misconfiguration caused me to be rate limited after an Ingress change, and it would be really useful to be able to scrape the errors on kube-lego with Prometheus to catch these things sooner rather than later.

Is that in progress somewhere, or would there be interest in having it contributed?

ohaiwalt avatar May 02 '17 18:05 ohaiwalt

This is a very good idea. Happy to accept contributions. Maybe you could outline possible metrics first, and then we can discuss them before implementation.

simonswine avatar May 07 '17 16:05 simonswine

Definitely! After a bit of investigation, I think we could start with:

  • a gauge that reports time of last check for certificates
  • a counter of all requests to Let's Encrypt
  • a counter of all completed requests to Let's Encrypt
  • a counter of successfully requested certificates, which (in theory) could allow figuring out where you are in relation to the rate limit

Adding a label representing the reason for the request (renewal, new certificate, etc) would be great too, but I haven't delved far enough in to gauge difficulty.

I had hoped to be able to add a label for specific error conditions on completed requests, but it appears those are only exposed as strings returned from Let's Encrypt, and matching on those strings could be very brittle. Any thoughts there?
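One possible way to avoid matching on free-form messages, sketched below, is to key the label on the machine-readable error `type` that ACME problem documents carry, and to collapse everything unrecognized into a single bucket. This assumes the error type URN is available to the caller, which may not be the case in kube-lego's code. The function name, the allow-list, and the `other` bucket are hypothetical; the listed kinds are a subset of the error types defined in RFC 8555.

```go
package main

import (
	"fmt"
	"strings"
)

// knownACMEErrors is a hypothetical allow-list of error kinds worth
// their own label value. Keeping it short bounds label cardinality.
var knownACMEErrors = map[string]bool{
	"rateLimited":    true,
	"dns":            true,
	"connection":     true,
	"unauthorized":   true,
	"malformed":      true,
	"serverInternal": true,
	"caa":            true,
	"tls":            true,
}

// errorLabel maps an ACME error type URN, for example
// "urn:ietf:params:acme:error:rateLimited", to a label value.
// Only the part after the last colon is inspected, so the URN
// prefix does not matter. Anything unrecognized becomes "other".
func errorLabel(acmeType string) string {
	kind := acmeType[strings.LastIndex(acmeType, ":")+1:]
	if knownACMEErrors[kind] {
		return kind
	}
	return "other"
}

func main() {
	fmt.Println(errorLabel("urn:ietf:params:acme:error:rateLimited"))
	fmt.Println(errorLabel("some unexpected message"))
}
```

Because the human-readable detail string is never inspected, changes to Let's Encrypt's wording would not affect the label values.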

ohaiwalt avatar May 11 '17 21:05 ohaiwalt

Sorry it took quite a while to come back on this. I think those are all valid metrics.

I was thinking maybe you add a gauge for the expiry date of the checked certificates. This is checked anyhow and it would be a good idea to have alerts on expiring certs.

simonswine avatar May 18 '17 10:05 simonswine

No worries! I'm also pulled in a number of directions right now. I'll get to this shortly, but if anyone else wants to grab it, feel free.

Regarding the gauge for expiry date: do you have any idea what the upper bound is on the number of certificates requested? It seems like this could have high cardinality in some scenarios.

ohaiwalt avatar May 27 '17 02:05 ohaiwalt

Just had a problem with rate limiting due to misconfiguration of DNS (too many failed requests). It would be nice to have a metric on failed attempts which I could get an alert on.

hannson avatar Jun 02 '17 00:06 hannson

I took a stab at this and added some basic metrics: https://github.com/jetstack/kube-lego/pull/231

arbarlow avatar Jul 17 '17 16:07 arbarlow