kube-lego
FR: Prometheus metrics endpoint
Hi,
I just came across an instance where a DNS misconfiguration caused me to be rate limited after an Ingress change. It would be really useful to be able to scrape kube-lego's errors with Prometheus to catch these things sooner rather than later.
Is that in progress somewhere, or would there be interest in having it contributed?
This is a very good idea. Happy to accept contributions. Maybe you could outline possible metrics first, so we can discuss them before implementation.
Definitely! After a bit of investigation, I think we could start with:
- a gauge that reports time of last check for certificates
- a counter of all requests to Let's Encrypt
- a counter of all completed requests to Let's Encrypt
- a counter of successfully requested certificates, which (in theory) could allow figuring out where you are in relation to the rate limit
Adding a label representing the reason for the request (renewal, new certificate, etc) would be great too, but I haven't delved far enough in to gauge difficulty.
I had hoped to be able to add a label with the specific error condition on completed requests, but it appears those errors are only surfaced as strings returned from Let's Encrypt, and matching on them could be very brittle. Any thoughts there?
Sorry it took quite a while to come back on this. I think those are all valid metrics.
I was thinking you could also add a gauge for the expiry date of the checked certificates. This is checked anyway, and it would be a good idea to have alerts on expiring certs.
No worries! I'm also pulled in a number of directions right now. I'll get to this shortly, but if anyone else wants to grab it feel free.
Regarding the gauge for expiry date: Do you have any idea what the upper bound is on number of certificates requested? It seems like this could have a high cardinality in some scenarios.
Just had a problem with rate limiting due to misconfiguration of DNS (too many failed requests). It would be nice to have a metric on failed attempts which I could get an alert on.
I took a stab at this and added some basic metrics: https://github.com/jetstack/kube-lego/pull/231