strimzi-kafka-operator
[Enhancement] Provide metrics to monitor certificates expiration
Is your feature request related to a problem? Please describe.
Lack of visibility into the validity period of the certificates created by the cluster and user operators.
Describe the solution you'd like
Expose metrics to monitor the expiration of the certificates through visualisations and alerts.
Triaged on 12.4.2022: This makes sense for the CAs:
- Metric can be provided as days until expiration
- Should be included in the Grafana dashboards and sample alerts
User certificates are not that easy because the User Operator doesn't know whether the certificate is actually used by the client or not. This would be better solved at the client side.
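As a rough sketch of the "days until expiration" value above, assuming the certificate's notAfter timestamp is already available (e.g. from X509Certificate.getNotAfter() on the CA certificate stored in the cluster's Secret), the metric value is a simple duration calculation. The class and method names here are illustrative only, not Strimzi code:

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical helper: turns a certificate's notAfter timestamp into the
// "days until expiration" gauge value discussed above.
public class CertExpiry {
    public static long daysUntilExpiration(Instant notAfter, Instant now) {
        return Duration.between(now, notAfter).toDays();
    }

    public static void main(String[] args) {
        Instant now = Instant.parse("2022-04-12T00:00:00Z");
        Instant notAfter = Instant.parse("2023-04-12T00:00:00Z");
        System.out.println(daysUntilExpiration(notAfter, now)); // prints 365
    }
}
```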
Thanks for the update. How "difficult" do you think this contribution would be, and how long would it take, compared to a contribution like this one, for example: https://github.com/strimzi/strimzi-kafka-operator/pull/5413? I can work on this if you think it is reasonable for someone who is not that familiar with the operator's code to do.
I think this is harder than #5413. There are two parts to this:
- Getting the actual information about the days till expiration. This might IMHO not be that hard.
- Exposing the metrics. I think this will be the hard part, because here you need to:
  - Have some shared metric which will aggregate this for multiple Kafka clusters (since the operator might manage more of them, each with their own CAs and metrics)
  - Make sure to add the metrics for new and existing clusters
  - Make sure to remove the metrics for deleted clusters
And I think this might not be completely easy.
Of course, if you want to look into it, we will try our best to help you.
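The add/remove lifecycle described above could be sketched roughly like this, using a plain ConcurrentHashMap as a stand-in for a shared metrics registry (Strimzi's actual metrics code is built on Micrometer; the class and method names here are hypothetical, not Strimzi APIs):

```java
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of a shared, per-cluster metric holder keyed by namespace/cluster.
// A ConcurrentHashMap stands in for a real Micrometer registry with a
// "cluster" tag; this is illustrative only, not Strimzi code.
public class CaExpirationMetrics {
    private final Map<String, Long> expirationEpochSeconds = new ConcurrentHashMap<>();

    // Called on every reconciliation: covers both new and existing clusters.
    public void updateCluster(String namespace, String cluster, Instant caNotAfter) {
        expirationEpochSeconds.put(namespace + "/" + cluster, caNotAfter.getEpochSecond());
    }

    // Called when a cluster is deleted, so the metric does not go stale.
    public void removeCluster(String namespace, String cluster) {
        expirationEpochSeconds.remove(namespace + "/" + cluster);
    }

    public Long get(String namespace, String cluster) {
        return expirationEpochSeconds.get(namespace + "/" + cluster);
    }
}
```

Updating on every reconciliation covers both new and existing clusters, and the explicit remove keeps deleted clusters from leaving stale metric values behind.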
@scholzj @maciej-tatarski and I would like to work on this. We have a solution in our company, not directly using Strimzi but built around it, and we have some suggestions for dashboards as well.
One suggestion would be to expose the actual epoch of the expiration date instead of days to expiration, as it is easier and more flexible to work with.
Can you elaborate a bit more on it?
- Why do you think the epoch of the actual expiration is more useful? It seems to me that the number of days makes it super easy to evaluate and read. Although I guess in the end you can usually convert between them quite easily if needed.
- Also, how do you monitor the expirations if not directly using Strimzi? In the past, I thought that the best way to implement this might be through a separate tool called something like strimzi-state-metrics that would provide these additional metrics, as some of them are hard to integrate directly into the operators and would add a lot of complexity that way.
@maciej-tatarski will elaborate on the epoch :)
@scholzj Regarding our existing solution, we don't plan to use it as part of this implementation, but rather to decommission it when this is done.
What we have done is implement dashboards and alerts on certs based on a K8s CronJob that monitors our external secret store and emits metrics that way, because we were missing this.
What we want to do in this solution is to emit the metric when a cluster is created or a secret is updated, and remove it when a cluster is deleted, i.e. in operator-common and cluster-operator.
I think epoch is better because in Grafana you can easily visualise it as a date or as days to expiry, since it fits the default Grafana time format. Additionally, it gives you more precise data, as it is in seconds.
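For illustration, assuming a hypothetical gauge named strimzi_ca_certificate_expiration_timestamp_seconds that exports the expiration as epoch seconds, the days-to-expiry view can be derived in PromQL:

```promql
# Days until the CA certificate expires (negative once expired)
(strimzi_ca_certificate_expiration_timestamp_seconds - time()) / 86400
```

Grafana can also render the raw epoch value directly as a date using its date/time unit formats, so both views come from the same metric.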
Ok, I guess that makes sense. We would still need to figure out what would be the best way to expose these metrics. One of the main issues is how to cleanly remove them when the cluster is deleted.
I see, so there are no callbacks or anything on deletion in the cluster-operator currently that we can hook into?
To be honest, I do not remember the details exactly. But in general, the deletion is done by Kubernetes and its garbage collection. It is not always simple to remove the metrics for the deleted resources. But if you don't do it, they usually stay set until the operator restarts.
Makes sense, @maciej-tatarski and I would gladly give it a go and see if we can come up with anything meaningful :)
Ok, great. That sounds like a plan then.