[Enhancement] Provide metrics to monitor certificates expiration

Open OuesFa opened this issue 3 years ago • 3 comments

Is your feature request related to a problem? Please describe.

Lack of visibility into the validity period of certificates created by the Cluster and User Operators.

Describe the solution you'd like

Expose metrics to monitor the expiration of the certificates through visualisations and alerts.

OuesFa avatar Oct 07 '20 22:10 OuesFa

Triaged on 12.4.2022: This makes sense for the CAs:

  • Metric can be provided as days until expiration
  • Should be included in the Grafana dashboards and sample alerts

User certificates are not that easy because the User Operator doesn't know whether the certificate is actually used by the client or not. This would be better solved at the client side.

scholzj avatar Apr 12 '22 14:04 scholzj

Thanks for the update. How "difficult" do you think this contribution would be, and how long would it take, compared to a contribution like https://github.com/strimzi/strimzi-kafka-operator/pull/5413, for example? I can work on this if you think it is reasonable for someone who is not that familiar with the operator's code.

OuesFa avatar Apr 12 '22 15:04 OuesFa

I think this is harder than #5413. There are two parts to this:

  1. Getting the actual information about the days until expiration. This might IMHO not be that hard.

  2. Exposing the metrics. I think this will be the hard part, because here you need to:

    • Have some shared metric which will aggregate this for multiple Kafka clusters (since the operator might manage more of them, each with their own CAs and metrics)
    • Make sure to add the metrics for new and existing clusters
    • Make sure to remove the metrics for deleted clusters

    And I think this might not be completely easy.
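The three requirements in the list above can be illustrated with a minimal, language-agnostic sketch, written here in Python with only the standard library. It is not the operator's code: the class `CertExpiryRegistry`, the metric name, and the `(namespace, cluster, ca)` label set are all hypothetical names chosen for illustration. The idea is one shared gauge family holding a value per cluster and CA, updated on every reconciliation and cleared when a cluster goes away.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

# Hypothetical label set: (namespace, cluster name, CA type such as "cluster-ca").
Key = Tuple[str, str, str]


@dataclass
class CertExpiryRegistry:
    """One shared gauge family holding a value per cluster and CA."""

    # Made-up metric name, used only for this sketch.
    metric_name: str = "example_ca_certificate_days_until_expiration"
    _values: Dict[Key, float] = field(default_factory=dict)

    def set_days(self, namespace: str, cluster: str, ca: str, days: float) -> None:
        # Intended to be called on every reconciliation, so that both new
        # and already existing clusters end up with a value.
        self._values[(namespace, cluster, ca)] = days

    def remove_cluster(self, namespace: str, cluster: str) -> int:
        # Drop every series of a deleted cluster; returns how many were removed.
        stale = [k for k in self._values if k[:2] == (namespace, cluster)]
        for key in stale:
            del self._values[key]
        return len(stale)

    def render(self) -> str:
        # Prometheus text exposition format, one line per series.
        lines = [f"# TYPE {self.metric_name} gauge"]
        for (ns, cluster, ca), value in sorted(self._values.items()):
            lines.append(
                f'{self.metric_name}{{namespace="{ns}",cluster="{cluster}",ca="{ca}"}} {value}'
            )
        return "\n".join(lines) + "\n"
```

The part this sketch leaves open is the one called out as hard: deciding when `remove_cluster` gets called.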

Of course, if you wanna look into it, we will try our best to help you.

scholzj avatar Apr 12 '22 15:04 scholzj

@scholzj @maciej-tatarski and I would like to work on this. We have a solution in our company, not directly using Strimzi but built around it, and have some suggestions for dashboards as well.

steffen-karlsson avatar Mar 14 '24 12:03 steffen-karlsson

One suggestion would be to expose the actual epoch of the expiration date instead of days to expiration as it is easier and more flexible to work with.

maciej-tatarski avatar Mar 14 '24 12:03 maciej-tatarski

Can you elaborate a bit more on it?

  • Why do you think the epoch of the actual expiration is more useful? It seems to me that the number of days makes it super easy to evaluate and read. Although I guess in the end you can usually convert between them quite easily if needed.
  • Also, how do you monitor the expirations if not directly using Strimzi? In the past, I thought that the best way to implement this might be through a separate tool called something like strimzi-state-metrics that would provide these additional metrics as some of them are hard to integrate directly into the operators and add a lot of complexity that way.

scholzj avatar Mar 14 '24 13:03 scholzj

@maciej-tatarski will elaborate on the epoch :)

@scholzj Regarding our existing solution, we don't plan to use it as part of this implementation, but rather to decommission it when this is done.

What we have done is implement dashboards and alerts on certs, based on a K8s CronJob that monitors our external secret store and emits metrics that way, because we were missing this.

What we want to do in this solution is to emit the metric when a cluster is created or a secret is updated, and remove it when a cluster is deleted, i.e. in the operator-common and cluster-operator.

steffen-karlsson avatar Mar 14 '24 14:03 steffen-karlsson

I think epoch is better because in Grafana you can easily visualize it as a date or as days to expiry, since it fits the default Grafana time format. Additionally, it gives you more precise data, as it is in seconds.
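As a rough illustration of that argument, the sketch below (plain Python, standard library only) exposes the expiration as epoch seconds and derives both "days to expiry" and a calendar date from it afterwards. The function names are hypothetical, and treating naive timestamps as UTC is an assumption of this sketch, not something stated in the thread.

```python
from datetime import datetime, timezone

SECONDS_PER_DAY = 86_400


def expiry_epoch_seconds(not_after: datetime) -> float:
    """Value to expose: the certificate's notAfter as Unix epoch seconds."""
    if not_after.tzinfo is None:
        # Assumption of this sketch: naive timestamps are already in UTC.
        not_after = not_after.replace(tzinfo=timezone.utc)
    return not_after.timestamp()


def days_until(expiry_epoch: float, now_epoch: float) -> float:
    """Derived at query time; negative once the certificate has expired."""
    return (expiry_epoch - now_epoch) / SECONDS_PER_DAY


def as_date(expiry_epoch: float) -> str:
    """Derived at display time: ISO 8601 expiry date in UTC."""
    return datetime.fromtimestamp(expiry_epoch, tz=timezone.utc).date().isoformat()
```

The same derivation can be done on the query side, for example in PromQL as `(some_expiry_metric - time()) / 86400`, where `some_expiry_metric` is a placeholder name. Going the other way, from a days value back to an exact expiry date, loses precision, which is the flexibility argument made above.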

maciej-tatarski avatar Mar 14 '24 15:03 maciej-tatarski

Ok, I guess that makes sense. We would still need to figure out what would be the best way to expose these metrics. One of the main issues is how to cleanly remove them when the cluster is deleted.

scholzj avatar Mar 14 '24 15:03 scholzj

I see, there are no callbacks or anything on deletion in the cluster-operator currently that we can hook into?

steffen-karlsson avatar Mar 14 '24 15:03 steffen-karlsson

To be honest, I do not remember the details exactly. But in general, the deletion is done by Kubernetes and its garbage collection. It is not always simple to remove the metrics for the deleted resources. But if you don't do it, they usually stay set until the operator restarts.
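If no deletion hook is available, one possible workaround is to prune by comparison: on each periodic pass, diff the clusters that still exist against the clusters that still have metric values and drop the leftovers. The sketch below shows only that diffing step in plain Python. The function name `prune_stale` and the `(namespace, cluster)` key shape are hypothetical, and how the list of live clusters would be obtained inside the operator is deliberately left out, since the thread does not settle it.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical key shape: (namespace, cluster name).
ClusterKey = Tuple[str, str]


def prune_stale(metrics: Dict[ClusterKey, float], live: Iterable[ClusterKey]) -> List[ClusterKey]:
    """Remove gauge values whose cluster no longer exists; return the removed keys."""
    live_set = set(live)
    # Collect first, then delete, so the dict is not mutated while iterating.
    removed = sorted(k for k in metrics if k not in live_set)
    for key in removed:
        del metrics[key]
    return removed
```

With this approach a stale value survives at most one pruning interval, rather than until the operator restarts.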

scholzj avatar Mar 14 '24 15:03 scholzj

Makes sense, @maciej-tatarski and I would gladly give it a go and see if we can come up with anything meaningful :)

steffen-karlsson avatar Mar 14 '24 15:03 steffen-karlsson

Ok, great. That sounds like a plan then.

scholzj avatar Mar 14 '24 17:03 scholzj