Add metric to monitor expiration dates of certificates in Lemur
Metric(s) which monitor the seconds until each certificate expires would be useful for quickly answering questions like "which certificates are expiring in 3 months?" and for building alert rules that detect, for example, deployed certificates that are about to expire.
At a high level, this could potentially be implemented as another periodic task, unless there are other recommended ways of doing so. Ideally, such a metric would provide tags to (1) quickly filter down on specific certificates based on name and (2) quickly filter down on certificates which have endpoints associated with them versus ones that do not.
Relates to #3870.
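To make the proposal concrete, here is a minimal sketch of what one data point of such a metric could look like. It is not based on Lemur's code: the `CertInfo` class, the metric name `certificate.seconds_until_expiry`, and the tag names are all hypothetical and only illustrate the two tags described above.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class CertInfo:
    # Hypothetical stand-in for a certificate record; not Lemur's model.
    name: str
    not_after: datetime
    endpoint_count: int


def expiry_metric(cert, now):
    """Return (metric_name, value, tags) for one certificate."""
    # Negative values mean the certificate has already expired.
    seconds_left = (cert.not_after - now).total_seconds()
    tags = {
        "name": cert.name,
        # Lets dashboards split deployed certs from undeployed ones.
        "has_endpoints": str(cert.endpoint_count > 0).lower(),
    }
    # The metric name is an assumption made for this sketch.
    return ("certificate.seconds_until_expiry", seconds_left, tags)


now = datetime(2024, 1, 1, tzinfo=timezone.utc)
cert = CertInfo("example.com", datetime(2024, 1, 31, tzinfo=timezone.utc), 2)
metric = expiry_metric(cert, now)
```

A periodic task could call something like `expiry_metric` for each certificate and hand the result to whatever metrics backend is configured. An alert rule for "expiring in 3 months" would then be a threshold on the value, filtered by the `has_endpoints` tag.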
Hi @bobmshannon, thanks for initiating this conversation.
Summarizing the mechanisms we have at hand today, and which we could expand or revise:
a) the DASHBOARD page of the Lemur UI gives some high-level statistics, one of them about expiring certs.
https://github.com/Netflix/lemur/blob/master/lemur/certificates/service.py#L824-L833
The Dashboard could certainly benefit from some updates to make it more useful, for instance, being able to control the expiration window, and potentially new statistics.
The dashboard won't, however, show the individual expiring certs.
b) expiration notification: This is currently set to 30 days, and is usually an opportunity to alert the individual teams or the secops team https://github.com/Netflix/lemur/blob/dbc60e749d4cca2d4ab2ef42d8b57563ee738fa2/docker/src/lemur.conf.py#L125-L138
To make it easier for the security team to follow all the expirations, we also created a summary of upcoming expirations https://github.com/Netflix/lemur/blob/dbc60e749d4cca2d4ab2ef42d8b57563ee738fa2/docker/src/lemur.conf.py#L139-L144
c) Notify_expiring_deployed_certificates: Lemur can also detect and alert when an expiring cert is still deployed. This, however, only works for non-wildcard certs, and only when common ports are used.
https://github.com/Netflix/lemur/blob/7a80ae629b86a396887e89132ee5a00c6ffc2ecc/lemur/common/celery.py#L967
Regarding certificates which have endpoints associated with them, there is a celery job to enable auto-rotation by default. We also want to simplify that, and enable auto-rotate when an endpoint with a certificate is discovered.
Thanks for the overview @hosseinsh. These are all great notification mechanisms when used within the right context.
I think the value provided by augmenting these mechanisms with metrics is centered around two major use-cases:
(1) a flexible way to visualize certificate expiration information in other dashboard tools, so that it can be consolidated in a single view with other infrastructure metrics
(2) a convenient way to set up alerts via whatever monitoring system is in use, so one does not need to rely solely on e-mail notifications
I'm not sure if that resonates at all as something which could be useful for other users of Lemur as well. I do realize, however, that the existing notification mechanisms might cater to different use-cases or organizational structures for how PKI infra is managed.
I think one of the trickier bits in getting such a metric right is keeping track of which certificates are associated with an endpoint or have been replaced. For example, a user might not want to monitor the expiration date of a certificate which is not associated with an endpoint, or of one which has already been re-issued or replaced.
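As a rough sketch of that filtering, assuming each certificate record exposes an endpoint count and a "replaced" flag (the `CertRecord` class, its fields, and `should_monitor` are hypothetical names for illustration, not Lemur's model):

```python
from dataclasses import dataclass


@dataclass
class CertRecord:
    # Hypothetical fields; a real implementation would derive these
    # from the certificate's endpoint and replacement relationships.
    name: str
    endpoint_count: int
    replaced: bool


def should_monitor(cert, require_endpoint=True, skip_replaced=True):
    """Decide whether a certificate's expiry is worth reporting."""
    if require_endpoint and cert.endpoint_count == 0:
        return False  # not deployed anywhere we know of
    if skip_replaced and cert.replaced:
        return False  # already re-issued or replaced
    return True


certs = [
    CertRecord("a.example.com", endpoint_count=2, replaced=False),
    CertRecord("b.example.com", endpoint_count=0, replaced=False),
    CertRecord("c.example.com", endpoint_count=1, replaced=True),
]
monitored = [c.name for c in certs if should_monitor(c)]
```

Both checks are flags rather than hard-coded rules, since different users will likely disagree about which certificates count as noise.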
Plus one on keeping the noise low, especially when no action is required, for instance when the cert is already rotated on an endpoint. One common case we come across is when there is no endpoint attached but the cert is rotated; in such settings manual action might still be required, hence the notification or alerting.
I think both use-cases (1) and (2) would make sense. The expiry notification task runs daily, and could be augmented to generate the required metrics, if that is not already the case. https://github.com/Netflix/lemur/blob/deb16d2edf97aa66cee67e13ed4bd5376ab21ad8/lemur/notifications/messaging.py#L372-L377