Cert renewal loop
Describe the bug:
As initially discussed in https://github.com/jetstack/cert-manager/issues/3813, the Vault Issuer clamps duration according to its max_ttl value, which defaults to 30 days. It also backdates notBefore by 30 seconds (https://www.vaultproject.io/api-docs/secret/pki#not_before_duration). With these default settings, this certificate will renew every 30 seconds:
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
name: example
spec:
duration: 1440h # 60d
renewBefore: 720h # 30d
issuerRef:
kind: Issuer
name: vault
secretName: example
What happens is Vault clamps duration to 30 days, then increases it to 30 days and 30 seconds by backdating notBefore. So actual duration becomes 720h0m30s, which is 30s greater than the renewBefore value of 720h0m0s. Since renewBefore is less than actual duration, cert-manager uses this value instead of the 2/3 default:https://github.com/jetstack/cert-manager/blob/2bdda7210dd368cb6f04a2398c6da87524719b54/pkg/controller/certificates/util.go#L304
Expected behaviour:
The intent of the above spec is to have the certificate renewed 50% of the way through its validity period. It definitely should not cause a renewal loop.
The logic in https://github.com/jetstack/cert-manager/pull/4040 prevents these kinds of renewal loops, but it was lost in https://github.com/jetstack/cert-manager/pull/4092 in order to allow users to specify a renewBefore value greater than 2/3 of duration. I have 2 proposals for preventing renewal loops while respecting these constraints:
- If renewBefore is provided, scale it by the ratio of actual duration to duration. In the above example, this would result in the certificate being renewed every 15 days instead of every 30 seconds. The same renewal loop could still occur if a user set
duration=720handrenewBefore=719h59m30s, but it wouldn't occur by accident. - Enforce a minimum interval between renewals to prevent renewal loops. A sane default minimum for most use cases might be 1h, but this could be a configuration flag. This would prevent the hot renewal loops, but renewals would still be occurring much more frequently than the user had intended.
Implementing both proposals would provide the benefits of each.
Environment details::
- cert-manager version: v1.4.2 (the relevant logic is unchanged in v1.5)
/kind bug
/kind bug
Hi, thanks for the clear and detailed issue!
I also find it confusing; for me, the confusion comes from the fact that the "actual duration" is unknown when I choose a value for renewBefore, meaning I can end up with a hot loop without realizing it.
For example, when I use the Venafi Issuer with a Windows Certificate Authority template that "clamps" the duration to 2 years (which I didn't know beforehand), I choose the renewBefore value depending on the duration I give. For example, with a 2 months duration, I'd like the renew to happen 30 days before expiry:
kind: Certificate
spec:
duration: 2160h # 90d
renewBefore: 720h # 30d
But if I knew that the "actual" duration would be 2 years, I would much rather have set the renewBefore value to 6 months!
kind: Certificate
spec:
duration: 2160h # 90d (ignored)
renewBefore: 720h # 30d
status:
notAfter: 2023-10-07 # 2 years from now
renewalTime: 2023-09-07 # 30 days before expiry
After realizing that the actual duration is 2 years from now, I want to edit my Certificate and set renewBefore to 6 months instead:
kind: Certificate
spec:
duration: 2160h # 90d (ignored)
renewBefore: 4320h # 6 months
status:
notAfter: 2023-10-07 # 2 years from now
renewalTime: 2023-04-07 # 6 months before expiry
Sometimes I wish the renewBefore value didn'nt exist, and that instead we would have a renewAtPercent. Since we (mostly) never know what is going to be the "actual" duration, it would be so much easier to say "I want the renewal to happen at 70% of the actual duration". But since we probably don't want to add new fields to the Certificate resource, I guess we have to stick to renewBefore, and maybe slightly change the behavior to avoid hot loops.
If
renewBeforeis provided, scale it by the ratio of actual duration to duration.
I am afraid this may create an unexpected/surprising behavior to users of the API, adding to the confusion around the use of renewBefore. 😞
Maybe think this issue is more of a UX issue than a bug? I.e., the renewal loop is surprising when thinking that duration will be honored but is not honored.
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to jetstack.
/lifecycle stale
/remove-lifecycle stale
Apologies on the delay @maelvls, I missed your response initially.
Sometimes I wish the renewBefore value didn'nt exist, and that instead we would have a renewAtPercent. Since we (mostly) never know what is going to be the "actual" duration, it would be so much easier to say "I want the renewal to happen at 70% of the actual duration". But since we probably don't want to add new fields to the Certificate resource, I guess we have to stick to renewBefore
Would you be open to making renewBefore accept either a percent or a duration? Similar to the maxUnavailable and maxSurge fields for deployments?
Maybe think this issue is more of a UX issue than a bug? I.e., the renewal loop is surprising when thinking that duration will be honored but is not honored.
I don't disagree per se, but as operators of cert-manager in a multi-tenant cluster we need some means of technical enforcement to prevent renewal loops. My proposal to optionally enforce a minimum interval between renewals (e.g. make it a parameter that defaults to 0s) could address this without changing any other semantics.
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to jetstack.
/lifecycle stale
/remove-lifecycle stale
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to jetstack.
/lifecycle stale
/remove-lifecycle stale
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to jetstack.
/lifecycle stale
/remove-lifecycle stale
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to jetstack.
/lifecycle stale
/remove-lifecycle stale
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to jetstack.
/lifecycle stale
/remove-lifecycle stale
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to jetstack.
/lifecycle stale
/remove-lifecycle stale
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to jetstack.
/lifecycle stale
/remove-lifecycle stale
Would you be open to making
renewBeforeaccept either a percent or a duration? Similar to themaxUnavailableandmaxSurgefields for deployments?
@cbroglie For the record, I think that is a great idea. And we would gladly accept a PR like that.
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
/lifecycle stale
/remove-lifecycle stale
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
/lifecycle stale
/remove-lifecycle stale
@cbroglie A pre-release is now available containing the fix for this issue. Please test and feedback if you have time.
- https://github.com/cert-manager/cert-manager/releases/tag/v1.16.0-alpha.0
@wallrj I can give this a shot first thing next week and report back
@wallrj v1.16.0-alpha.0 + renewBeforePercentage seems to be working well in my testing 👍
Still need to document this new feature. If you have time please submit a PR against the cert-manager/website release-next branch:
- https://release-next--cert-manager.netlify.app/docs/releases/release-notes/release-notes-1.16#themes
- https://release-next--cert-manager.netlify.app/docs/usage/certificate/#reissuance-triggered-by-expiry-renewal
Opened https://github.com/cert-manager/website/pull/1551
A beta-release is now available which contains the fix for this issue. Please test and feedback if you have time.
- https://github.com/cert-manager/cert-manager/releases/tag/v1.16.0-beta.0