cert-manager Cert renewal loop

Describe the bug:

As initially discussed in https://github.com/jetstack/cert-manager/issues/3813, the Vault Issuer clamps duration according to its max_ttl value, which defaults to 30 days. It also backdates notBefore by 30 seconds (https://www.vaultproject.io/api-docs/secret/pki#not_before_duration). With these default settings, this certificate will renew every 30 seconds:

apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: example
spec:
  duration: 1440h   # 60d
  renewBefore: 720h # 30d
  issuerRef:
    kind: Issuer
    name: vault
  secretName: example

What happens is Vault clamps duration to 30 days, then increases it to 30 days and 30 seconds by backdating notBefore. So actual duration becomes 720h0m30s, which is 30s greater than the renewBefore value of 720h0m0s. Since renewBefore is less than actual duration, cert-manager uses this value instead of the 2/3 default:https://github.com/jetstack/cert-manager/blob/2bdda7210dd368cb6f04a2398c6da87524719b54/pkg/controller/certificates/util.go#L304

Expected behaviour:

The intent of the above spec is to have the certificate renewed 50% of the way through its validity period. It definitely should not cause a renewal loop.

The logic in https://github.com/jetstack/cert-manager/pull/4040 prevents these kinds of renewal loops, but it was lost in https://github.com/jetstack/cert-manager/pull/4092 in order to allow users to specify a renewBefore value greater than 2/3 of duration. I have 2 proposals for preventing renewal loops while respecting these constraints:

If renewBefore is provided, scale it by the ratio of actual duration to duration. In the above example, this would result in the certificate being renewed every 15 days instead of every 30 seconds. The same renewal loop could still occur if a user set duration=720h and renewBefore=719h59m30s, but it wouldn't occur by accident.
Enforce a minimum interval between renewals to prevent renewal loops. A sane default minimum for most use cases might be 1h, but this could be a configuration flag. This would prevent the hot renewal loops, but renewals would still be occurring much more frequently than the user had intended.

Implementing both proposals would provide the benefits of each.

Environment details::

cert-manager version: v1.4.2 (the relevant logic is unchanged in v1.5)

/kind bug

Sep 01 '21 23:09 cbroglie

/kind bug

Oct 07 '21 07:10 cbroglie

Hi, thanks for the clear and detailed issue!

I also find it confusing; for me, the confusion comes from the fact that the "actual duration" is unknown when I choose a value for renewBefore, meaning I can end up with a hot loop without realizing it.

For example, when I use the Venafi Issuer with a Windows Certificate Authority template that "clamps" the duration to 2 years (which I didn't know beforehand), I choose the renewBefore value depending on the duration I give. For example, with a 2 months duration, I'd like the renew to happen 30 days before expiry:

kind: Certificate
spec:
  duration: 2160h   # 90d
  renewBefore: 720h # 30d

But if I knew that the "actual" duration would be 2 years, I would much rather have set the renewBefore value to 6 months!

kind: Certificate
spec:
  duration: 2160h   # 90d (ignored)
  renewBefore: 720h # 30d
status:
  notAfter: 2023-10-07 # 2 years from now
  renewalTime: 2023-09-07 # 30 days before expiry

After realizing that the actual duration is 2 years from now, I want to edit my Certificate and set renewBefore to 6 months instead:

kind: Certificate
spec:
  duration: 2160h    # 90d (ignored)
  renewBefore: 4320h # 6 months
status:
  notAfter: 2023-10-07 # 2 years from now
  renewalTime: 2023-04-07 # 6 months before expiry

Sometimes I wish the renewBefore value didn'nt exist, and that instead we would have a renewAtPercent. Since we (mostly) never know what is going to be the "actual" duration, it would be so much easier to say "I want the renewal to happen at 70% of the actual duration". But since we probably don't want to add new fields to the Certificate resource, I guess we have to stick to renewBefore, and maybe slightly change the behavior to avoid hot loops.

If renewBefore is provided, scale it by the ratio of actual duration to duration.

I am afraid this may create an unexpected/surprising behavior to users of the API, adding to the confusion around the use of renewBefore. 😞

Maybe think this issue is more of a UX issue than a bug? I.e., the renewal loop is surprising when thinking that duration will be honored but is not honored.

Oct 07 '21 10:10 maelvls

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. If this issue is safe to close now please do so with /close. Send feedback to jetstack. /lifecycle stale

Jan 05 '22 11:01 jetstack-bot

/remove-lifecycle stale

Jan 05 '22 19:01 cbroglie

Apologies on the delay @maelvls, I missed your response initially.

Sometimes I wish the renewBefore value didn'nt exist, and that instead we would have a renewAtPercent. Since we (mostly) never know what is going to be the "actual" duration, it would be so much easier to say "I want the renewal to happen at 70% of the actual duration". But since we probably don't want to add new fields to the Certificate resource, I guess we have to stick to renewBefore

Would you be open to making renewBefore accept either a percent or a duration? Similar to the maxUnavailable and maxSurge fields for deployments?

Maybe think this issue is more of a UX issue than a bug? I.e., the renewal loop is surprising when thinking that duration will be honored but is not honored.

I don't disagree per se, but as operators of cert-manager in a multi-tenant cluster we need some means of technical enforcement to prevent renewal loops. My proposal to optionally enforce a minimum interval between renewals (e.g. make it a parameter that defaults to 0s) could address this without changing any other semantics.

Jan 05 '22 19:01 cbroglie

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. If this issue is safe to close now please do so with /close. Send feedback to jetstack. /lifecycle stale

Apr 05 '22 19:04 jetstack-bot

/remove-lifecycle stale

Apr 12 '22 05:04 cbroglie

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. If this issue is safe to close now please do so with /close. Send feedback to jetstack. /lifecycle stale

Jul 11 '22 06:07 jetstack-bot

/remove-lifecycle stale

Jul 12 '22 05:07 cbroglie

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. If this issue is safe to close now please do so with /close. Send feedback to jetstack. /lifecycle stale

Oct 10 '22 06:10 jetstack-bot

/remove-lifecycle stale

Oct 10 '22 17:10 cbroglie

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. If this issue is safe to close now please do so with /close. Send feedback to jetstack. /lifecycle stale

Jan 08 '23 17:01 jetstack-bot

/remove-lifecycle stale

Jan 08 '23 18:01 cbroglie

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. If this issue is safe to close now please do so with /close. Send feedback to jetstack. /lifecycle stale

Apr 08 '23 18:04 jetstack-bot

/remove-lifecycle stale

Apr 10 '23 20:04 cbroglie

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. If this issue is safe to close now please do so with /close. Send feedback to jetstack. /lifecycle stale

Jul 09 '23 20:07 jetstack-bot

/remove-lifecycle stale

Jul 10 '23 18:07 cbroglie

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. If this issue is safe to close now please do so with /close. Send feedback to jetstack. /lifecycle stale

Oct 08 '23 19:10 jetstack-bot

/remove-lifecycle stale

Oct 08 '23 19:10 terinjokes

Would you be open to making renewBefore accept either a percent or a duration? Similar to the maxUnavailable and maxSurge fields for deployments?

@cbroglie For the record, I think that is a great idea. And we would gladly accept a PR like that.

Dec 18 '23 14:12 inteon

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. If this issue is safe to close now please do so with /close. /lifecycle stale

Mar 17 '24 15:03 jetstack-bot

/remove-lifecycle stale

Mar 18 '24 06:03 inteon

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. If this issue is safe to close now please do so with /close. /lifecycle stale

Jun 16 '24 06:06 cert-manager-bot

/remove-lifecycle stale

Jun 21 '24 06:06 cbroglie

@cbroglie A pre-release is now available containing the fix for this issue. Please test and feedback if you have time.

https://github.com/cert-manager/cert-manager/releases/tag/v1.16.0-alpha.0

Jul 24 '24 16:07 wallrj

@wallrj I can give this a shot first thing next week and report back

Aug 01 '24 19:08 cbroglie

@wallrj v1.16.0-alpha.0 + renewBeforePercentage seems to be working well in my testing 👍

Aug 06 '24 16:08 cbroglie

Still need to document this new feature. If you have time please submit a PR against the cert-manager/website release-next branch:

https://release-next--cert-manager.netlify.app/docs/releases/release-notes/release-notes-1.16#themes
https://release-next--cert-manager.netlify.app/docs/usage/certificate/#reissuance-triggered-by-expiry-renewal

Aug 06 '24 16:08 wallrj

Opened https://github.com/cert-manager/website/pull/1551

Sep 09 '24 02:09 cbroglie

A beta-release is now available which contains the fix for this issue. Please test and feedback if you have time.

https://github.com/cert-manager/cert-manager/releases/tag/v1.16.0-beta.0

Sep 26 '24 19:09 wallrj