Configurable pkiCert rendering interval

Open fitz123 opened this issue 3 years ago • 4 comments

Consul Template version

Vault v1.11.3 (17250b25303c6418c283c95b1d5a9c9f16174fe8), built 2022-08-26T10:27:10Z

Configuration

template {
  source = "/etc/vault.d/templates/dynamic-cert-chained.tpl"
  destination = "/run/vault-agent/approle/fullchain.pem"
  command = "sudo /usr/sbin/nginx -s reload || sudo /bin/systemctl start nginx"
  perms = 0640
}
{{ with pkiCert "pki/issue/approle" "common_name=approle.domain.com" "ttl=1176h" -}}
{{ .Cert -}}
{{ with secret "pki/cert/ca_chain" -}}
{{ .Data.certificate }}
{{- end }}{{ if .Key -}}
{{ .Key | writeToFile "/run/vault-agent/approle/privkey.pem" "vault" "bin" "0640" -}}
{{ end -}}
{{ end -}}

Expected behavior

Ability to set the rendering interval for PKI certificates, similar to static_secret_render_interval but for the pkiCert templating function. We want to renew the certificate when 15% of the TTL has elapsed.

Actual behavior

Currently vault-agent (consul-template) re-issues certificates when 85% of the secret's time-to-live (TTL) is reached, and this is not configurable.

References

  • GH-1234
  • https://github.com/hashicorp/vault/issues/17306

fitz123 avatar Sep 26 '22 16:09 fitz123

Hey @fitz123, thanks for taking the time to file this.

If you have a moment, would you mind explaining the use case for renewing at 15% of TTL? Not that I doubt your need or anything, it just helps me to understand the use cases so I can take them into account going forward. Thanks!

eikenb avatar Sep 26 '22 18:09 eikenb

@eikenb I think they elaborated more on the Vault issue, but they have a 7-day internal SLA on resolving Vault outages right now; Vault may be down for 6.999 days or so, but no more (theoretically).

They want to drop the certificate lifetime to be shorter, but in order to satisfy that SLA with the default 85% window, they need the cert lifetime to be at least 7/(1-0.85) ≈ 46.7 days, so that the time remaining at the 85% renewal point is greater than 7 days (and thus, not risk Vault being down). Dropping to 50% would allow, say, a 15-day cert to be issued (while still having that window be greater than 7 days), and 15% would allow, say, an 8-10 day cert (I believe).

cipherboy avatar Sep 26 '22 19:09 cipherboy

Thanks for the explanation! I should have read the Vault ticket, as it does lay out their use case. Basically they need certs to always have 7+ days left on them so they can deal with Vault's 7-day SLA. Using 85% would mean they need roughly 46-day TTLs to have that 7-day buffer.

One wrinkle that will probably come up... the TTL checking code currently has a minimum duration of 10% of the TTL. Where it compares 90% of the duration left on the TTL to 10% of the lifetime TTL and gets a new one if <10% of the lifetime. This is to keep it from using TTLs with only a very short time left. This logic would obviously need to change and I'm looking for feedback. It could be something like 10% of the configured % or maybe some fixed amount (eg. <1min). I don't want 0 as you can't manage jitter with a very low duration and we need to avoid thundering herd problems.

Any thoughts on this would be helpful when we get to implementation.

Thanks again for the explanation.

eikenb avatar Sep 26 '22 20:09 eikenb

There are two possibilities that make sense to me: a fixed amount of time and a % of the lifetime TTL.

  1. Fixed amount has to be limited :

    • lower bound to handle jitter, network nightmares and thundering herd, as you said: 30 seconds or 1 minute seems reasonable to me as the lowest possible value, maybe a bit longer for scaling. 5 minutes would be fair too if scaling issues arise.
    • upper bound to handle a configuration exceeding the lifetime TTL, to avoid a service outage. I can't find a case where it would be pertinent. I don't know if it is possible to ensure the upper bound does not exceed 90% of the -max-ttl configured on the issuer; I wouldn't bet on this ability.
  2. Relative to the lifetime TTL would be my preferred choice but it's personal :

    • providing a valid range from 10% to 90% would cover all real problems if the lifetime TTL isn't too far-fetched (e.g. a 30-second lifetime for a cert is funny but not very usable, in my opinion)

So if we group all of this, I think we have a decent rule: the longer of (30 sec or 1 min or 5 min) and a configurable % of the lifetime TTL.

Malshtur avatar Aug 04 '23 22:08 Malshtur