Documentation on trust CA rotation missing

Open nweedon-u opened this issue 2 years ago • 2 comments

Documentation and best practices on rotating the trust root CA used by spire-server are missing. Documentation on this matter would be great for ongoing maintenance and avoiding downtime.

@azdagron and I have already discussed parts of this in Slack: https://spiffe.slack.com/archives/CBNCC2V17/p1652786614071339

nweedon-u • May 18 '22 12:05

Slack messages will be auto-removed soon. I just found that great conversation, so it is better to save a copy here before Slack removes it.

Niall Weedon May 17th at 4:23 AM
Hey all, would someone be able to explain to me if there’s anything special I’d need to do to roll servers/agents onto an intermediate signed with a new root upstream CA? The current setup is as follows:

- Single trust domain in high-availability mode
- KeyManager = memory
- UpstreamAuthority = vault
- Root CA -> Intermediate -> Intermediate spire-server uses for signing SVIDs

I’m comfortable with what needs to happen when I need to roll the intermediates, as the certificates are set up with the typical certificate chain-of-trust. Something that’s unclear to me though is what (if anything) needs to be done when I roll the root CA and create a new intermediate based off of it? Will SVID generation on the new root and current SVIDs on the old root interact with each other as they normally would if I just had the one root?

As further clarification, this question just relates to SVIDs and ongoing workloads and not node attestation. As far as I understand for initial agent bootstrapping, as long as the initial trust bundle has the new CA in it, a new root CA will work just fine. Thanks! :slightly_smiling_face:

Niall Weedon 1 month ago
If I’m reading this correctly, would the suggestion be to move to KeyManager = disk and restart spire-server to point to the upstream that has the certificate chain with the new root? Is this then enough to be confident that the new root will be distributed before being used, given the old intermediates will be read from disk on the restart of the server?

Andrew Harding 1 month ago
Yes, if SPIRE is configured with a persistent KeyManager then this should, in general, just work. SPIRE's active intermediate CA will be chained to the old root. When the lifetime of that CA hits the preparation threshold, SPIRE will prepare a new intermediate CA using the new root. The new root will be pushed into the trust bundle and disseminated to all agents well in advance of it being activated. As long as your bootstrap bundles are kept somewhat current (and sourced from SPIRE server), then new agents should be able to come online just fine.

That being said, there is still a window where this kind of configuration could be problematic: when a new SPIRE server is turned up after the new root CA is configured upstream, but before any other SPIRE server has prepared a new intermediate. In this instance, since no preparation has happened, the bundle will not have the new root yet. The new SPIRE server will prepare and activate an intermediate signed with the new root and push it into the bundle; however, it will take time for the new bundle to disseminate. Any SVIDs minted by that SPIRE server cannot be authenticated until relying parties receive the bundle update. This includes the server's own SVID.

You can mitigate this situation by turning up the new SPIRE server but not putting it into rotation for a period of time, to give the bundle time to disseminate.
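The window described above can be illustrated with a toy model. This is only a sketch of the reasoning, not SPIRE code: `RelyingParty`, `sync`, `trusts`, and the root names `old-root` and `new-root` are all hypothetical, and a trust bundle is reduced to a plain set of root identifiers.

```python
from dataclasses import dataclass, field


@dataclass
class RelyingParty:
    """Hypothetical agent/workload holding a local copy of the trust bundle."""
    bundle: set = field(default_factory=lambda: {"old-root"})

    def sync(self, server_bundle):
        # Stand-in for the periodic agent sync that pulls the current bundle.
        self.bundle = set(server_bundle)

    def trusts(self, svid_root):
        # An SVID can be authenticated only if its root is in the local bundle.
        return svid_root in self.bundle


server_bundle = {"old-root"}
agent = RelyingParty()

# A freshly started server prepares AND activates an intermediate under the
# new root at once, so the new root lands in the bundle only now.
server_bundle.add("new-root")

before_sync = agent.trusts("new-root")  # the agent has not synced yet
agent.sync(server_bundle)
after_sync = agent.trusts("new-root")   # the bundle update has arrived
still_old = agent.trusts("old-root")    # SVIDs on the old root keep working

print(before_sync, after_sync, still_old)
```

Keeping the new server out of rotation, as suggested, amounts to making sure every relying party has passed the `sync` step before it sees an SVID chained to the new root.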

Niall Weedon 1 month ago
Brilliant, thanks so much for the detailed response @azdagron. We’re currently configuring our agents to pull the initial trust bundle from a URL with the trust_bundle_url configuration, so new agents will pick up the new CA root as soon as it becomes available in the upstream (we will prepare the new CA in the upstream before it’s configured to be used in spire-server). A few follow-up questions:

- The preparation threshold is set (roughly) by the ca_ttl setting, correct?
- Is there any way for us (via metrics or at least notionally) to know when the new intermediates have been disseminated to all agents?
- What is the best practice to know which intermediate is the currently active one per server?
- Finally, is this (CA rotation) documented anywhere on https://spiffe.io/? The only reason I asked here is because I had trouble finding anything regarding this there.

Apologies for all the questions! :sweat_smile: Once again, thanks for the support! :slightly_smiling_face:

Andrew Harding 1 month ago
No worries! Questions welcome, anytime. That's what we're here for :slightly_smiling_face:

> The preparation threshold is set (roughly) by the ca_ttl setting, correct?

The ca_ttl setting controls the lifetime of the intermediate (and in the case of an UpstreamAuthority, the preferred TTL that we request, which the UpstreamAuthority is welcome to ignore). Preparation of a new intermediate CA happens when the current active intermediate CA is within 1/2 of its total lifetime (or 30 days, whichever is smaller). Activation of the prepared intermediate CA (i.e. when the CA will be used for signing) happens when the current active intermediate CA is within 1/6 of its total lifetime (or 7 days, whichever is smaller).

> Is there any way for us (via metrics or at least notionally) to know when the new intermediates have been disseminated to all agents?

Kind of. There are metrics to know when preparation has occurred. Unfortunately, there isn't really a way for SPIRE server to know about "all agents", since it cannot distinguish between an agent that has been torn down vs. one that is offline for whatever reason. SPIRE agent sync interval is by default 5 seconds, so in a healthy deployment, dissemination happens fairly quickly. However, an unhealthy agent could be in a position where it can't sync with the server and may miss the update.

> What is the best practice to know which intermediate is the currently active one per server?

We actually don't have a good way to determine that, yet! There is a proposal that will enable you to query that state from the SPIRE Server APIs as part of a larger work to enable forced revocation and rotation of CAs.

> Finally, is this (CA rotation) documented anywhere on https://spiffe.io/? The only reason I asked here is because I had trouble finding anything regarding this there.

I... don't.... think.... so? We should at least document that in the SPIRE repository. Would you mind opening an issue in the SPIRE repository?
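The preparation and activation rules described above can be worked through in a few lines. This is a minimal sketch based only on the description in this thread (1/2 of the lifetime capped at 30 days, 1/6 capped at 7 days, both measured back from expiry); the function and constant names are hypothetical and this is not taken from SPIRE's source.

```python
from datetime import datetime, timedelta, timezone

# Caps as described in the thread; the names are illustrative only.
PREPARE_CAP = timedelta(days=30)
ACTIVATE_CAP = timedelta(days=7)


def rotation_thresholds(issued_at, not_after):
    """Return (prepare_at, activate_at) for an intermediate CA.

    A new CA is prepared once the active CA is within min(lifetime/2, 30 days)
    of expiry, and activated once it is within min(lifetime/6, 7 days).
    """
    lifetime = not_after - issued_at
    prepare_window = min(lifetime / 2, PREPARE_CAP)
    activate_window = min(lifetime / 6, ACTIVATE_CAP)
    return not_after - prepare_window, not_after - activate_window


issued = datetime(2022, 5, 1, tzinfo=timezone.utc)

# Short-lived CA (24h): the fractions apply, so prepare at +12h, activate at +20h.
short_prepare, short_activate = rotation_thresholds(issued, issued + timedelta(hours=24))

# Long-lived CA (365 days): the 30-day and 7-day caps apply instead.
long_prepare, long_activate = rotation_thresholds(issued, issued + timedelta(days=365))

print(short_prepare.isoformat(), short_activate.isoformat())
print(long_prepare.isoformat(), long_activate.isoformat())
```

The gap between the two timestamps is the period during which the new root is already in the bundle but not yet used for signing, which is what gives agents time to pick it up.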

elinesterov • Jun 30 '22 03:06

This is somewhat related to #997.

amartinezfayo • Jun 30 '22 19:06

This issue is stale because it has been open for 365 days with no activity.

github-actions[bot] • Jun 30 '23 22:06

This issue was closed because it has been inactive for 30 days since being marked as stale.

github-actions[bot] • Jul 31 '23 22:07