Remove server dependency on on-disk data

Open evan2645 opened this issue 4 years ago • 2 comments

SPIRE Server currently stores a small amount of data on disk, which it uses to recover some state after restarts/reboots. That data is limited to public key information and CA certificates. In the case of X.509, when the server boots it looks on disk for any previously-generated CA cert, and then attempts to load the corresponding key from the keymanager. This was originally done because we don't have any kind of unique server identifier that could be used to identify this server's CA certificate in the datastore, and there was a desire to avoid needing a unique server identifier due to the complexity it would introduce.

This on-disk data poses a problem, though, particularly in k8s environments and in deployments that do not use an UpstreamAuthority. If a server is restarted and loses its on-disk cache, it must mint a new cert, which can lead to a period of unavailability while the newly minted cert is added to the bundle and propagates. To work around this issue, many people deploy SPIRE as a StatefulSet in k8s.

Proposed Solution

Rather than create a unique server identifier for the purposes of plucking this server's CA certificate from the datastore, we can infer which certificate belongs to us by looking at the public keys available to us in the keymanager. Certs in the bundle that match our keys are ours, and the rotation state (i.e. active vs prepared) can be inferred by inspecting notBefore/notAfter.
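A minimal Go sketch of that matching step follows. It is not taken from the SPIRE code base: `ownedCA`, `recoverOwnCAs`, `newCA`, and `demoSetup` are hypothetical names, and the rule "oldest unexpired match is active, any newer match is prepared" is just one assumed way of reading notBefore/notAfter.

```go
package main

import (
	"crypto"
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"fmt"
	"math/big"
	"sort"
	"time"
)

// ownedCA pairs a bundle certificate with its inferred rotation state.
type ownedCA struct {
	Cert  *x509.Certificate
	State string // "active" or "prepared" (illustrative labels)
}

// recoverOwnCAs returns the unexpired bundle certificates whose public key
// matches a key held by this server, ordered by notBefore.
// ASSUMPTION: the oldest match is active and any newer match is prepared.
func recoverOwnCAs(bundle []*x509.Certificate, keys []crypto.PublicKey, now time.Time) []ownedCA {
	var owned []ownedCA
	for _, cert := range bundle {
		if now.After(cert.NotAfter) {
			continue // expired, nothing to recover
		}
		// The standard library public key types all implement Equal.
		certKey, ok := cert.PublicKey.(interface{ Equal(crypto.PublicKey) bool })
		if !ok {
			continue
		}
		for _, key := range keys {
			if certKey.Equal(key) {
				owned = append(owned, ownedCA{Cert: cert, State: "prepared"})
				break
			}
		}
	}
	sort.Slice(owned, func(i, j int) bool {
		return owned[i].Cert.NotBefore.Before(owned[j].Cert.NotBefore)
	})
	if len(owned) > 0 {
		owned[0].State = "active"
	}
	return owned
}

// newCA self-signs a throwaway CA for the demo and panics on failure.
func newCA(cn string, notBefore time.Time) *x509.Certificate {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		panic(err)
	}
	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(1), Subject: pkix.Name{CommonName: cn},
		NotBefore: notBefore, NotAfter: notBefore.Add(24 * time.Hour),
		IsCA: true, BasicConstraintsValid: true, KeyUsage: x509.KeyUsageCertSign,
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, key.Public(), key)
	if err != nil {
		panic(err)
	}
	cert, err := x509.ParseCertificate(der)
	if err != nil {
		panic(err)
	}
	return cert
}

// demoSetup returns a three-certificate bundle and the two keys "we" hold.
func demoSetup() ([]*x509.Certificate, []crypto.PublicKey, time.Time) {
	now := time.Now()
	other := newCA("other-server", now.Add(-2*time.Hour))
	newer := newCA("ours-newer", now.Add(-10*time.Minute))
	older := newCA("ours-older", now.Add(-time.Hour))
	return []*x509.Certificate{other, newer, older},
		[]crypto.PublicKey{newer.PublicKey, older.PublicKey}, now
}

func main() {
	bundle, keys, now := demoSetup()
	for _, ca := range recoverOwnCAs(bundle, keys, now) {
		fmt.Printf("%s: %s\n", ca.Cert.Subject.CommonName, ca.State)
	}
}
```

In the demo, the certificate from the other server is ignored because none of our keys match it, and the two remaining certificates are reported as active (older) and prepared (newer).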

This works well for deployments without an UpstreamAuthority. For deployments that do have an UpstreamAuthority, we would end up minting a new intermediate at boot. Maybe that's ok, maybe it's not (e.g. a tight crash loop?). Recovering intermediates could potentially be solved in a similar way if deemed desirable (i.e. store the minted intermediates similar to the way we store and manage the bundle, but for the sole purpose of cert recovery).

evan2645 avatar Sep 15 '20 17:09 evan2645

I think the proposed solution looks good. I do think it will require storing the intermediates in the datastore. I also think there will be some considerations around migrating to this solution and deprecating the journal. For example, when this feature is turned on, upgraded servers won't have their intermediates stored in the datastore until the next rotation and will still need to rely on the journal. Of course, we'd also have to handle the schema migration introducing the new table. I imagine it would take us three releases to totally migrate:

1st "major" release) Introduce new table (according to our schema migration policies).
2nd "major" release) Import journal into new table (and maybe remove journal afterwards as a signal that it has been imported). Start populating table on CA/key preparation.
3rd "major" release) Remove code related to the journal.

The hydration from the journal in step two handles the case where somebody upgrades from 1->2->3 quickly enough that they don't hit a normal rotation event that would trigger the table to be populated.
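The import in step two could be sketched roughly as below. This is only an illustration under stated assumptions: `journalStore` is a hypothetical in-memory stand-in for the new table, `importJournal` and `writeDemoJournal` are made-up helpers, and the choice not to overwrite data already written by a normal rotation is an assumption, not something decided in this thread.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// journalStore is a hypothetical stand-in for the new datastore table.
type journalStore struct{ data []byte }

// importJournal copies the on-disk journal into the store, then removes the
// file as a signal that the import has happened. It reports whether data was
// copied. A missing file means "already imported" or "fresh server".
func importJournal(path string, store *journalStore) (bool, error) {
	raw, err := os.ReadFile(path)
	if errors.Is(err, fs.ErrNotExist) {
		return false, nil
	}
	if err != nil {
		return false, err
	}
	// ASSUMPTION: state already written by a normal rotation wins.
	imported := store.data == nil
	if imported {
		store.data = raw
	}
	// Remove only after storing, so a crash in between retries the import.
	if err := os.Remove(path); err != nil {
		return imported, err
	}
	return imported, nil
}

// writeDemoJournal creates a throwaway journal file for the demo.
func writeDemoJournal() (string, error) {
	dir, err := os.MkdirTemp("", "journal-demo")
	if err != nil {
		return "", err
	}
	path := filepath.Join(dir, "journal.pem")
	return path, os.WriteFile(path, []byte("journal-bytes"), 0o600)
}

func main() {
	path, err := writeDemoJournal()
	if err != nil {
		panic(err)
	}
	store := &journalStore{}
	first, err := importJournal(path, store)
	fmt.Println(first, err) // first boot after upgrade copies the journal
	second, err := importJournal(path, store)
	fmt.Println(second, err) // later boots find no file and do nothing
}
```

Because the file is removed after a successful copy, running the import on every boot is harmless: the second call in the demo is a no-op.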

azdagron avatar Oct 28 '20 15:10 azdagron

We'll probably want to also introduce a pruning operation for this intermediate CA cache. Maybe just piggy back off of PruneBundle?

azdagron avatar Oct 28 '20 15:10 azdagron

I've seen increasing interest from the SPIRE community in having this issue addressed. Since I think we all agree that this is something we want resolved, it would be good to start working on it. Also, the force rotation changes happening at the moment seem to provide a good opportunity to make this change. I'll be happy to help finalize the scope and design of the solution.

amartinezfayo avatar Jun 30 '23 22:06 amartinezfayo

What is the size of the payload being stored? If it's just public keys to look up, you could store it in a configmap. 1M gzip goes a long way.

drewwells avatar Aug 01 '23 23:08 drewwells

What is the size of the payload being stored?

It isn't much data, just some per-server tracking information for the X.509 and JWT authorities of that server.

you could store it in a configmap

SPIRE operates in many environments, not just K8s, so whatever solution we come up with needs to work everywhere.

azdagron avatar Aug 01 '23 23:08 azdagron

Just make configmaps a persistence type.

drewwells avatar Aug 02 '23 00:08 drewwells

@evan2645 Can you offer some visibility here? You're the original author of the issue, and if it was an idea without an implementation, that would be useful to know. If you had ideas about the implementation approach, that would be very useful to know too.

edwbuck avatar Aug 18 '23 14:08 edwbuck

@edwbuck There is already a proposed solution, although there are details that we need to define to clearly establish the scope. I'm working on that, and I'll put together a concrete plan to follow so it can be discussed. This will require database schema changes, and we plan to introduce those schema changes in the 1.8.0 release.

amartinezfayo avatar Aug 18 '23 14:08 amartinezfayo

Proposed plan is:

  • [1.8.0] (#4465) Introduce the schema changes needed to store in the database the journal information that is currently stored on disk. One flexible way to do this is to add a table that essentially contains the same data we store today in the journal file (which is a PEM-encoded protobuf). The table can have a data blob field, where we can encode whatever we need. Its contents can be pretty much the same as what we have today in the journal (information about prepared, active, and old X509 and JWT authorities). This gives us the flexibility to grow the structure as needed without having to make schema changes, since it is all marshalled in a protobuf. To efficiently identify the row belonging to this server, we can have separate columns storing the Subject Key ID of the current X509 authority and the KID claim of the current JWT authority, which identify the current slots and can be matched against what we get from a call to GetPublicKeys in the key manager.
  • [1.9.0] (#4463, #4464) Migrate the information from the journal file to the table in the database. Start populating the database with information about prepared, active, and old X509 and JWT authorities.
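To make the row lookup concrete, here is a small Go sketch under stated assumptions. `journalRow`, `subjectKeyID`, `findOwnRow`, and `demoRows` are hypothetical names, the rows live in memory rather than in a database, and the Subject Key ID derivation (hex SHA-1 of the PKIX-encoded public key) is an assumed stand-in. Only the X509 column is matched here; the JWT column would be handled the same way once its ID can be derived from the key manager's keys.

```go
package main

import (
	"crypto"
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/sha1"
	"crypto/x509"
	"encoding/hex"
	"fmt"
)

// journalRow is a hypothetical shape for one row of the proposed table.
type journalRow struct {
	ActiveX509AuthorityID string // Subject Key ID of the current X509 authority
	ActiveJWTAuthorityID  string // KID of the current JWT authority
	Data                  []byte // opaque marshalled journal, free to evolve
}

// subjectKeyID is an ASSUMED derivation: hex SHA-1 of the PKIX-encoded key.
func subjectKeyID(pub crypto.PublicKey) (string, error) {
	der, err := x509.MarshalPKIXPublicKey(pub)
	if err != nil {
		return "", err
	}
	sum := sha1.Sum(der)
	return hex.EncodeToString(sum[:]), nil
}

// findOwnRow returns the first row whose X509 authority ID matches one of the
// public keys reported by the key manager.
func findOwnRow(rows []journalRow, keys []crypto.PublicKey) (journalRow, bool, error) {
	ids := make(map[string]bool, len(keys))
	for _, key := range keys {
		id, err := subjectKeyID(key)
		if err != nil {
			return journalRow{}, false, err
		}
		ids[id] = true
	}
	for _, row := range rows {
		if ids[row.ActiveX509AuthorityID] {
			return row, true, nil
		}
	}
	return journalRow{}, false, nil
}

// demoRows builds one row per simulated server and returns the keys held by
// the second server only.
func demoRows() ([]journalRow, []crypto.PublicKey, error) {
	var rows []journalRow
	var pubs []crypto.PublicKey
	for _, name := range []string{"server-a", "server-b"} {
		key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
		if err != nil {
			return nil, nil, err
		}
		id, err := subjectKeyID(key.Public())
		if err != nil {
			return nil, nil, err
		}
		rows = append(rows, journalRow{ActiveX509AuthorityID: id, Data: []byte(name)})
		pubs = append(pubs, key.Public())
	}
	return rows, pubs[1:], nil
}

func main() {
	rows, keys, err := demoRows()
	if err != nil {
		panic(err)
	}
	row, found, err := findOwnRow(rows, keys)
	fmt.Println(string(row.Data), found, err)
}
```

Keeping the journal itself as an opaque blob means only the two ID columns need to be indexed, and the lookup never has to unmarshal rows that belong to other servers.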

amartinezfayo avatar Aug 29 '23 18:08 amartinezfayo

Done in #4690.

amartinezfayo avatar Jan 03 '24 15:01 amartinezfayo