sealed-secrets
sealed-secrets copied to clipboard
Decouple "current key" from "latest key"
The suggested backup workflow works a-ok in practice but suffers from a race condition. It also doesn't resonate with a lot of people (possibly because they intuit such a race condition or possibly just because or doc is bad).
The race condition is the following:
- An admin (or automation) backs up the current keys
- Key renewal period (30d) expires and a new key is created
- Another user seals a secret with
kubeseal
(or fetches a cert withkubeseal --fetch-cert
) - The cluster explodes.
If there is not enough time to run a backup between step 2 and 4, there might be one or more sealed secrets encrypted with a key that hasn't been backed up yet (and thus cannot be recovered after the disaster).
That's not usually a big problem if you run frequent backups (daily) or if you synchronize your backup schedule with the key renewal schedule, but the race condition is still there, albeit for a short amount of time.
Sealed secrets works with many keys, and while the youngest key is a bit special because it's the one used by default when sealing a new secret with kubeseal
, you can seal new secrets by using any of the existing keys as well; that can happen since people can use the "offline" method and pass the certificate explicitly with --cert
.
The latest (most recent) key is special because it's the most secure (the less likely to have been leaked in the past; simply because it's younger).
Our current default workflow does indeed prioritise security over disaster recoverability.
The subset of users who care about disaster recoverability are quite vocal, so I think we should do something to help them.
A common ask is control over sealing key renewal, but what I really think should happen is control over advertising the public key of the newly created key as "current" (and thus causing the possibility of encrypting an unrecoverable a secret until a backup is done).
A possible "almost zero knob" UX for this would be to add a delay before a new public key is advertised as current.
For example, imagine adding a default delay of renewal period / 2.
Let's draw a period of 10 days and a delay of 5 days:
key renew: 1---------2---------3---------4---------5---------
current: -----1---------2---------3---------4---------5----
now, this means that as long as you run a periodic backup every 5 days, no matter what's the phase of your backup w.r.t the key renewal, every time a key becomes "current", you're guaranteed to have backed it u:
key renew: 1---------2---------3---------4---------5---------
current: -----1---------2---------3---------4---------5----
b phase 1: 1----1----2----2----3----3----4----4----5----5----
b phase 2: -1----1----2----2----3----3----4----4----5----5---
b phase 3: --1----1----2----2----3----3----4----4----5----5--
b phase 4: ---1----1----2----2----3----3----4----4----5----5-
b phase 5: ----1----1----2----2----3----3----4----4----5----5
i.e. at any vertical slice of that diagram, any of the possible backup schedule "phases" already contains the current key number:
key renew: 1---------2---------3---------4---------5---------
current: -----1---------2---------3<<v------4---------5----
b phase 1: 1----1----2----2----3----3<<v-4----4----5----5----
b phase 2: -1----1----2----2----3----3<v--4----4----5----5---
b phase 3: --1----1----2----2----3----3v---4----4----5----5--
b phase 4: ---1----1----2----2----3----x----4----4----5----5-
b phase 5: ----1----1----2----2----3<<</3----4----4----5----5
CC: @jjo
@mkmik lifecycle LGTM, it would be also great to have the keys annotated with timestamps for its relevant events:
-
t0
: key creation -
t1
: key promotion to "current" -
t2
: key demotion , so that cluster operators could then potentially verify that backups did happen between[t0, t1]
, to be able to assert no "uncovered" backup period happened.