spire icon indicating copy to clipboard operation
spire copied to clipboard

Failed key preparations can result in extraneous keys being created in your KeyManager

Open keeganwitt opened this issue 1 year ago • 5 comments

We had an AWS account in which SPIRE created 16,000 keys it wasn't using. If we had not caught this, we would have been charged $1 per key per month, so an additional $16,000 would have been charged to us. In this example we are using AWS KMS, but this can apply to other KeyManager plugins as well (such Azure or GCP).

The problem is that PrepareX509CA() and PrepareJWTKey() in manager.go have no logic to delete the newly created key if the function fails at some point after it's creation. For example, if it fails to publish the JWT or if getting the upstream server to sign the X.509 key fails.

keeganwitt avatar Sep 16 '24 19:09 keeganwitt

We do also have evidence of the server pod crashlooping (106 times). So, another question with this issue is how to make the key creation an atomic operation.

keeganwitt avatar Sep 17 '24 17:09 keeganwitt

Thank you @keeganwitt for opening this issue and bringing it to our attention. This is certainly an issue that we want to address in SPIRE. We need to improve the handling of errors while new X.509 and JWT keys are prepared. This will probably involve some refactor in the PrepareX509CA() and PrepareJWTKey() functions, and make changes needed so SPIRE does not discard the created keys even if some later step failed. I'm putting this in the 1.11.1 milestone, so we can start planning the work on this soon.

amartinezfayo avatar Oct 01 '24 17:10 amartinezfayo

Hi @amartinezfayo, thank you for adding this to the roadmap. I am seeing the same issues as well and also using AWS KMS. I'm curious though, shouldn't the dispose keys task take care of this? I see there are no aliases attached to the key and the creation time is definitely longer than 2 days ago so I'm curious if you know why the key isn't picked up for scheduleDelete 😕 Thanks!

faali1 avatar Dec 09 '24 02:12 faali1

Hey folks! @faali1 and I were looking at the code and we think that the delete functionally is broken in KMS because of this https://github.com/spiffe/spire/blob/7b3181d8145c3ab3750028e57d47b7ff67cf434f/pkg/server/plugin/keymanager/awskms/awskms.go#L712

That line won't find any keys, because the key description has the key ID suffix (example: SPIRE_SERVER_KEY/${trust-domain}/x509-CA-A ) while p.descriptionPrefixForTrustDomain() returns SPIRE_SERVER_KEY/${trust-domain}

moe-omar avatar Dec 09 '24 15:12 moe-omar

Hey folks! @faali1 and I were looking at the code and we think that the delete functionally is broken in KMS because of this

https://github.com/spiffe/spire/blob/7b3181d8145c3ab3750028e57d47b7ff67cf434f/pkg/server/plugin/keymanager/awskms/awskms.go#L712

That line won't find any keys, because the key description has the key ID suffix (example: SPIRE_SERVER_KEY/${trust-domain}/x509-CA-A ) while p.descriptionPrefixForTrustDomain() returns SPIRE_SERVER_KEY/${trust-domain}

This is interesting. You're right, this doesn't seem like it'd ever match. But what's strange is in my cluster, I've seen SIPRE mark KMS keys for deletion. The JWT key description also follows a similar pattern, like SPIRE_SERVER_KEY/${escaped-trust-domain}/JWT-Signer-A.

keeganwitt avatar Dec 20 '24 21:12 keeganwitt