Failed key preparations can result in extraneous keys being created in your KeyManager
We had an AWS account in which SPIRE created 16,000 keys it wasn't using. If we had not caught this, we would have been charged $1 per key per month, so an additional $16,000 would have been charged to us. In this example we are using AWS KMS, but this can apply to other KeyManager plugins as well (such Azure or GCP).
The problem is that PrepareX509CA() and PrepareJWTKey() in manager.go have no logic to delete the newly created key if the function fails at some point after it's creation. For example, if it fails to publish the JWT or if getting the upstream server to sign the X.509 key fails.
We do also have evidence of the server pod crashlooping (106 times). So, another question with this issue is how to make the key creation an atomic operation.
Thank you @keeganwitt for opening this issue and bringing it to our attention. This is certainly an issue that we want to address in SPIRE. We need to improve the handling of errors while new X.509 and JWT keys are prepared. This will probably involve some refactor in the PrepareX509CA() and PrepareJWTKey() functions, and make changes needed so SPIRE does not discard the created keys even if some later step failed. I'm putting this in the 1.11.1 milestone, so we can start planning the work on this soon.
Hi @amartinezfayo, thank you for adding this to the roadmap. I am seeing the same issues as well and also using AWS KMS. I'm curious though, shouldn't the dispose keys task take care of this? I see there are no aliases attached to the key and the creation time is definitely longer than 2 days ago so I'm curious if you know why the key isn't picked up for scheduleDelete 😕 Thanks!
Hey folks! @faali1 and I were looking at the code and we think that the delete functionally is broken in KMS because of this https://github.com/spiffe/spire/blob/7b3181d8145c3ab3750028e57d47b7ff67cf434f/pkg/server/plugin/keymanager/awskms/awskms.go#L712
That line won't find any keys, because the key description has the key ID suffix (example: SPIRE_SERVER_KEY/${trust-domain}/x509-CA-A ) while p.descriptionPrefixForTrustDomain() returns SPIRE_SERVER_KEY/${trust-domain}
Hey folks! @faali1 and I were looking at the code and we think that the delete functionally is broken in KMS because of this
https://github.com/spiffe/spire/blob/7b3181d8145c3ab3750028e57d47b7ff67cf434f/pkg/server/plugin/keymanager/awskms/awskms.go#L712
That line won't find any keys, because the key description has the key ID suffix (example:
SPIRE_SERVER_KEY/${trust-domain}/x509-CA-A) whilep.descriptionPrefixForTrustDomain()returnsSPIRE_SERVER_KEY/${trust-domain}
This is interesting. You're right, this doesn't seem like it'd ever match. But what's strange is in my cluster, I've seen SIPRE mark KMS keys for deletion.
The JWT key description also follows a similar pattern, like SPIRE_SERVER_KEY/${escaped-trust-domain}/JWT-Signer-A.