keyring: replication tries to replicate rotated-away keys
In https://github.com/hashicorp/nomad/issues/19340 @sbihel reported a behavior where the followers would try to replicate keys that had been previously rotated out, and this would fail:
[WARN] nomad.keyring.replicator: failed to fetch key from current leader, trying peers: key=128ba7c1-baa0-3bc6-c20f-833b97a1fbe2 error=
[ERROR] nomad.keyring.replicator: failed to fetch key from any peer: key=128ba7c1-baa0-3bc6-c20f-833b97a1fbe2 error="rpc error: no such key "128ba7c1-baa0-3bc6-c20f-833b97a1fbe2" in keyring"
[ERROR] nomad.keyring.replicator: failed to fetch key from any peer: rpc error: no such key "128ba7c1-baa0-3bc6-c20f-833b97a1fbe2" in keyring: key=128ba7c1-baa0-3bc6-c20f-833b97a1fbe2
#19340 covered another critical bug and was automatically closed once the fix was merged. This issue is a follow-up.
The specific error here occurs when the server we're replicating the key from tries to read the key material from its own keyring. That key material is no longer present, so replication can't succeed. That's not an unexpected scenario by itself: we already have to handle it when bootstrapping the keyring from one server to all the other servers, where some servers may get replication requests for keys they don't yet have.
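To make the failure mode concrete, here is a minimal sketch of a "try the leader, then the peers" fetch. It is not Nomad's replicator code: the `server` type, `getKey`, `fetchKey`, and `ErrNoSuchKey` are hypothetical names made up for illustration, and the real replicator talks to servers over RPC rather than reading in-memory maps.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrNoSuchKey is a hypothetical sentinel meaning "this server's keyring
// holds no material for the requested key ID".
var ErrNoSuchKey = errors.New("no such key in keyring")

// server is a hypothetical stand-in for a Nomad server. Its keyring maps
// key ID to key material.
type server struct {
	keyring map[string][]byte
}

// getKey returns the key material, or an error wrapping ErrNoSuchKey.
func (s *server) getKey(id string) ([]byte, error) {
	if material, ok := s.keyring[id]; ok {
		return material, nil
	}
	return nil, fmt.Errorf("%w: %q", ErrNoSuchKey, id)
}

// fetchKey asks the leader first and then each peer in turn. It fails only
// when no server at all holds the key material.
func fetchKey(id string, leader *server, peers []*server) ([]byte, error) {
	material, err := leader.getKey(id)
	if err == nil {
		return material, nil
	}
	for _, p := range peers {
		material, err = p.getKey(id)
		if err == nil {
			return material, nil
		}
	}
	return nil, fmt.Errorf("failed to fetch key from any peer: %w", err)
}

func main() {
	leader := &server{keyring: map[string][]byte{}}
	peer := &server{keyring: map[string][]byte{"bootstrap-key": []byte("material")}}
	peers := []*server{peer}

	// Bootstrap-like case: the leader lacks the key but a peer still has it.
	_, err := fetchKey("bootstrap-key", leader, peers)
	fmt.Println("bootstrap-key error:", err)

	// Orphaned case: the metadata names a key that no server holds anymore.
	_, err = fetchKey("orphaned-key", leader, peers)
	fmt.Println("orphaned-key error:", err)
	fmt.Println("is ErrNoSuchKey:", errors.Is(err, ErrNoSuchKey))
}
```

In this sketch a single miss is harmless because another server can still supply the key. The orphaned case differs only in that every server misses, and the caller cannot tell from the error alone whether the key is gone for good or a keyring restore is still pending.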
But for what is effectively an "orphaned" key, we're in a messy spot. We can't guarantee that the key is safe to remove from the metadata, because the operator may have had a bad recovery process and may still need to restore the on-disk keyring to the servers. As a workaround, the operator can remove the key via nomad operator root keyring remove if they know it's truly orphaned. But fixing https://github.com/hashicorp/nomad/issues/19368 seems like an important step toward fixing this issue.
Ref https://github.com/hashicorp/nomad/issues/19669
I've done some testing and I believe this will be resolved by the work done in https://github.com/hashicorp/nomad/pull/23577. I'm going to close this issue out.