spire
Agent soft-restart for re-attestation
When an agent SVID expires or is otherwise invalid (e.g. the agent has been evicted), the agent needs to re-attest. Currently the only way for this to happen is for the agent to undergo a full restart. This behavior has several disadvantages: the workload SVID cache is purged, (future) debug/health endpoints become unavailable, etc.
Instead of requiring a full restart, the agent should be refactored so that it can soft-restart only the subsystems affected by the agent SVID being invalid.
In order to implement this safely, we need to decide under what conditions the agent should stop, if any, and when the SVID cache should be purged. For example, the cache should probably be purged if the agent is banned or evicted, but not when the agent SVID expires.
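To make the purge question concrete, here is a rough sketch in Go of the policy suggested above. The names `InvalidationReason` and `ShouldPurgeCache` are hypothetical and are not part of the SPIRE codebase; the sketch only encodes "purge on ban or eviction, keep on expiry" and would need revisiting once the conditions are actually decided.

```go
package main

import "fmt"

// InvalidationReason is a hypothetical enum describing why the agent SVID
// can no longer be used. It is illustrative only, not a SPIRE type.
type InvalidationReason int

const (
	ReasonExpired InvalidationReason = iota // SVID lifetime ran out
	ReasonEvicted                           // agent entry was removed on the server
	ReasonBanned                            // agent was explicitly banned
)

// ShouldPurgeCache reports whether the workload SVID cache should be dropped
// before the agent re-attests. Assumption: ban and eviction mean the cached
// SVIDs can no longer be trusted, while plain expiry does not.
func ShouldPurgeCache(r InvalidationReason) bool {
	switch r {
	case ReasonBanned, ReasonEvicted:
		return true
	default:
		return false
	}
}

func main() {
	reasons := map[string]InvalidationReason{
		"expired": ReasonExpired,
		"evicted": ReasonEvicted,
		"banned":  ReasonBanned,
	}
	for name, r := range reasons {
		fmt.Printf("%s: purge=%v\n", name, ShouldPurgeCache(r))
	}
}
```

Keeping the decision in one small function would let the soft-restart path ask a single question before tearing anything down, instead of spreading the policy across subsystems.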
Another question: What should the agent do if re-attestation fails? Is there a point where it gives up and crashes or does it keep attempting (with backoff)?
> Another question: What should the agent do if re-attestation fails? Is there a point where it gives up and crashes or does it keep attempting (with backoff)?
I think continuing to attempt (with backoff) would be the better default, as it would provide more graceful recovery after a prolonged network partition from the SPIRE server. However, if the attestations are reaching the server and being rejected, perhaps it should give up after some number of retries. Could we detect a network partition and keep retrying unless some (configurable?) number of attestation attempts reach the server and get rejected?
This issue is stale because it has been open for 365 days with no activity.
This issue was closed because it has been inactive for 30 days since being marked as stale.