# Reduce network usage of Agent registration sync through shallow entry fetching
## Problem

If a SPIRE deployment contains either
- a large number of Agents,
- Agents with a large number of authorized entries, or
- both,

then the periodic (default 5s) sync in which each Agent requests its authorized entries from the Server can, in aggregate, place significant load on the network. Each request returns the full list of entries, defined here. Each entry object carries multiple strings, multiple string arrays of arbitrary size, a couple of bools, and a few numbers.
With just a limited set of entry information, as shown in a toy test here, an entry's size is already about 235 bytes. With more selectors, federation relationships, DNS names, or TTL data set, the size grows further and can easily exceed 300-400 bytes per entry.
I believe we can make network usage more efficient here. In addition to lowering network capacity needs, this can also help alleviate the message size limits called out in #2675.
## Proposal

### Logic

Split the Agent's authorized entries sync into two requests; for the sake of naming them, call them ShallowFetch and DeepFetch.

The Agent first calls ShallowFetch, which returns a list of authorized "shallow" entries, where each entry object contains only the entry ID (a UUID string) and the revision number (an int64). Each returned shallow entry should be just under 100 bytes.
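To make the shape concrete, a shallow entry could look like the following Go struct. This is a hypothetical sketch; the type and field names are illustrative and not an existing SPIRE API:

```go
// ShallowEntry is a hypothetical, minimal projection of a registration
// entry: just enough for an Agent to decide whether its cached copy is
// current. It is illustrative only, not an existing SPIRE type.
type ShallowEntry struct {
	// Id is the registration entry's UUID string (~36 bytes).
	Id string
	// RevisionNumber is incremented by the Server on every update to
	// the entry, so a mismatch signals a stale cache.
	RevisionNumber int64
}
```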
For each shallow entry, compare against the Agent's current cache (a sketch of this diff follows the list):
- If there is a cache entry not present in the shallow entries, drop it from the cache (same logic as today)
- If there is a shallow entry ID not present in the cache, mark this as needing a Deep Fetch
- If there is a shallow entry ID present in the cache but at a different revision number, mark this as needing a Deep Fetch
- If there is a shallow entry ID present in the cache and at the same revision number, do nothing further for this entry
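A minimal sketch of that diff, assuming the Agent's cache is a map keyed by entry ID and reusing the hypothetical ShallowEntry type above (Entry here stands in for the full entry object):

```go
// Entry stands in for the full registration entry object; only the
// fields needed by the diff are shown.
type Entry struct {
	Id             string
	RevisionNumber int64
	// ... remaining full-entry fields elided ...
}

// diffShallow compares a ShallowFetch response against the cache and
// returns the IDs that need a DeepFetch. Cached entries absent from
// the shallow set are evicted, preserving today's drop semantics.
func diffShallow(cache map[string]Entry, shallow []ShallowEntry) (needDeep []string) {
	seen := make(map[string]struct{}, len(shallow))
	for _, s := range shallow {
		seen[s.Id] = struct{}{}
		cached, ok := cache[s.Id]
		if !ok || cached.RevisionNumber != s.RevisionNumber {
			// Missing or stale: mark for a full fetch.
			needDeep = append(needDeep, s.Id)
		}
		// Same ID at the same revision: nothing further to do.
	}
	for id := range cache {
		if _, ok := seen[id]; !ok {
			// No longer authorized: drop it (same logic as today).
			delete(cache, id)
		}
	}
	return needDeep
}
```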
For each entry needing a DeepFetch, request the full entry objects in a batch request to SPIRE Server.
- Possibly, request these in batch sizes likely to respect message size limits (see the sketch below). If any particular batch request fails, the Agent would simply lack those entries in its cache and thus automatically try to fetch them again in the next sync interval, while still keeping the successful results from prior iterations.
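A sketch of that batching, under the same hypothetical types as above; deepFetch stands in for whatever batch RPC the Server would expose, which is an assumption of this sketch rather than an existing API:

```go
import "context"

// fetchDeepInBatches requests full entries in fixed-size batches so
// each response is likely to stay under message size limits. A failed
// batch is skipped: its entries stay out of the cache, so the next
// sync's diff marks them for DeepFetch again, while entries from
// successful batches are kept.
func fetchDeepInBatches(
	ctx context.Context,
	ids []string,
	batchSize int,
	deepFetch func(context.Context, []string) ([]Entry, error), // hypothetical batch RPC
	cache map[string]Entry,
) {
	for start := 0; start < len(ids); start += batchSize {
		end := start + batchSize
		if end > len(ids) {
			end = len(ids)
		}
		entries, err := deepFetch(ctx, ids[start:end])
		if err != nil {
			// Transient failure: leave this batch uncached and move
			// on; it will be retried on the next sync interval.
			continue
		}
		for _, e := range entries {
			cache[e.Id] = e
		}
	}
}
```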
## Result

### Network usage reduction
On startup, the Agent will always still fetch all deep entries, since there is nothing in the cache.
We assume from experience that usually (99%+ of the time) an Agent's list of authorized entries and those entries' revisions will not change, or at least not change drastically, from one sync iteration to the next (a worked example follows the list below). In these cases, either:
- no entries have been updated or added, so only a ShallowFetch is performed, which gives a response roughly 42% (100/235) the size of a DeepFetch, and likely even smaller in practice (100/(300 to 400) = 25 to 33%).
- a small subset of entries have been updated or added. If we are talking about thousands of Agents, or Agents with tens of thousands of entries, the handful of DeepFetch results needed is likely a drop in the bucket compared to the full ShallowFetch, and certainly compared to today's experience.
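As an illustrative back-of-the-envelope calculation (the deployment figures here are assumptions, not measurements): 1,000 Agents each authorized for 10,000 entries at ~300 bytes per full entry transfer roughly 3 GB per 5-second sync in aggregate today. With ~100-byte shallow entries and a mostly unchanged entry set, the steady-state cost drops to roughly 1 GB per sync, about a 3x reduction, before counting the occasional small DeepFetch batches.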
Low-scale deployments would see less aggregate gain from this, but should still see a reduction.
Deployments where the Agent sync interval has been configured to be infrequent, or where the authorized registrations for Agents are frequently (every second or minute) added to or updated, may see an increase in network overhead, as every sync interval would result in a large "cache miss" in the ShallowFetch.
### Failure resiliency

If DeepFetches are partitioned into independent batches, we don't have to worry about transient issues causing an entire sync attempt to fail, and this may also alleviate the issue of message size limits preventing syncs of tens of thousands of authorized entries. This does not address ShallowFetches continuously failing for some reason, such as the list of shallow entries itself being large enough to violate size limits.
## What about implementing registration synchronization as a Merkle tree?
From a contributor sync:
- GetAuthorizedEntries already has a mask to only return some fields.
  - Though without a Server API change, the DeepFetch will always return everything. This is still more efficient in aggregate, but could be better by fetching only exactly what's needed, on top of the resiliency and message size awareness introduced by batching requests.
  - A better DeepFetch could be a follow-on?
- Paging options for this to handle message size limits? These may have a cache coherency problem between SPIRE Server instances, though if we switched the function from unary to streaming this would not occur. Paging over the DB cache could still complicate things (what if a cache sync occurs while streaming?).
  - Future datastore changes may make paging a better option.
- Maybe even a hash of the whole set could be returned to the Agent, cutting down the shallow fetch even more.
- Could a Merkle tree implementation provide the details on what changed?
I've put the new proposal that was settled on in a previous contributor sync into #3496, which was largely inspired by @amoore877's proposal above (thanks!). The most notable departure is that #3496 implements this as a single streaming RPC to alleviate our concerns about server split-view and doing multiple authorized entry crawls.
We'll close this and other related agent sync issues that are solved by the proposal, for now. If the proposal is rejected, we can reopen and reevaluate.