All GKs are re-synced in cluster cache even if only one fails

Open crenshaw-dev opened this issue 1 year ago • 0 comments

The full cluster resource cache is built (1) on startup, (2) every 24 hours (by default, configurable), and (3) every 10 seconds (by default, configurable) if there's an error while building the cache.

The third can be a problem.

Suppose we have a cluster with 100 API group/kinds (GKs). And suppose, for some weird reason, the cluster has a LOT of some particular GK (let's say RoleBindings). When gitops-engine syncs the cluster cache, it will list all the RoleBindings. If the "continue token" (used for pagination) expires while listing all those RoleBindings, gitops-engine will mark the whole sync as failed, and 10 seconds later it will attempt to rebuild the whole cache. Rebuilding the whole cache is incredibly wasteful, especially when the other 99 GKs were cached successfully.

The problem can exacerbate itself. By hammering the k8s API with list requests for every resource every 10 seconds, we likely increase k8s response times, which in turn makes further errors (such as expired continue tokens) more likely.

To see if you're affected by this problem, search your logs for "Start syncing cluster". It should only happen every 24 hours by default. If you're seeing it more often than that, you're impacted.

I recommend two mitigations:

  1. Only retry the GKs which experienced errors. If we successfully cache 99 GKs, don't attempt to re-load those items.
  2. Back off retries instead of using a static 10s interval. If the problem is caused by cluster load, retrying less often may alleviate it.

crenshaw-dev · May 10 '23 14:05