fix: 0.4.1 Reduce lock contention by narrowing lock scope; enable pprof; add watch retry timeout
Similar to "fix: race condition" (#50) by d-kuro · Pull Request #51 · argoproj/gitops-engine
Refactoring:
- enable pprof
- use a thread-safe data structure to hold cluster resources, and remove clusterCache.lock
- remove runSynced
Kudos, SonarCloud Quality Gate passed! 
0 Bugs
0 Vulnerabilities
0 Security Hotspots
17 Code Smells
No Coverage information
1.1% Duplication
@alexmt Maybe I should commit the code to master, instead of as a hotfix patch to the released branch?
I think some upstream code has been modified via other PRs, e.g. https://github.com/argoproj/gitops-engine/pull/353 and https://github.com/argoproj/gitops-engine/pull/337.
Should I split this PR into several smaller PRs?
I'm waiting for your reply, thanks.
cc @alexmt @jessesuen
Thanks for bringing this to our attention. I do know there is a lot of room for improvement around lock contention. @alswl - how did you observe the lock contention to arrive at the fix, and how did you validate it? Does the lock contention manifest in any way in the controller, either via errors, timeouts, or slowness?
Maybe I should commit the code to master, instead of as a hotfix patch to the released branch?
I think the PR should be against master, because it ultimately needs to go to master, and it can be cherry-picked to the release branch if necessary.
Please correct my understanding:
Based on what I see in the PR, the improvement being made is that instead of a single lock for an entire clusterInfo, we now have the following locks:
- clusterCache.lock - existing lock but I'm not sure what it's protecting anymore
- clusterCache.apisMeta.lock - lock to protect the API metadata cache about a group/kind
- clusterCache.namespacedResources.lock - lock to protect the map indicating whether a group/kind is namespaced or not
- clusterCache.resources.lock - lock to protect the individual single resource cache
- clusterCache.nsIndex.lock - lock to protect the namespace-level resources cache
- clusterCache.namespaces.lock - lock to protect list of namespaces engine is monitoring
- clusterCache.openAPISchemaLock - lock to protect openAPISchema
From this list, I think many of these additional locks & protections are unnecessary, because the fields are reassigned atomically and aren't updated in a lock-contentious way.
Also @alswl, do you have evidence of the improvements that resulted from these changes? I do agree there will be improvement, but it would be good to see what that looks like to weigh the risk/reward.
@jessesuen Thanks for your patient reply.
I manage huge Kubernetes clusters (70+ clusters, some of which have more than 10k nodes), in namespace mode (about 100+ namespaces). After starting, the Argo CD application-controller runs normally at first, then hits the race condition: it deadlocks on listSemaphore and clusterCache.lock. The symptom is that reconcile actions stop and the list/watch goroutines hang.
I will add more development logs and pprof information as evidence.
Rebasing onto master is WIP; I will submit it after it has been deployed in our production environment. It may be done within one week.