fix: 0.4.1 Reduce lock contention by narrowing lock scope; enable pprof; add watch retry timeout
Similar to "fix: race condition" (#50) by d-kuro · Pull Request #51 · argoproj/gitops-engine
Refactoring:
- enable pprof
- use a thread-safe data structure to hold cluster resources, and remove clusterCache.lock
- remove runSynced
Kudos, SonarCloud Quality Gate passed! 
0 Bugs
0 Vulnerabilities
0 Security Hotspots
17 Code Smells
No Coverage information
1.1% Duplication
@alexmt Maybe I should commit the code to master, instead of as a hotfix patch to the released branch?
I think some upstream code has been modified via other PRs, e.g. https://github.com/argoproj/gitops-engine/pull/353 and https://github.com/argoproj/gitops-engine/pull/337.
Should I split this PR into several smaller PRs?
I'm waiting for your reply, thanks.
cc @alexmt @jessesuen
Thanks for bringing this to our attention. I do know there is a lot of room for improvement around lock contention. @alswl - how did you observe the lock contention to arrive at the fix, and how did you validate it? Does the lock contention manifest in any way in the controller, either via errors, timeouts, or slowness?
Maybe I should commit the code to master, instead of as a hotfix patch to the released branch?
I think the PR should be against master, because it ultimately needs to go to master, and it can be cherry-picked to the release branch if necessary.
Please correct my understanding:
Based on what I see in the PR, the improvement being made is that instead of a single lock for an entire clusterInfo, we now have the following locks:
- clusterCache.lock - existing lock but I'm not sure what it's protecting anymore
- clusterCache.apisMeta.lock - lock to protect the API metadata cache about a group/kind
- clusterCache.namespacedResources.lock - lock to protect the map indicating whether a group/kind is namespaced or not
- clusterCache.resources.lock - lock to protect the individual single resource cache
- clusterCache.nsIndex.lock - lock to protect the namespace-level resources cache
- clusterCache.namespaces.lock - lock to protect list of namespaces engine is monitoring
- clusterCache.openAPISchemaLock - lock to protect openAPISchema
From this list, I think many of these additional locks & protections are unnecessary, because the fields are reassigned atomically and aren't updated in a lock-contentious way.
Also @alswl, do you have evidence of the improvements that resulted from these changes? I do agree there will be improvement, but it would be good to see what that looks like to weigh the risk/reward.
@jessesuen Thanks for your patient reply.
I manage huge Kubernetes clusters (70+ clusters, some of which have more than 10k nodes), in namespace mode (about 100+ namespaces). After starting, the Argo CD application-controller runs normally at first, then hits the race condition: it deadlocks on listSemaphore and clusterCache.lock. The symptom is that reconcile actions stop and the list/watch goroutines hang.
I will add more development logs and pprof information as evidence.
Rebasing onto master is WIP; I will submit it after it has been deployed in our production environment. It may be done within one week.