chore: avoid resources lock contention (#8172)
Closes https://github.com/argoproj/argo-cd/issues/8172 Improve reconciliation performance for large clusters by avoiding resource lock contention - https://github.com/argoproj/gitops-engine/pull/629.
Checklist:
- [x] Either (a) I've created an enhancement proposal and discussed it with the community, (b) this is a bug fix, or (c) this does not need to be in the release notes.
- [x] The title of the PR states what changed and the related issues number (used for the release note).
- [x] The title of the PR conforms to the Toolchain Guide
- [x] I've included "Closes [ISSUE #]" or "Fixes [ISSUE #]" in the description to automatically close the associated issue.
- [ ] I've updated both the CLI and UI to expose my feature, or I plan to submit a second PR with them.
- [ ] Does this PR require documentation updates?
- [ ] I've updated documentation as required by this PR.
- [x] I have signed off all my commits as required by DCO
- [ ] I have written unit and/or e2e tests for my change. PRs without these are unlikely to be merged.
- [ ] My build is green (troubleshooting builds).
- [ ] My new feature complies with the feature status guidelines.
- [x] I have added a brief description of why this PR is necessary and/or what this PR solves.
- [ ] Optional. My organization is added to USERS.md.
- [ ] Optional. For bug fixes, I've indicated what older releases this fix should be cherry-picked into (this may or may not happen depending on risk/complexity).
:x: Preview Environment deleted from Bunnyshell
Available commands (reply to this comment):
- :rocket:
/bns:deployto deploy the environment
The Integration tests are failing. Can you fix the CI Failures?
Sure, I’ll examine the logs to determine the root cause of the failing integration tests.
@mpelekh lmk if you need any help running e2e tests locally to speed up iteration.
From discussion at SIG Scalability:
- Looking great
- Tests are probably failing because operation status and sync status updates used to be effectively atomic; now there's a 1s delay
- Try increasing ticker to 10s to see what fails, decrease to 0.1s to see if that resolves some flakes
- Assess whether sync and operation statuses are updated actually atomic today; if we break that guarantee, that's probably an Actual Problem
- Instead of shipping behind a flag, YOLO it and plan to revert quickly if it causes trouble
- Try increasing ticker to 10s to see what fails, decrease to 0.1s to see if that resolves some flakes
@crenshaw-dev Here are the test results when the ticker was increased to 10s:
- https://github.com/argoproj/argo-cd/actions/runs/11484297420/job/31961990187?pr=20329
Test results when the ticker was decreased to 0.1s:
- https://github.com/argoproj/argo-cd/actions/runs/11496349194/job/31997981265?pr=20329
Assess whether sync and operation statuses are updated actually atomic today; if we break that guarantee, that's probably an Actual Problem
This is what I see in the code:
-
SyncAppStateis called when the application needs to be synced (argocd synchas been called) https://github.com/argoproj/argo-cd/blob/master/controller/sync.go#L95 - It applies all resources and sets the operation status to Succeed in case of success. It doesn't mean that all resources are synced already, as you can see from the output (these examples are from the built with updates from pull request, where events are processed every 1 second. Application has been created but not synced yet)
root@017f9686563c:/go/src/github.com/argoproj/argo-cd# dist/argocd app get argocd-e2e-external/test-namespaced-config-map --plaintext --server localhost:8080 --auth-token TOKEN --insecure
Name: argocd-e2e-external/test-namespaced-config-map
Project: default
Server: https://kubernetes.default.svc/
Namespace: argocd-e2e--test-namespaced-config-map-dwnet
URL: http://localhost:8080/applications/test-namespaced-config-map
Source:
- Repo: file:///tmp/argo-e2e/testdata.git
Target:
Path: config-map
SyncWindow: Sync Allowed
Sync Policy: Manual
Sync Status: OutOfSync from (bf5ea9e)
Health Status: Healthy
GROUP KIND NAMESPACE NAME STATUS HEALTH HOOK MESSAGE
ConfigMap argocd-e2e--test-namespaced-config-map-dwnet my-map OutOfSync Missing
root@017f9686563c:/go/src/github.com/argoproj/argo-cd# dist/argocd app sync argocd-e2e-external/test-namespaced-config-map --prune --plaintext --server localhost:8080 --auth-token TOKEN --insecure
TIMESTAMP GROUP KIND NAMESPACE NAME STATUS HEALTH HOOK MESSAGE
2024-10-30T21:55:03+00:00 ConfigMap argocd-e2e--test-namespaced-config-map-dwnet my-map OutOfSync Missing
2024-10-30T21:55:03+00:00 ConfigMap argocd-e2e--test-namespaced-config-map-dwnet my-map OutOfSync Missing configmap/my-map created
Name: argocd-e2e-external/test-namespaced-config-map
Project: default
Server: https://kubernetes.default.svc/
Namespace: argocd-e2e--test-namespaced-config-map-dwnet
URL: http://localhost:8080/applications/argocd-e2e-external/test-namespaced-config-map
Source:
- Repo: file:///tmp/argo-e2e/testdata.git
Target:
Path: config-map
SyncWindow: Sync Allowed
Sync Policy: Manual
Sync Status: OutOfSync from (bf5ea9e)
Health Status: Healthy
Operation: Sync
Sync Revision: bf5ea9e6c4184bb3ee436b957f54c1bc8adb3ce3
Phase: Succeeded
Start: 2024-10-30 21:55:03 +0000 UTC
Finished: 2024-10-30 21:55:03 +0000 UTC
Duration: 0s
Message: successfully synced (all tasks run)
GROUP KIND NAMESPACE NAME STATUS HEALTH HOOK MESSAGE
ConfigMap argocd-e2e--test-namespaced-config-map-dwnet my-map OutOfSync Missing configmap/my-map created
root@017f9686563c:/go/src/github.com/argoproj/argo-cd# dist/argocd app get argocd-e2e-external/test-namespaced-config-map --plaintext --server localhost:8080 --auth-token TOKEN --insecure
Name: argocd-e2e-external/test-namespaced-config-map
Project: default
Server: https://kubernetes.default.svc/
Namespace: argocd-e2e--test-namespaced-config-map-dwnet
URL: http://localhost:8080/applications/test-namespaced-config-map
Source:
- Repo: file:///tmp/argo-e2e/testdata.git
Target:
Path: config-map
SyncWindow: Sync Allowed
Sync Policy: Manual
Sync Status: Synced to (bf5ea9e)
Health Status: Healthy
GROUP KIND NAMESPACE NAME STATUS HEALTH HOOK MESSAGE
ConfigMap argocd-e2e--test-namespaced-config-map-dwnet my-map Synced configmap/my-map created
Resources get synced once the related events are received in gitops-engine. Then for each updated resource the handleObjectUpdated callback is called - https://github.com/argoproj/argo-cd/blob/master/controller/appcontroller.go#L395
This callback requests the app refresh, which in turn puts item in the queue:
- https://github.com/argoproj/argo-cd/blob/master/controller/appcontroller.go#L453
- https://github.com/argoproj/argo-cd/blob/master/controller/appcontroller.go#L916
Then, this item is received from the queue in a loop:
- https://github.com/argoproj/argo-cd/blob/master/controller/appcontroller.go#L885
And processed:
- https://github.com/argoproj/argo-cd/blob/master/controller/appcontroller.go#L933
And only after the Sync status gets Synced, as we can see from the output.
So, it means that the sync and operation status are not atomically updated. They just very quickly updated.
Codecov Report
Attention: Patch coverage is 76.00000% with 6 lines in your changes missing coverage. Please review.
Project coverage is 55.23%. Comparing base (
87c853e) to head (4bed42d). Report is 440 commits behind head on master.
| Files with missing lines | Patch % | Lines |
|---|---|---|
| controller/cache/cache.go | 40.00% | 2 Missing and 1 partial :warning: |
| controller/metrics/metrics.go | 84.21% | 3 Missing :warning: |
Additional details and impacted files
@@ Coverage Diff @@
## master #20329 +/- ##
==========================================
+ Coverage 53.80% 55.23% +1.42%
==========================================
Files 324 324
Lines 55603 55676 +73
==========================================
+ Hits 29918 30753 +835
+ Misses 23082 22294 -788
- Partials 2603 2629 +26
:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.