argo-cd icon indicating copy to clipboard operation
argo-cd copied to clipboard

chore: avoid resources lock contention (#8172)

Open mpelekh opened this issue 1 year ago • 3 comments

Closes https://github.com/argoproj/argo-cd/issues/8172 Improve reconciliation performance for large clusters by avoiding resource lock contention - https://github.com/argoproj/gitops-engine/pull/629.

Checklist:

  • [x] Either (a) I've created an enhancement proposal and discussed it with the community, (b) this is a bug fix, or (c) this does not need to be in the release notes.
  • [x] The title of the PR states what changed and the related issues number (used for the release note).
  • [x] The title of the PR conforms to the Toolchain Guide
  • [x] I've included "Closes [ISSUE #]" or "Fixes [ISSUE #]" in the description to automatically close the associated issue.
  • [ ] I've updated both the CLI and UI to expose my feature, or I plan to submit a second PR with them.
  • [ ] Does this PR require documentation updates?
  • [ ] I've updated documentation as required by this PR.
  • [x] I have signed off all my commits as required by DCO
  • [ ] I have written unit and/or e2e tests for my change. PRs without these are unlikely to be merged.
  • [ ] My build is green (troubleshooting builds).
  • [ ] My new feature complies with the feature status guidelines.
  • [x] I have added a brief description of why this PR is necessary and/or what this PR solves.
  • [ ] Optional. My organization is added to USERS.md.
  • [ ] Optional. For bug fixes, I've indicated what older releases this fix should be cherry-picked into (this may or may not happen depending on risk/complexity).

mpelekh avatar Oct 10 '24 12:10 mpelekh

:x: Preview Environment deleted from Bunnyshell

Available commands (reply to this comment):

  • :rocket: /bns:deploy to deploy the environment

bunnyshell[bot] avatar Oct 10 '24 12:10 bunnyshell[bot]

The Integration tests are failing. Can you fix the CI Failures?

Sure, I’ll examine the logs to determine the root cause of the failing integration tests.

mpelekh avatar Oct 11 '24 15:10 mpelekh

@mpelekh lmk if you need any help running e2e tests locally to speed up iteration.

crenshaw-dev avatar Oct 11 '24 15:10 crenshaw-dev

From discussion at SIG Scalability:

  1. Looking great
  2. Tests are probably failing because operation status and sync status updates used to be effectively atomic; now there's a 1s delay
  3. Try increasing ticker to 10s to see what fails, decrease to 0.1s to see if that resolves some flakes
  4. Assess whether sync and operation statuses are updated actually atomic today; if we break that guarantee, that's probably an Actual Problem
  5. Instead of shipping behind a flag, YOLO it and plan to revert quickly if it causes trouble

crenshaw-dev avatar Oct 23 '24 17:10 crenshaw-dev

  1. Try increasing ticker to 10s to see what fails, decrease to 0.1s to see if that resolves some flakes

@crenshaw-dev Here are the test results when the ticker was increased to 10s:

  • https://github.com/argoproj/argo-cd/actions/runs/11484297420/job/31961990187?pr=20329

Test results when the ticker was decreased to 0.1s:

  • https://github.com/argoproj/argo-cd/actions/runs/11496349194/job/31997981265?pr=20329

mpelekh avatar Oct 24 '24 11:10 mpelekh

Assess whether sync and operation statuses are updated actually atomic today; if we break that guarantee, that's probably an Actual Problem

This is what I see in the code:

  • SyncAppState is called when the application needs to be synced (argocd sync has been called) https://github.com/argoproj/argo-cd/blob/master/controller/sync.go#L95
  • It applies all resources and sets the operation status to Succeed in case of success. It doesn't mean that all resources are synced already, as you can see from the output (these examples are from the built with updates from pull request, where events are processed every 1 second. Application has been created but not synced yet)
root@017f9686563c:/go/src/github.com/argoproj/argo-cd# dist/argocd app get argocd-e2e-external/test-namespaced-config-map --plaintext --server localhost:8080 --auth-token TOKEN --insecure
Name:               argocd-e2e-external/test-namespaced-config-map
Project:            default
Server:             https://kubernetes.default.svc/
Namespace:          argocd-e2e--test-namespaced-config-map-dwnet
URL:                http://localhost:8080/applications/test-namespaced-config-map
Source:
- Repo:             file:///tmp/argo-e2e/testdata.git
  Target:           
  Path:             config-map
SyncWindow:         Sync Allowed
Sync Policy:        Manual
Sync Status:        OutOfSync from  (bf5ea9e)
Health Status:      Healthy

GROUP  KIND       NAMESPACE                                     NAME    STATUS     HEALTH   HOOK  MESSAGE
       ConfigMap  argocd-e2e--test-namespaced-config-map-dwnet  my-map  OutOfSync  Missing        



root@017f9686563c:/go/src/github.com/argoproj/argo-cd# dist/argocd app sync argocd-e2e-external/test-namespaced-config-map  --prune --plaintext --server localhost:8080 --auth-token TOKEN --insecure
TIMESTAMP                  GROUP        KIND   NAMESPACE                                                    NAME    STATUS    HEALTH        HOOK  MESSAGE
2024-10-30T21:55:03+00:00          ConfigMap  argocd-e2e--test-namespaced-config-map-dwnet                my-map  OutOfSync  Missing              
2024-10-30T21:55:03+00:00          ConfigMap  argocd-e2e--test-namespaced-config-map-dwnet                my-map  OutOfSync  Missing              configmap/my-map created

Name:               argocd-e2e-external/test-namespaced-config-map
Project:            default
Server:             https://kubernetes.default.svc/
Namespace:          argocd-e2e--test-namespaced-config-map-dwnet
URL:                http://localhost:8080/applications/argocd-e2e-external/test-namespaced-config-map
Source:
- Repo:             file:///tmp/argo-e2e/testdata.git
  Target:           
  Path:             config-map
SyncWindow:         Sync Allowed
Sync Policy:        Manual
Sync Status:        OutOfSync from  (bf5ea9e)
Health Status:      Healthy

Operation:          Sync
Sync Revision:      bf5ea9e6c4184bb3ee436b957f54c1bc8adb3ce3
Phase:              Succeeded
Start:              2024-10-30 21:55:03 +0000 UTC
Finished:           2024-10-30 21:55:03 +0000 UTC
Duration:           0s
Message:            successfully synced (all tasks run)

GROUP  KIND       NAMESPACE                                     NAME    STATUS     HEALTH   HOOK  MESSAGE
       ConfigMap  argocd-e2e--test-namespaced-config-map-dwnet  my-map  OutOfSync  Missing        configmap/my-map created


root@017f9686563c:/go/src/github.com/argoproj/argo-cd# dist/argocd app get argocd-e2e-external/test-namespaced-config-map --plaintext --server localhost:8080 --auth-token TOKEN --insecure
Name:               argocd-e2e-external/test-namespaced-config-map
Project:            default
Server:             https://kubernetes.default.svc/
Namespace:          argocd-e2e--test-namespaced-config-map-dwnet
URL:                http://localhost:8080/applications/test-namespaced-config-map
Source:
- Repo:             file:///tmp/argo-e2e/testdata.git
  Target:           
  Path:             config-map
SyncWindow:         Sync Allowed
Sync Policy:        Manual
Sync Status:        Synced to  (bf5ea9e)
Health Status:      Healthy

GROUP  KIND       NAMESPACE                                     NAME    STATUS  HEALTH  HOOK  MESSAGE
       ConfigMap  argocd-e2e--test-namespaced-config-map-dwnet  my-map  Synced                configmap/my-map created

Resources get synced once the related events are received in gitops-engine. Then for each updated resource the handleObjectUpdated callback is called - https://github.com/argoproj/argo-cd/blob/master/controller/appcontroller.go#L395

This callback requests the app refresh, which in turn puts item in the queue:

  • https://github.com/argoproj/argo-cd/blob/master/controller/appcontroller.go#L453
  • https://github.com/argoproj/argo-cd/blob/master/controller/appcontroller.go#L916

Then, this item is received from the queue in a loop:

  • https://github.com/argoproj/argo-cd/blob/master/controller/appcontroller.go#L885

And processed:

  • https://github.com/argoproj/argo-cd/blob/master/controller/appcontroller.go#L933

And only after the Sync status gets Synced, as we can see from the output.

So, it means that the sync and operation status are not atomically updated. They just very quickly updated.

mpelekh avatar Nov 06 '24 15:11 mpelekh

Codecov Report

Attention: Patch coverage is 76.00000% with 6 lines in your changes missing coverage. Please review.

Project coverage is 55.23%. Comparing base (87c853e) to head (4bed42d). Report is 440 commits behind head on master.

Files with missing lines Patch % Lines
controller/cache/cache.go 40.00% 2 Missing and 1 partial :warning:
controller/metrics/metrics.go 84.21% 3 Missing :warning:
Additional details and impacted files
@@            Coverage Diff             @@
##           master   #20329      +/-   ##
==========================================
+ Coverage   53.80%   55.23%   +1.42%     
==========================================
  Files         324      324              
  Lines       55603    55676      +73     
==========================================
+ Hits        29918    30753     +835     
+ Misses      23082    22294     -788     
- Partials     2603     2629      +26     

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

codecov[bot] avatar Nov 14 '24 18:11 codecov[bot]