argo-cd icon indicating copy to clipboard operation
argo-cd copied to clipboard

feat(appset): Add a cache layer for Argo Projects to speed-up application validation

Open dacofr opened this issue 1 year ago • 8 comments

In our Argo setup, we have several applications sets, and some of them can generate up to 2000 applications. For those applications sets, reconcile loop can take until 2 or 3 minutes. After some investigation and debug, we found that in validateGeneratedApplications a call is done for each application to verify if the related AppProject exists. Once we implemented a cache on AppProject, reconcile loop for appset with more 2K apps drops to ~15-20 seconds.

Number of call to k8s API, before/after using a dedicated image of the appset controller image

Checklist:

  • [ ] Either (a) I've created an enhancement proposal and discussed it with the community, (b) this is a bug fix, or (c) this does not need to be in the release notes.
  • [ ] The title of the PR states what changed and the related issues number (used for the release note).
  • [x] The title of the PR conforms to the Toolchain Guide
  • [ ] I've included "Closes [ISSUE #]" or "Fixes [ISSUE #]" in the description to automatically close the associated issue.
  • [ ] I've updated both the CLI and UI to expose my feature, or I plan to submit a second PR with them.
  • [ ] Does this PR require documentation updates?
  • [ ] I've updated documentation as required by this PR.
  • [x] I have signed off all my commits as required by DCO
  • [x] I have written unit and/or e2e tests for my change. PRs without these are unlikely to be merged.
  • [ ] My build is green (troubleshooting builds).
  • [ ] My new feature complies with the feature status guidelines.
  • [ ] I have added a brief description of why this PR is necessary and/or what this PR solves.
  • [ ] Optional. My organization is added to USERS.md.
  • [ ] Optional. For bug fixes, I've indicated what older releases this fix should be cherry-picked into (this may or may not happen depending on risk/complexity).

dacofr avatar Jun 17 '24 16:06 dacofr

:x: Preview Environment deleted from Bunnyshell

Available commands (reply to this comment):

  • :rocket: /bns:deploy to deploy the environment

bunnyshell[bot] avatar Jun 17 '24 16:06 bunnyshell[bot]

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 55.88%. Comparing base (fa54ce2) to head (add382e). Report is 432 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master   #18703      +/-   ##
==========================================
+ Coverage   55.84%   55.88%   +0.03%     
==========================================
  Files         321      321              
  Lines       44497    44497              
==========================================
+ Hits        24851    24865      +14     
+ Misses      17075    17060      -15     
- Partials     2571     2572       +1     

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

codecov[bot] avatar Jun 17 '24 18:06 codecov[bot]

In fact, regarding PRs like these that improve performance, how can we ensure that such modifications are really effective? Or how much improvement has been made?

Sn0rt avatar Jun 18 '24 09:06 Sn0rt

Hello,

Thanks for your feedback, to be honest I created this PR a little too early :smile:

We are starting today to run this patch-set in our production to have more accurate data, but sounds promising. On my laptop, for an ApplicationSet who generates 2K apps, I drop from 2mins to ~20secs. I went from 2K calls to K8s API to something like 15 (we have lots of apps spread in few projects.)

I will back in few days, we more data.

dacofr avatar Jun 18 '24 10:06 dacofr

Hello,

Thanks for your feedback, to be honest I created this PR a little too early 😄

We are starting today to run this patch-set in our production to have more accurate data, but sounds promising. On my laptop, for an ApplicationSet who generates 2K apps, I drop from 2mins to ~20secs. I went from 2K calls to K8s API to something like 15 (we have lots of apps spread in few projects.)

I will back in few days, we more data.

In fact, I am also thinking about whether I should make a benchmark for this project so that the subsequent performance improvements will at least have a baseline.

@crenshaw-dev any idea for this ?

Sn0rt avatar Jun 18 '24 11:06 Sn0rt

Hi, I rebase my branch to be up-to-date and to continue to work on it, and now each time appset controller is creating/updating an application, I got this message informer is not a kubernetes informer coming from updateCache function, but I don't why ? (same error with or without my change)

dacofr avatar Jul 01 '24 11:07 dacofr

Note : in our production, we still continue to use the initial implementation of this PR with a dedicated cache layer. I switch to use informer and cache capability offered by internal Client attached to the appset controller.

dacofr avatar Jul 12 '24 15:07 dacofr

Hi, I'm back from holiday and ready to continue to work on this PR :smile:

@crenshaw-dev : PR is up-to-date with master branch, if you have time to have look :pray: Thanks

dacofr avatar Aug 14 '24 10:08 dacofr

Hello, Any informations why this one is not embed with the 2.13 release ?

llavaud avatar Nov 06 '24 14:11 llavaud

Hello, Any informations why this one is not embed with the 2.13 release ?

Hello, I'm not part of release process nor Argo maintainer, but I imagine my PR was merged in master after release-2.13 branch was started.

But @crenshaw-dev have more clue and can answer :smiley:

dacofr avatar Nov 06 '24 16:11 dacofr