argo-cd icon indicating copy to clipboard operation
argo-cd copied to clipboard

fix(applicationset): ApplicationSets with rolling sync stuck in Pending

Open Fsero opened this issue 1 year ago • 4 comments

In issue https://github.com/argoproj/argo-cd/issues/19535 we have discovered a bug where ApplicationSet Progressive Sync feature gets stuck and it never recovers and get stuck forever unless you manually delete the existing applicationsStatus of the ApplicationSet affected or trigger a manual sync.

When the ApplicationSet is performing a progressive sync, the apps in the step being synced get the status Pending in the ApplicationSet applicationStatus . This means that the apps are gonna be synced and is waiting for the sync to start progressing.

Let's set an example, applicationset generates 3 applications. In the initial moment applicationset points to commit A

applicationset will generate those 3 applications and start the progressive sync, then application 2 is in Pending status, the applicationset status for application 2 is marked to targetrevision for app2 to A

At this moment in time applicationset gets updated to point to commit B, since app2 is in pending state the progressive sync allows it to the app to be synced and hence the app2 is synced to commit B

since applicationset targetrevision for app2 expects to be A but it's B will never move app2 from 'Pending' to 'Progressing' state.

Here is the logic that performs this check.

This new check was introduced in ArgoCD 2.12 causing this bug when a progressive sync is already being performed.

  • Because first it will get the applicationStatus from the existing applicationStatus which is the one that has the old revision
  • And since the app inside the ApplicationSet applicationStatus is in "Pending" the revision is never updated when it enters the if statement (see how currentAppStatus.TargetRevision never will be updated)

This means that the ApplicationSet will always think that the app is not being synced to the latest version and never progress, but in reality the app is actually in a later version but tha ApplicationSet never updated it in the apps that are in "Pending".

  • This PR fixes this bug changing the logic that checks when an applications needs to be moved from Pending to Progressing, instead of rely on the targetrevision we actually rely just in the application being synced to move it. This also don't introduce a prior bug where it was cheched that the application was synced in a certain moment in time to ensure it was triggered by the applicationset controller.

  • Note that if someone manually sync one application of the applicationset while it's being progresively synced after merging this PR the applciationset controller will continue the rollout

  • Ensure that a certain revision is applied orderly in all applications generated from the applicationset it's certainly possible that a given application can be synced in a different revision than the one explicitly set in the appset

Fixes: #19535

Checklist:

  • [X] Either (a) I've created an enhancement proposal and discussed it with the community, (b) this is a bug fix, or (c) this does not need to be in the release notes.
  • [X] The title of the PR states what changed and the related issues number (used for the release note).
  • [X] The title of the PR conforms to the Toolchain Guide
  • [X] I've included "Closes [ISSUE #]" or "Fixes [ISSUE #]" in the description to automatically close the associated issue.
  • [ ] I've updated both the CLI and UI to expose my feature, or I plan to submit a second PR with them.
  • [ ] Does this PR require documentation updates?
  • [ ] I've updated documentation as required by this PR.
  • [X] I have signed off all my commits as required by DCO
  • [X] I have written unit and/or e2e tests for my change. PRs without these are unlikely to be merged.
  • [X] My build is green (troubleshooting builds).
  • [ ] My new feature complies with the feature status guidelines.
  • [ ] I have added a brief description of why this PR is necessary and/or what this PR solves.
  • [ ] Optional. My organization is added to USERS.md.
  • [ ] Optional. For bug fixes, I've indicated what older releases this fix should be cherry-picked into (this may or may not happen depending on risk/complexity).

Fsero avatar Oct 04 '24 13:10 Fsero

:x: Preview Environment deleted from Bunnyshell

Available commands (reply to this comment):

  • :rocket: /bns:deploy to deploy the environment

bunnyshell[bot] avatar Oct 04 '24 13:10 bunnyshell[bot]

:x: Preview Environment deleted from Bunnyshell

Available commands (reply to this comment):

  • :rocket: /bns:deploy to deploy the environment

bunnyshell[bot] avatar Oct 04 '24 13:10 bunnyshell[bot]

@wmgroot thoughts on this?

crenshaw-dev avatar Oct 06 '24 21:10 crenshaw-dev

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 55.62%. Comparing base (99cd3c7) to head (27be30e). Report is 502 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master   #20230      +/-   ##
==========================================
+ Coverage   53.38%   55.62%   +2.23%     
==========================================
  Files         339      339              
  Lines       56836    56828       -8     
==========================================
+ Hits        30343    31608    +1265     
+ Misses      23882    22565    -1317     
- Partials     2611     2655      +44     

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

codecov[bot] avatar Oct 07 '24 10:10 codecov[bot]

Came here to report that we are running this in production since 2 weeks ago and it fixed our issues syncing applicationsets when the conditions explained happened

Fsero avatar Oct 21 '24 15:10 Fsero

anyone else than @wmgroot can take a look?

Fsero avatar Oct 21 '24 15:10 Fsero

I've chatted w/ some folks at Red Hat. They're interested in moving Progressive Syncs from alpha to beta, so there's a chance they'll be able to put some time into reviews of progressive syncs bugs soon.

crenshaw-dev avatar Nov 08 '24 15:11 crenshaw-dev

@crenshaw-dev happy to share with anyone proof of work on this or contribute further

Fsero avatar Nov 19 '24 15:11 Fsero

Thanks for fixing this @Fsero !

@crenshaw-dev @wmgroot I'll also roll out this change in our environment and report if I find any issue if that helps getting this merged.

fcrespofastly avatar Nov 21 '24 15:11 fcrespofastly

I have tested this fix in our environment with ArgoCD v2.12 code base while replacing applicationset controller image with the image built from this branch and I can confirm this fixed the issue with stuck rolling sync in our monorepo when new (unrelated) commits are pushed, the updated targetRevisions status don't block the progression of rolling sync.

mubarak-j avatar Nov 25 '24 23:11 mubarak-j

@Fsero can you please take a look at the failed Integration tests since it's required to pass before merging?

mubarak-j avatar Dec 18 '24 15:12 mubarak-j

@mubarak-j errors are gone and don't seem to be related with the code of the PR as just rebasing the PR fixes them

Fsero avatar Jan 07 '25 10:01 Fsero

Hey 👋🏻 @crenshaw-dev any input on this PR?, it's been running in prod for a while for the authors and other folks. I'm personally not hitting this cause we switched to sourcing apps from a Helm registry with pinned charts. But the proposed change seems to make sense to me

fcrespofastly avatar Jan 21 '25 18:01 fcrespofastly

Is there anything blocking this? This bug renders the Progressive Rollout feature unusable.

purkhusid avatar Feb 18 '25 13:02 purkhusid

/cherry-pick release-2.14

jannfis avatar Feb 22 '25 18:02 jannfis