fix(applicationset): ApplicationSets with rolling sync stuck in Pending
In issue https://github.com/argoproj/argo-cd/issues/19535 we have discovered a bug where ApplicationSet Progressive Sync feature gets stuck and it never recovers and get stuck forever unless you manually delete the existing applicationsStatus of the ApplicationSet affected or trigger a manual sync.
When the ApplicationSet is performing a progressive sync, the apps in the step being synced get the status Pending in the ApplicationSet applicationStatus . This means that the apps are gonna be synced and is waiting for the sync to start progressing.
Let's set an example, applicationset generates 3 applications. In the initial moment applicationset points to commit A
applicationset will generate those 3 applications and start the progressive sync, then application 2 is in Pending status, the applicationset status for application 2 is marked to targetrevision for app2 to A
At this moment in time applicationset gets updated to point to commit B, since app2 is in pending state the progressive sync allows it to the app to be synced and hence the app2 is synced to commit B
since applicationset targetrevision for app2 expects to be A but it's B will never move app2 from 'Pending' to 'Progressing' state.
Here is the logic that performs this check.
This new check was introduced in ArgoCD 2.12 causing this bug when a progressive sync is already being performed.
- Because first it will get the
applicationStatusfrom the existing applicationStatus which is the one that has the old revision - And since the app inside the ApplicationSet
applicationStatusis in "Pending" the revision is never updated when it enters the if statement (see how currentAppStatus.TargetRevision never will be updated)
This means that the ApplicationSet will always think that the app is not being synced to the latest version and never progress, but in reality the app is actually in a later version but tha ApplicationSet never updated it in the apps that are in "Pending".
-
This PR fixes this bug changing the logic that checks when an applications needs to be moved from Pending to Progressing, instead of rely on the targetrevision we actually rely just in the application being synced to move it. This also don't introduce a prior bug where it was cheched that the application was synced in a certain moment in time to ensure it was triggered by the applicationset controller.
-
Note that if someone manually sync one application of the applicationset while it's being progresively synced after merging this PR the applciationset controller will continue the rollout
-
Ensure that a certain revision is applied orderly in all applications generated from the applicationset it's certainly possible that a given application can be synced in a different revision than the one explicitly set in the appset
Fixes: #19535
Checklist:
- [X] Either (a) I've created an enhancement proposal and discussed it with the community, (b) this is a bug fix, or (c) this does not need to be in the release notes.
- [X] The title of the PR states what changed and the related issues number (used for the release note).
- [X] The title of the PR conforms to the Toolchain Guide
- [X] I've included "Closes [ISSUE #]" or "Fixes [ISSUE #]" in the description to automatically close the associated issue.
- [ ] I've updated both the CLI and UI to expose my feature, or I plan to submit a second PR with them.
- [ ] Does this PR require documentation updates?
- [ ] I've updated documentation as required by this PR.
- [X] I have signed off all my commits as required by DCO
- [X] I have written unit and/or e2e tests for my change. PRs without these are unlikely to be merged.
- [X] My build is green (troubleshooting builds).
- [ ] My new feature complies with the feature status guidelines.
- [ ] I have added a brief description of why this PR is necessary and/or what this PR solves.
- [ ] Optional. My organization is added to USERS.md.
- [ ] Optional. For bug fixes, I've indicated what older releases this fix should be cherry-picked into (this may or may not happen depending on risk/complexity).
:x: Preview Environment deleted from Bunnyshell
Available commands (reply to this comment):
- :rocket:
/bns:deployto deploy the environment
:x: Preview Environment deleted from Bunnyshell
Available commands (reply to this comment):
- :rocket:
/bns:deployto deploy the environment
@wmgroot thoughts on this?
Codecov Report
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 55.62%. Comparing base (
99cd3c7) to head (27be30e). Report is 502 commits behind head on master.
Additional details and impacted files
@@ Coverage Diff @@
## master #20230 +/- ##
==========================================
+ Coverage 53.38% 55.62% +2.23%
==========================================
Files 339 339
Lines 56836 56828 -8
==========================================
+ Hits 30343 31608 +1265
+ Misses 23882 22565 -1317
- Partials 2611 2655 +44
:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.
Came here to report that we are running this in production since 2 weeks ago and it fixed our issues syncing applicationsets when the conditions explained happened
anyone else than @wmgroot can take a look?
I've chatted w/ some folks at Red Hat. They're interested in moving Progressive Syncs from alpha to beta, so there's a chance they'll be able to put some time into reviews of progressive syncs bugs soon.
@crenshaw-dev happy to share with anyone proof of work on this or contribute further
Thanks for fixing this @Fsero !
@crenshaw-dev @wmgroot I'll also roll out this change in our environment and report if I find any issue if that helps getting this merged.
I have tested this fix in our environment with ArgoCD v2.12 code base while replacing applicationset controller image with the image built from this branch and I can confirm this fixed the issue with stuck rolling sync in our monorepo when new (unrelated) commits are pushed, the updated targetRevisions status don't block the progression of rolling sync.
@Fsero can you please take a look at the failed Integration tests since it's required to pass before merging?
@mubarak-j errors are gone and don't seem to be related with the code of the PR as just rebasing the PR fixes them
Hey 👋🏻 @crenshaw-dev any input on this PR?, it's been running in prod for a while for the authors and other folks. I'm personally not hitting this cause we switched to sourcing apps from a Helm registry with pinned charts. But the proposed change seems to make sense to me
Is there anything blocking this? This bug renders the Progressive Rollout feature unusable.
/cherry-pick release-2.14