argo-cd
argo-cd copied to clipboard
fix(applicationset): prevent applicationset progressive sync from stalling in pending
Fixes https://github.com/argoproj/argo-cd/issues/12459
ApplicationSet Controller Progressive Sync sometimes looks to get into a state where the Applications are never synced and a rollout never concludes, based on the timestamps where an ApplicationSet may have transitioned as time T and triggered a sync - but somehow an Application Sync Operation indicates that it was triggered much earlier at T-10..N. The disparity looks to possibly be in the order of minutes, we've seen it happen with ~20-30 seconds of disparity. However based on #12459 it looks like could be much more. There is some existing behaviour implemented in https://github.com/argoproj/argo-cd/pull/13926 - which manages to resolve cases where the disparity is lesser than 10s - but doesn't help resolve cases where it exceed this.
This pull resolves the issue by setting an identifier for a sync triggered by the ApplicationSet controller, so it may be compared to the details of an operation ie. it was triggered by the applicationset controller and has the expected id, while still allowing other syncs triggered by other users to result in transitioning from Pending to Progressing provided the sync occured after ApplicationSet Controller transitioned the Application to Pending.
This does not patch a regression, as it's affected previous versions as early as 2.6, 2.7 and still affects 2.10. It may require cherry picking to older versions.
Checklist:
- [X] Either (a) I've created an enhancement proposal and discussed it with the community, (b) this is a bug fix, or (c) this does not need to be in the release notes.
- [x] The title of the PR states what changed and the related issues number (used for the release note).
- [x] The title of the PR conforms to the Toolchain Guide
- [x] I've included "Closes [ISSUE #]" or "Fixes [ISSUE #]" in the description to automatically close the associated issue.
- [x] I've updated both the CLI and UI to expose my feature, or I plan to submit a second PR with them.
- [ ] Does this PR require documentation updates?
- [ ] I've updated documentation as required by this PR.
- [x] I have signed off all my commits as required by DCO
- [x] I have written unit and/or e2e tests for my change. PRs without these are unlikely to be merged.
- [x] My build is green (troubleshooting builds).
- [ ] My new feature complies with the feature status guidelines.
- [x] I have added a brief description of why this PR is necessary and/or what this PR solves.
- [ ] Optional. My organization is added to USERS.md.
- [x] Optional. For bug fixes, I've indicated what older releases this fix should be cherry-picked into (this may or may not happen depending on risk/complexity).
Beyond using a UUID etc. for tracking a sync triggered by applicationset-controller - I wonder if we should store the revision of a sync when it's out of date / new, which I believe would make such an id redundant.
ie. TargetRevision: app.Status.Sync.Revision
, and during comparison ensure that the operation sync matches the target revision app.Status.OperationState.SyncResult.Revision == currentAppStatus.TargetRevision
.
Would this fix be possible without a CRD change? Usually, we have a policy of not cherry-picking changes to CRDs back into already released branches, for update simplicity.
If it's not possible without CRD change, I think we may want to forward-fix this only.
@jannfis Thanks for the feedback. I'm not sure that in either of the approaches suggested on my part ie. usage of a generated ID (the contents of this PR as of 7c14437), or using out of sync revisions which I've drafted up an additional change targeting my fork here: https://github.com/wparr-circle/argo-cd/pull/1
So I think as you say, this will have to be a forward only change.
Beyond that, I'd like to open for feedback on the thoughts of the maintainers as to which approach they prefer (personally I'm thinking that the approach in https://github.com/wparr-circle/argo-cd/pull/1 is better. However, I'm not 100% if these fields are always available as I am assuming.
@wparr-circle I took a look at the PR in your fork, and I also like that change better than the original. It's much simpler and easier to read.
However, I'm not 100% if these fields are always available as I am assuming.
You will get the target revisions as long as the application has synced at least once, otherwise, you'll get an empty list. What exactly is your assumption about this field, i.e. what properties does it need to match to suite your logic?
@jannfis my assumption was along the lines of what you replied here:
You will get the target revisions as long as the application has synced at least once, otherwise, you'll get an empty list
ie. that it'll be populated with some value in all cases of git, kustomize, helm etc. provided the app has been synced once.
And that the fields of the struct:
-
app.Status.OperationState.SyncResult.Revisions
/app.Status.OperationState.SyncResult.Revision
- Corresponds to the last revision that a sync operation resolved to. -
app.Sync.Revisions
- Corresponds to the last revision that a reconcile resolved to. ie. when out of sync this will be different from the former when auto-sync is disabled or the sync operation otherwise hasn't run yet.
I've updated this PR with the contents from https://github.com/wparr-circle/argo-cd/pull/1
Following up on this @jannfis, any chance a maintainer could have a look over this.
@wmgroot any concerns about the approach?
This approach looks good to me.
@wparr-circle just tagging you just in case the previous message from @wmgroot came unnoticed. I think the PR is good to go! good job!!
@wmgroot can we merge this?
Sure, go for it.
Sure, go for it. I wish I could! 😅
lol let me run a quick test locally, then I'll merge. I trust Matt's judgment but also it's always good to get one local run in. :-)
@crenshaw-dev can we merge this now?
Hello good sirs, does this have a chance to make it into 2.11? 😁
Hey @crenshaw-dev 👋🏻 !
This PR potentially fixes a few issues re stale syncs, do you feel more comfortable merging after testing it?
Thank you!!
Following up here @crenshaw-dev, did you get a chance to run locally? Any concerns? Thanks
Hey @wmgroot @crenshaw-dev 👋🏻
What can we do to help you get this merged?
Thanks!!
Codecov Report
Attention: Patch coverage is 78.57143%
with 3 lines
in your changes are missing coverage. Please review.
Project coverage is 44.94%. Comparing base (
5452bf1
) to head (7141668
). Report is 2 commits behind head on master.
Files | Patch % | Lines |
---|---|---|
...cationset/controllers/applicationset_controller.go | 78.57% | 1 Missing and 2 partials :warning: |
Additional details and impacted files
@@ Coverage Diff @@
## master #17296 +/- ##
=======================================
Coverage 44.94% 44.94%
=======================================
Files 354 354
Lines 47739 47745 +6
=======================================
+ Hits 21454 21457 +3
- Misses 23482 23483 +1
- Partials 2803 2805 +2
:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.
@wmgroot @crenshaw-dev I'm really sorry to insist. What's preventing us from merging this PR? Is there any other concern?
@wparr-circle thanks a lot for this fix! I have been facing this issue ever since I started using progressive syncs last year. @crenshaw-dev any chance to get this merged soon?
Plus one on the above @crenshaw-dev. It would be great to get this merged off, as we would like to be able to use the feature 🙏
Anything I can do to help with any concerns, or just not had time?
After a week from applying this fix and hundreds of syncs being triggered, the issue seems to be gone. Thanks a lot once again @wparr-circle! You saved me a lot of time on watching and triggering manunal syncs. 🙏
What I have noticed though is that the last application in the appset is left sometimes in a Pending
status after being synced. This happens some times, and it does not prevent the appset from triggering the next sync. After a new sync that status on the application is back to Healthy, some times as well. So, it does not affect negatively the appset sync capabilities, I just saw it as I was watching the appsets' applications closely to monitor the fix.
Anybody else noticed this?
Thanks for sharing @offzale, glad you're not seeing the same issue any more!
I haven't noticed any cases where the last application in the AppSet is left in a Pending status personally. I'll keep an eye out for anything like this as we roll out the new version 👀
Is there anything specific that seems to an issue with the Application? Could you share status, events or logs from around the time you observe it happening?