argo-cd icon indicating copy to clipboard operation
argo-cd copied to clipboard

fix(applicationset): prevent applicationset progressive sync from stalling in pending

Open wparr-circle opened this issue 1 year ago • 19 comments

Fixes https://github.com/argoproj/argo-cd/issues/12459

ApplicationSet Controller Progressive Sync sometimes looks to get into a state where the Applications are never synced and a rollout never concludes, based on the timestamps where an ApplicationSet may have transitioned as time T and triggered a sync - but somehow an Application Sync Operation indicates that it was triggered much earlier at T-10..N. The disparity looks to possibly be in the order of minutes, we've seen it happen with ~20-30 seconds of disparity. However based on #12459 it looks like could be much more. There is some existing behaviour implemented in https://github.com/argoproj/argo-cd/pull/13926 - which manages to resolve cases where the disparity is lesser than 10s - but doesn't help resolve cases where it exceed this.

This pull resolves the issue by setting an identifier for a sync triggered by the ApplicationSet controller, so it may be compared to the details of an operation ie. it was triggered by the applicationset controller and has the expected id, while still allowing other syncs triggered by other users to result in transitioning from Pending to Progressing provided the sync occured after ApplicationSet Controller transitioned the Application to Pending.

This does not patch a regression, as it's affected previous versions as early as 2.6, 2.7 and still affects 2.10. It may require cherry picking to older versions.

Checklist:

  • [X] Either (a) I've created an enhancement proposal and discussed it with the community, (b) this is a bug fix, or (c) this does not need to be in the release notes.
  • [x] The title of the PR states what changed and the related issues number (used for the release note).
  • [x] The title of the PR conforms to the Toolchain Guide
  • [x] I've included "Closes [ISSUE #]" or "Fixes [ISSUE #]" in the description to automatically close the associated issue.
  • [x] I've updated both the CLI and UI to expose my feature, or I plan to submit a second PR with them.
  • [ ] Does this PR require documentation updates?
  • [ ] I've updated documentation as required by this PR.
  • [x] I have signed off all my commits as required by DCO
  • [x] I have written unit and/or e2e tests for my change. PRs without these are unlikely to be merged.
  • [x] My build is green (troubleshooting builds).
  • [ ] My new feature complies with the feature status guidelines.
  • [x] I have added a brief description of why this PR is necessary and/or what this PR solves.
  • [ ] Optional. My organization is added to USERS.md.
  • [x] Optional. For bug fixes, I've indicated what older releases this fix should be cherry-picked into (this may or may not happen depending on risk/complexity).

wparr-circle avatar Feb 23 '24 19:02 wparr-circle

Beyond using a UUID etc. for tracking a sync triggered by applicationset-controller - I wonder if we should store the revision of a sync when it's out of date / new, which I believe would make such an id redundant. ie. TargetRevision: app.Status.Sync.Revision, and during comparison ensure that the operation sync matches the target revision app.Status.OperationState.SyncResult.Revision == currentAppStatus.TargetRevision.

wparr-circle avatar Feb 26 '24 10:02 wparr-circle

Would this fix be possible without a CRD change? Usually, we have a policy of not cherry-picking changes to CRDs back into already released branches, for update simplicity.

If it's not possible without CRD change, I think we may want to forward-fix this only.

jannfis avatar Mar 01 '24 19:03 jannfis

@jannfis Thanks for the feedback. I'm not sure that in either of the approaches suggested on my part ie. usage of a generated ID (the contents of this PR as of 7c14437), or using out of sync revisions which I've drafted up an additional change targeting my fork here: https://github.com/wparr-circle/argo-cd/pull/1

So I think as you say, this will have to be a forward only change.

Beyond that, I'd like to open for feedback on the thoughts of the maintainers as to which approach they prefer (personally I'm thinking that the approach in https://github.com/wparr-circle/argo-cd/pull/1 is better. However, I'm not 100% if these fields are always available as I am assuming.

wparr-circle avatar Mar 04 '24 18:03 wparr-circle

@wparr-circle I took a look at the PR in your fork, and I also like that change better than the original. It's much simpler and easier to read.

However, I'm not 100% if these fields are always available as I am assuming.

You will get the target revisions as long as the application has synced at least once, otherwise, you'll get an empty list. What exactly is your assumption about this field, i.e. what properties does it need to match to suite your logic?

jannfis avatar Mar 04 '24 23:03 jannfis

@jannfis my assumption was along the lines of what you replied here:

You will get the target revisions as long as the application has synced at least once, otherwise, you'll get an empty list

ie. that it'll be populated with some value in all cases of git, kustomize, helm etc. provided the app has been synced once.

And that the fields of the struct:

  • app.Status.OperationState.SyncResult.Revisions / app.Status.OperationState.SyncResult.Revision - Corresponds to the last revision that a sync operation resolved to.
  • app.Sync.Revisions - Corresponds to the last revision that a reconcile resolved to. ie. when out of sync this will be different from the former when auto-sync is disabled or the sync operation otherwise hasn't run yet.

I've updated this PR with the contents from https://github.com/wparr-circle/argo-cd/pull/1

wparr-circle avatar Mar 05 '24 08:03 wparr-circle

Following up on this @jannfis, any chance a maintainer could have a look over this.

wparr-circle avatar Mar 21 '24 12:03 wparr-circle

@wmgroot any concerns about the approach?

crenshaw-dev avatar Apr 01 '24 14:04 crenshaw-dev

This approach looks good to me.

wmgroot avatar Apr 01 '24 18:04 wmgroot

@wparr-circle just tagging you just in case the previous message from @wmgroot came unnoticed. I think the PR is good to go! good job!!

fcrespofastly avatar Apr 03 '24 08:04 fcrespofastly

@wmgroot can we merge this?

fcrespofastly avatar Apr 04 '24 18:04 fcrespofastly

Sure, go for it.

wmgroot avatar Apr 04 '24 19:04 wmgroot

Sure, go for it. I wish I could! 😅

fcrespofastly avatar Apr 04 '24 20:04 fcrespofastly

lol let me run a quick test locally, then I'll merge. I trust Matt's judgment but also it's always good to get one local run in. :-)

crenshaw-dev avatar Apr 04 '24 20:04 crenshaw-dev

@crenshaw-dev can we merge this now?

aliazlan-t avatar Apr 12 '24 20:04 aliazlan-t

Hello good sirs, does this have a chance to make it into 2.11? 😁

gmauleon avatar Apr 20 '24 04:04 gmauleon

Hey @crenshaw-dev 👋🏻 !

This PR potentially fixes a few issues re stale syncs, do you feel more comfortable merging after testing it?

Thank you!!

fcrespofastly avatar Apr 23 '24 20:04 fcrespofastly

Following up here @crenshaw-dev, did you get a chance to run locally? Any concerns? Thanks

wparr-circle avatar Apr 29 '24 16:04 wparr-circle

Hey @wmgroot @crenshaw-dev 👋🏻

What can we do to help you get this merged?

Thanks!!

fcrespofastly avatar May 13 '24 20:05 fcrespofastly

Codecov Report

Attention: Patch coverage is 78.57143% with 3 lines in your changes are missing coverage. Please review.

Project coverage is 44.94%. Comparing base (5452bf1) to head (7141668). Report is 2 commits behind head on master.

Files Patch % Lines
...cationset/controllers/applicationset_controller.go 78.57% 1 Missing and 2 partials :warning:
Additional details and impacted files
@@           Coverage Diff           @@
##           master   #17296   +/-   ##
=======================================
  Coverage   44.94%   44.94%           
=======================================
  Files         354      354           
  Lines       47739    47745    +6     
=======================================
+ Hits        21454    21457    +3     
- Misses      23482    23483    +1     
- Partials     2803     2805    +2     

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

codecov[bot] avatar May 23 '24 10:05 codecov[bot]

@wmgroot @crenshaw-dev I'm really sorry to insist. What's preventing us from merging this PR? Is there any other concern?

fcrespofastly avatar Jun 03 '24 20:06 fcrespofastly

@wparr-circle thanks a lot for this fix! I have been facing this issue ever since I started using progressive syncs last year. @crenshaw-dev any chance to get this merged soon?

offzale avatar Jun 05 '24 12:06 offzale

Plus one on the above @crenshaw-dev. It would be great to get this merged off, as we would like to be able to use the feature 🙏

Anything I can do to help with any concerns, or just not had time?

wparr-circle avatar Jun 05 '24 22:06 wparr-circle

After a week from applying this fix and hundreds of syncs being triggered, the issue seems to be gone. Thanks a lot once again @wparr-circle! You saved me a lot of time on watching and triggering manunal syncs. 🙏

What I have noticed though is that the last application in the appset is left sometimes in a Pending status after being synced. This happens some times, and it does not prevent the appset from triggering the next sync. After a new sync that status on the application is back to Healthy, some times as well. So, it does not affect negatively the appset sync capabilities, I just saw it as I was watching the appsets' applications closely to monitor the fix.

Anybody else noticed this?

offzale avatar Jun 13 '24 09:06 offzale

Thanks for sharing @offzale, glad you're not seeing the same issue any more!

I haven't noticed any cases where the last application in the AppSet is left in a Pending status personally. I'll keep an eye out for anything like this as we roll out the new version 👀

Is there anything specific that seems to an issue with the Application? Could you share status, events or logs from around the time you observe it happening?

wparr-circle avatar Jun 28 '24 13:06 wparr-circle