argo-cd icon indicating copy to clipboard operation
argo-cd copied to clipboard

Argo CD 2.5.10 Incorrectly Pruned

Open rumstead opened this issue 1 year ago • 15 comments

Checklist:

  • [x] I've searched in the docs and FAQ for my answer: https://bit.ly/argocd-faq.
  • [] I've included steps to reproduce the bug.
  • [x] I've pasted the output of argocd version.

Describe the bug

We saw Argo CD mass prune an entire application incorrectly. Sadly, these resources caused a rather large cascading effect, deleting the majority of our Kubernetes deployments.

After a hard refresh, Argo CD correctly identified that the manifests it pruned, shouldn't have been pruned. Additionally, a git ls-files confirms the files that were pruned are on the commit.

The log message below is very similar to this issue #2573 and I have a feeling it's semi-related. We are planning on upgrading to 2.8 soon but wanted to see if anyone knows of an issue or fix that was put in v2.5.10+ already.

To Reproduce Been trying :-/

Expected behavior Manifests in Git shouldn't be pruned

Screenshots Attached are two screenshots showing the prune and then Argo CD saying they are out of sync after a hard sync.

Version

argocd-server: v2.5.10+d311fad
  BuildDate: 2023-02-02T15:10:21Z
  GitCommit: d311fad538178ef97528edf7f09d1d5b57268f45
  GitTreeState: clean
  GoVersion: go1.18.10
  Compiler: gc
  Platform: linux/amd64
  Kustomize Version: v4.5.7 2022-08-02T16:35:48Z
  Helm Version: v3.10.3+g835b733
  Kubectl Version: v0.24.2
  Jsonnet Version: v0.18.0

Logs

time="2023-08-12T17:25:47Z" level=info msg="manifest cache miss: &ApplicationSource{RepoURL:https://...,Path:apps/,TargetRevision:HEAD,Helm:nil,Kustomize:nil,Directory:&ApplicationSourceDirectory{Recurse:true,Jsonnet:ApplicationSourceJsonnet{ExtVars:[]JsonnetVar{},TLAs:[]JsonnetVar{},Libs:[],},Exclude:,Include:,},Plugin:nil,Chart:,}/9b9ff97b11d83373bdd4d05fadcf20c791e5600e"

time="2023-08-12T17:26:25Z" level=info msg=Syncing application=namespace/tenants skipHooks=false started=false syncId=00479-rtgeN

time="2023-08-12T17:26:25Z" level=info msg="Updating operation state. phase: Running -> Running, message: '' -> 'one or more tasks are running'" application=namespace/tenants syncId=00479-rtgeN

2023-08-12T17:26:25 ons-service obj->nil (,,), Sync/0 resource internal.com/Tenant:/app-ui obj->nil (,,), Sync/0 resource internal.com/Tenant:/preview-portal obj->nil (,,), Sync/0 resource internal.com/Tenant:/printing-press obj->nil (,,), Sync/0 resource internal.com/Tenant:/product-service-tenant obj->nil (,,), Sync/0 resource internal.com/Tenant:/provider-admin-api-tenant obj->nil (,,), Sync/0 resource internal.com/Tenant:/provider-call-or-distribution-validation-service-tenant obj->nil (,,), Sync/0 resource internal.com/Tenant:/provider-email-service-tenant obj->nil (,,)...]"

time="2023-08-12T17:26:25Z" level=info msg="Adding resource result, status: 'Pruned', phase: 'Succeeded', message: 'pruned'" application=namespace/tenants kind=Tenant name=provider-email-service-tenant namespace= phase=Sync syncId=00479-rtgeN

... many many more pruned

time="2023-08-12T17:26:26Z" level=info msg="sync/terminate complete" application=namespace/tenants duration=1.014547622s syncId=00479-rtgeN

time="2023-08-12T17:26:26Z" level=info msg="updated 'namespace/tenants' operation (phase: Succeeded)"

time="2023-08-12T17:25:52Z" level=info msg="Reconciliation completed" application=namespace/tenants dedup_ms=3 dest-
name= dest-namespace=namespace2 dest-server="https://kubernetes.default.svc" diff_ms=854 fields.level=2 git_ms=5581 health_ms=2 live_ms=90 settings_ms=0 sync_ms=3 time_ms=10483

Screenshot 2023-08-12 at 7 46 45 PM copy Screenshot 2023-08-12 at 7 46 15 PM copy

rumstead avatar Aug 15 '23 13:08 rumstead

It deleted every single resource? I thought there was a failsafe which would block syncs if every resource was marked to be pruned.

crenshaw-dev avatar Aug 15 '23 13:08 crenshaw-dev

It deleted every single resource? I thought there was a failsafe which would block syncs if every resource was marked to be pruned.

it deleted every resource :-/ deployed by that application and then the finalizers took over downstream.

rumstead avatar Aug 15 '23 13:08 rumstead

I feel like the only real hint from the logs is all the obj->nil messages.

crenshaw-dev avatar Aug 15 '23 13:08 crenshaw-dev

Agreed, from my understanding of the code:

  1. I believe we are seeing the obj->nil logs from here based on the String() method.
    • where the live obj isn't nil but the target obj is since we see obj -> nil
  2. here is where i believe we are getting the nil results
    • I have no idea what local manifests are and just assume they are empty because i can't find them really anywhere in the code other than saying its for testing

rumstead avatar Aug 15 '23 13:08 rumstead

Local manifests support the argocd app sync --local feature, i.e. temporary overrides from a local manifest source. Probably not relevant in this case.

Feels like the problem must be similar to https://github.com/argoproj/argo-cd/issues/2573 - some error being ignored silently.

crenshaw-dev avatar Aug 15 '23 13:08 crenshaw-dev

The resources deployed are a CRD that we created. I ran them through the code a few times without any issues unmarshaling and after the hard refresh it worked. It didn't like something :-/.

Do you happen to know if there is a code to safeguard against pruning everything?

rumstead avatar Aug 15 '23 13:08 rumstead

There is https://github.com/argoproj/argo-cd/blob/94b986da2e99e0b0a76a9d3bae5032c04d6c4395/controller/appcontroller.go#L1728

crenshaw-dev avatar Aug 15 '23 13:08 crenshaw-dev

The CRDs are cluster scoped

❯ k api-resources --api-group=internal.com
NAME      SHORTNAMES   APIVERSION               NAMESPACED   KIND
tenants   tn,tnt       internal.com/v1alpha1   false        Tenant

one thing we also noticed.

In the prune case, the sync tasks look like this (ie, a missing namespace between the objects)

 Sync/0 resource internal.com/Tenant:/app-ui obj->nil

After the hard refresh, they look like this

 Sync/0 resource internal.com/Tenant:namespace/app-ui obj->obj

rumstead avatar Aug 15 '23 14:08 rumstead

Hm. Did you make changes to the CRD recently to move from namespaced to non-namespaced?

Figuring that the prune happened against revision 9b9ff97b11d83373bdd4d05fadcf20c791e5600e, have you had a look at repository server logs? Have there been some errors recorded for rendering this particular revision?

jannfis avatar Aug 15 '23 15:08 jannfis

Hm. Did you make changes to the CRD recently to move from namespaced to non-namespaced?

Figuring that the prune happened against revision 9b9ff97b11d83373bdd4d05fadcf20c791e5600e, have you had a look at repository server logs? Have there been some errors recorded for rendering this particular revision?

No recent changes in the CRD, it has always been cluster scoped but the Argo CD Application that deploys the Tenant resources has a namespace defined in it (though it always has). No errors in the repo server logs against that commit :(.

The only logs I have tracked own in the repo server are the GenerateManfiest grpc call from getRepoObjs and the subsequent git checkout --force.

time="2023-08-12T17:25:47Z" level=info msg="manifest cache miss: &ApplicationSource{RepoURL:https://...,Path:apps/,TargetRevision:HEAD,Helm:nil,Kustomize:nil,Directory:&ApplicationSourceDirectory{Recurse:true,Jsonnet:ApplicationSourceJsonnet{ExtVars:[]JsonnetVar{},TLAs:[]JsonnetVar{},Libs:[],},Exclude:,Include:,},Plugin:nil,Chart:,}/9b9ff97b11d83373bdd4d05fadcf20c791e5600e"

and then further cache hits

time="2023-08-12T17:25:54Z" level=info msg="manifest cache hit: &ApplicationSource{RepoURL:https://...,Path:apps/,TargetRevision:HEAD,Helm:nil,Kustomize:nil,Directory:&ApplicationSourceDirectory{Recurse:true,Jsonnet:ApplicationSourceJsonnet{ExtVars:[]JsonnetVar{},TLAs:[]JsonnetVar{},Libs:[],},Exclude:,Include:,},Plugin:nil,Chart:,}/9b9ff97b11d83373bdd4d05fadcf20c791e5600e"

EDIT: since targetObj was nil, it uses liveObj. Since liveObj doesn't have the namespace (its cluster scoped) no namespace is used

EDIT2: Prune does indeed work... (duh!)

Screenshot 2023-08-15 at 3 49 09 PM

rumstead avatar Aug 15 '23 15:08 rumstead

We have tried relentlessly to reproduce this and haven't been able to. I am going to close it and will reopen if we get more details. Thanks for everyone's suggestions. Hopefully, we don't see this again.

rumstead avatar Aug 21 '23 15:08 rumstead

Hi,

sorry to reopen this issue but we have experimented the same behaviour. In the logs we see that application-controller ends a Reconciliation in both pods (we have HA) without changes:

Log PodName
time="2024-04-23T11:13:09Z" level=info msg="Comparing app state (cluster: https://kubernetes.default.svc, namespace: sdk-builder)" application=argocd/sdk-builder argocd-application-controller-0
time="2024-04-23T11:13:09Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argocd/sdk-builder argocd-application-controller-0
time="2024-04-23T11:13:09Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argocd/sdk-builder argocd-application-controller-1
time="2024-04-23T11:13:09Z" level=info msg="Comparing app state (cluster: https://kubernetes.default.svc, namespace: sdk-builder)" application=argocd/sdk-builder argocd-application-controller-1
time="2024-04-23T11:13:09Z" level=info msg="manifest cache hit: &ApplicationSource{RepoURL:[email protected]:MyGitAccount/myNiceRepo.git,Path:./,TargetRevision:HEAD,Helm:nil,Kustomize:nil,Directory:nil,Plugin:&ApplicationSourcePlugin{Name:DecryptFromSopsForPreviewEnv,Env:[]*EnvEntry{},Parameters:[]ApplicationSourcePluginParameter{},},Chart:,Ref:,}/1e189c27189403969bd31706760e7a43c2fc833a" argocd-repo-server-667cd4fb45-89pb2
time="2024-04-23T11:13:09Z" level=info msg="manifest cache hit: &ApplicationSource{RepoURL:[email protected]:MyGitAccount/myNiceRepo.git,Path:./,TargetRevision:HEAD,Helm:nil,Kustomize:nil,Directory:nil,Plugin:&ApplicationSourcePlugin{Name:DecryptFromSopsForPreviewEnv,Env:[]*EnvEntry{},Parameters:[]ApplicationSourcePluginParameter{},},Chart:,Ref:,}/1e189c27189403969bd31706760e7a43c2fc833a" argocd-repo-server-667cd4fb45-89pb2
time="2024-04-23T11:13:09Z" level=info msg="getRepoObjs stats" application=argocd/sdk-builder build_options_ms=0 helm_ms=0 plugins_ms=0 repo_ms=0 time_ms=11 unmarshal_ms=10 version_ms=0 argocd-application-controller-1
time="2024-04-23T11:13:09Z" level=info msg="getRepoObjs stats" application=argocd/sdk-builder build_options_ms=0 helm_ms=8 plugins_ms=0 repo_ms=0 time_ms=16 unmarshal_ms=8 version_ms=0 argocd-application-controller-0
time="2024-04-23T11:13:09Z" level=info msg="Skipping auto-sync: application status is Synced" application=argocd/sdk-builder argocd-application-controller-0
time="2024-04-23T11:13:09Z" level=info msg="Skipping auto-sync: application status is Synced" application=argocd/sdk-builder argocd-application-controller-1
time="2024-04-23T11:13:09Z" level=info msg="No status changes. Skipping patch" application=argocd/sdk-builder argocd-application-controller-0
time="2024-04-23T11:13:09Z" level=info msg="Reconciliation completed" application=argocd/sdk-builder dedup_ms=0 dest-name= dest-namespace=sdk-builder dest-server="https://kubernetes.default.svc" diff_ms=35 fields.level=1 git_ms=16 health_ms=2 live_ms=9 settings_ms=0 sync_ms=0 time_ms=92 argocd-application-controller-0
time="2024-04-23T11:13:09Z" level=info msg="Reconciliation completed" application=argocd/sdk-builder dedup_ms=0 dest-name= dest-namespace=sdk-builder dest-server="https://kubernetes.default.svc" diff_ms=43 fields.level=1 git_ms=11 health_ms=2 live_ms=5 settings_ms=0 sync_ms=0 time_ms=91 argocd-application-controller-1
time="2024-04-23T11:13:09Z" level=info msg="No status changes. Skipping patch" application=argocd/sdk-builder argocd-application-controller-1

After just 10 seconds, both application servers starts again the resync process to the same git commit (1e189c27189403969bd31706760e7a43c2fc833a). In that case, the repo server reports a cache miss and starts a new sync which starts to prune resources:

Log PodName
time="2024-04-23T11:13:14Z" level=info msg="Start processing" resource=argocd/sdk-builder argocd-notifications-controller-65cffbdbdd-qh8fk
time="2024-04-23T11:13:14Z" level=info msg="Processing completed" resource=argocd/sdk-builder argocd-notifications-controller-65cffbdbdd-qh8fk
time="2024-04-23T11:13:24Z" level=info msg="Comparing app state (cluster: https://kubernetes.default.svc, namespace: sdk-builder)" application=argocd/sdk-builder argocd-application-controller-0
time="2024-04-23T11:13:24Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argocd/sdk-builder argocd-application-controller-0
time="2024-04-23T11:13:24Z" level=info msg="manifest cache miss: &ApplicationSource{RepoURL:[email protected]:MyGitAccount/myNiceRepo.git,Path:./,TargetRevision:HEAD,Helm:nil,Kustomize:nil,Directory:nil,Plugin:&ApplicationSourcePlugin{Name:DecryptFromSopsForPreviewEnv,Env:[]*EnvEntry{},Parameters:[]ApplicationSourcePluginParameter{},},Chart:,Ref:,}/1e189c27189403969bd31706760e7a43c2fc833a" argocd-repo-server-667cd4fb45-89pb2
time="2024-04-23T11:13:24Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argocd/sdk-builder argocd-application-controller-1
time="2024-04-23T11:13:24Z" level=info msg="Comparing app state (cluster: https://kubernetes.default.svc, namespace: sdk-builder)" application=argocd/sdk-builder argocd-application-controller-1
time="2024-04-23T11:13:24Z" level=info msg="manifest cache miss: &ApplicationSource{RepoURL:[email protected]:MyGitAccount/myNiceRepo.git,Path:./,TargetRevision:HEAD,Helm:nil,Kustomize:nil,Directory:nil,Plugin:&ApplicationSourcePlugin{Name:DecryptFromSopsForPreviewEnv,Env:[]*EnvEntry{},Parameters:[]ApplicationSourcePluginParameter{},},Chart:,Ref:,}/1e189c27189403969bd31706760e7a43c2fc833a" argocd-repo-server-667cd4fb45-89pb2
time="2024-04-23T11:13:26Z" level=info msg="manifest cache miss: &ApplicationSource{RepoURL:[email protected]:MyGitAccount/myNiceRepo.git,Path:./,TargetRevision:HEAD,Helm:nil,Kustomize:nil,Directory:nil,Plugin:&ApplicationSourcePlugin{Name:DecryptFromSopsForPreviewEnv,Env:[]*EnvEntry{},Parameters:[]ApplicationSourcePluginParameter{},},Chart:,Ref:,}/1e189c27189403969bd31706760e7a43c2fc833a" argocd-repo-server-667cd4fb45-89pb2
time="2024-04-23T11:13:26Z" level=info msg="manifest cache miss: &ApplicationSource{RepoURL:[email protected]:MyGitAccount/myNiceRepo.git,Path:./,TargetRevision:HEAD,Helm:nil,Kustomize:nil,Directory:nil,Plugin:&ApplicationSourcePlugin{Name:DecryptFromSopsForPreviewEnv,Env:[]*EnvEntry{},Parameters:[]ApplicationSourcePluginParameter{},},Chart:,Ref:,}/1e189c27189403969bd31706760e7a43c2fc833a" argocd-repo-server-667cd4fb45-89pb2
time="2024-04-23T11:13:28Z" level=info msg="getRepoObjs stats" application=argocd/sdk-builder build_options_ms=0 helm_ms=0 plugins_ms=0 repo_ms=0 time_ms=3635 unmarshal_ms=3635 version_ms=0 argocd-application-controller-0
time="2024-04-23T11:13:28Z" level=info msg="Start processing" resource=argocd/sdk-builder argocd-notifications-controller-65cffbdbdd-qh8fk
time="2024-04-23T11:13:28Z" level=info msg="Processing completed" resource=argocd/sdk-builder argocd-notifications-controller-65cffbdbdd-qh8fk
time="2024-04-23T11:13:28Z" level=info msg="Initiated automated sync to '1e189c27189403969bd31706760e7a43c2fc833a'" application=sdk-builder dest-namespace=sdk-builder dest-server="https://kubernetes.default.svc" reason=OperationStarted type=Normal argocd-application-controller-0
time="2024-04-23T11:13:28Z" level=info msg="Initiated automated sync to '1e189c27189403969bd31706760e7a43c2fc833a'" application=argocd/sdk-builder argocd-application-controller-0
time="2024-04-23T11:13:28Z" level=info msg="Updated sync status: Synced -> OutOfSync" application=sdk-builder dest-namespace=sdk-builder dest-server="https://kubernetes.default.svc" reason=ResourceUpdated type=Normal argocd-application-controller-0
time="2024-04-23T11:13:28Z" level=info msg="Processing completed" resource=argocd/sdk-builder argocd-notifications-controller-65cffbdbdd-qh8fk
time="2024-04-23T11:13:28Z" level=info msg="Start processing" resource=argocd/sdk-builder argocd-notifications-controller-65cffbdbdd-qh8fk
time="2024-04-23T11:13:28Z" level=info msg="Initialized new operation: {&SyncOperation{Revision:1e189c27189403969bd31706760e7a43c2fc833a,Prune:true,DryRun:false,SyncStrategy:nil,Resources:[]SyncOperationResource{SyncOperationResource{Group:,Kind:ServiceAccount,Name:sdk-builder,Namespace:,},SyncOperationResource{Group:cert-manager.io,Kind:Certificate,Name:sdk-builder-mrfio-https,Namespace:,},SyncOperationResource{Group:apps,Kind:Deployment,Name:sdk-builder,Namespace:,},SyncOperationResource{Group:rbac.authorization.k8s.io,Kind:RoleBinding,Name:sdk-builder-rolebinding,Namespace:,},SyncOperationResource{Group:networking.istio.io,Kind:Gateway,Name:sdk-builder-cluster,Namespace:,},SyncOperationResource{Group:networking.istio.io,Kind:VirtualService,Name:sdk-builder-leaderelection,Namespace:,},SyncOperationResource{Group:networking.istio.io,Kind:VirtualService,Name:sdk-builder-vs,Namespace:,},SyncOperationResource{Group:,Kind:Secret,Name:sdk-builder-env,Namespace:,},SyncOperationResource{Group:networking.istio.io,Kind:Gateway,Name:sdkmrfio-gw,Namespace:,},SyncOperationResource{Group:policy,Kind:PodDisruptionBudget,Name:sdk-builder-pdb,Namespace:,},SyncOperationResource{Group:,Kind:Namespace,Name:sdk-builder,Namespace:,},SyncOperationResource{Group:cert-manager.io,Kind:Certificate,Name:sdk-builder-cluster-https,Namespace:,},SyncOperationResource{Group:networking.istio.io,Kind:Gateway,Name:sdk-builder-mrfio,Namespace:,},SyncOperationResource{Group:networking.istio.io,Kind:ServiceEntry,Name:sdk-mrf-io-cloudfront,Namespace:,},SyncOperationResource{Group:networking.istio.io,Kind:Gateway,Name:sdk-builder-leaderelection,Namespace:,},SyncOperationResource{Group:networking.istio.io,Kind:DestinationRule,Name:sdk-mrfio-dest,Namespace:,},SyncOperationResource{Group:monitoring.coreos.com,Kind:ServiceMonitor,Name:appmetrics-sdk-builder,Namespace:,},SyncOperationResource{Group:networking.istio.io,Kind:DestinationRule,Name:sdk-builder,Namespace:,},SyncOperationResource{Group:rbac.authorization.k8s.io,Kind:Role,Name:sdk-builder-role,Namespace:,},SyncOperationResource{Group:cert-manager.io,Kind:Certificate,Name:sdk-builder.leaderelection.mrf.io,Namespace:,},SyncOperationResource{Group:cert-manager.io,Kind:Certificate,Name:sdkmrfio-https,Namespace:,},SyncOperationResource{Group:,Kind:Service,Name:sdk-builder,Namespace:,},SyncOperationResource{Group:autoscaling,Kind:HorizontalPodAutoscaler,Name:sdk-builder-autoscaler,Namespace:,},SyncOperationResource{Group:security.istio.io,Kind:AuthorizationPolicy,Name:sdk-builder,Namespace:,},},Source:nil,Manifests:[],SyncOptions:[ServerSideApply=true],Sources:[]ApplicationSource{},Revisions:[],} { true} [] {5 nil}}" application=argocd/sdk-builder argocd-application-controller-1
time="2024-04-23T11:13:28Z" level=info msg="Comparing app state (cluster: https://kubernetes.default.svc, namespace: sdk-builder)" application=argocd/sdk-builder argocd-application-controller-1
time="2024-04-23T11:13:28Z" level=info msg="updated 'argocd/sdk-builder' operation (phase: Running)" argocd-application-controller-1
time="2024-04-23T11:13:28Z" level=info msg="manifest cache hit: &ApplicationSource{RepoURL:[email protected]:MyGitAccount/myNiceRepo.git,Path:./,TargetRevision:HEAD,Helm:nil,Kustomize:nil,Directory:nil,Plugin:&ApplicationSourcePlugin{Name:DecryptFromSopsForPreviewEnv,Env:[]*EnvEntry{},Parameters:[]ApplicationSourcePluginParameter{},},Chart:,Ref:,}/1e189c27189403969bd31706760e7a43c2fc833a" argocd-repo-server-667cd4fb45-6q492
time="2024-04-23T11:13:28Z" level=info msg="getRepoObjs stats" application=argocd/sdk-builder build_options_ms=0 helm_ms=0 plugins_ms=0 repo_ms=0 time_ms=7 unmarshal_ms=7 version_ms=0 argocd-application-controller-1
time="2024-04-23T11:13:28Z" level=info msg="Initialized new operation: {&SyncOperation{Revision:1e189c27189403969bd31706760e7a43c2fc833a,Prune:true,DryRun:false,SyncStrategy:nil,Resources:[]SyncOperationResource{SyncOperationResource{Group:,Kind:ServiceAccount,Name:sdk-builder,Namespace:,},SyncOperationResource{Group:cert-manager.io,Kind:Certificate,Name:sdk-builder-mrfio-https,Namespace:,},SyncOperationResource{Group:apps,Kind:Deployment,Name:sdk-builder,Namespace:,},SyncOperationResource{Group:rbac.authorization.k8s.io,Kind:RoleBinding,Name:sdk-builder-rolebinding,Namespace:,},SyncOperationResource{Group:networking.istio.io,Kind:Gateway,Name:sdk-builder-cluster,Namespace:,},SyncOperationResource{Group:networking.istio.io,Kind:VirtualService,Name:sdk-builder-leaderelection,Namespace:,},SyncOperationResource{Group:networking.istio.io,Kind:VirtualService,Name:sdk-builder-vs,Namespace:,},SyncOperationResource{Group:,Kind:Secret,Name:sdk-builder-env,Namespace:,},SyncOperationResource{Group:networking.istio.io,Kind:Gateway,Name:sdkmrfio-gw,Namespace:,},SyncOperationResource{Group:policy,Kind:PodDisruptionBudget,Name:sdk-builder-pdb,Namespace:,},SyncOperationResource{Group:,Kind:Namespace,Name:sdk-builder,Namespace:,},SyncOperationResource{Group:cert-manager.io,Kind:Certificate,Name:sdk-builder-cluster-https,Namespace:,},SyncOperationResource{Group:networking.istio.io,Kind:Gateway,Name:sdk-builder-mrfio,Namespace:,},SyncOperationResource{Group:networking.istio.io,Kind:ServiceEntry,Name:sdk-mrf-io-cloudfront,Namespace:,},SyncOperationResource{Group:networking.istio.io,Kind:Gateway,Name:sdk-builder-leaderelection,Namespace:,},SyncOperationResource{Group:networking.istio.io,Kind:DestinationRule,Name:sdk-mrfio-dest,Namespace:,},SyncOperationResource{Group:monitoring.coreos.com,Kind:ServiceMonitor,Name:appmetrics-sdk-builder,Namespace:,},SyncOperationResource{Group:networking.istio.io,Kind:DestinationRule,Name:sdk-builder,Namespace:,},SyncOperationResource{Group:rbac.authorization.k8s.io,Kind:Role,Name:sdk-builder-role,Namespace:,},SyncOperationResource{Group:cert-manager.io,Kind:Certificate,Name:sdk-builder.leaderelection.mrf.io,Namespace:,},SyncOperationResource{Group:cert-manager.io,Kind:Certificate,Name:sdkmrfio-https,Namespace:,},SyncOperationResource{Group:,Kind:Service,Name:sdk-builder,Namespace:,},SyncOperationResource{Group:autoscaling,Kind:HorizontalPodAutoscaler,Name:sdk-builder-autoscaler,Namespace:,},SyncOperationResource{Group:security.istio.io,Kind:AuthorizationPolicy,Name:sdk-builder,Namespace:,},},Source:nil,Manifests:[],SyncOptions:[ServerSideApply=true],Sources:[]ApplicationSource{},Revisions:[],} { true} [] {5 nil}}" application=argocd/sdk-builder argocd-application-controller-0
time="2024-04-23T11:13:28Z" level=info msg="updated 'argocd/sdk-builder' operation (phase: Running)" argocd-application-controller-0
time="2024-04-23T11:13:28Z" level=info msg=Syncing application=argocd/sdk-builder skipHooks=true started=false syncId=05403-RNPms argocd-application-controller-1
time="2024-04-23T11:13:28Z" level=info msg="Comparing app state (cluster: https://kubernetes.default.svc, namespace: sdk-builder)" application=argocd/sdk-builder argocd-application-controller-0
time="2024-04-23T11:13:28Z" level=info msg="manifest cache hit: &ApplicationSource{RepoURL:[email protected]:MyGitAccount/myNiceRepo.git,Path:./,TargetRevision:HEAD,Helm:nil,Kustomize:nil,Directory:nil,Plugin:&ApplicationSourcePlugin{Name:DecryptFromSopsForPreviewEnv,Env:[]*EnvEntry{},Parameters:[]ApplicationSourcePluginParameter{},},Chart:,Ref:,}/1e189c27189403969bd31706760e7a43c2fc833a" argocd-repo-server-667cd4fb45-6q492
time="2024-04-23T11:13:28Z" level=info msg="Processing completed" resource=argocd/sdk-builder argocd-notifications-controller-65cffbdbdd-qh8fk
time="2024-04-23T11:13:28Z" level=info msg="Start processing" resource=argocd/sdk-builder argocd-notifications-controller-65cffbdbdd-qh8fk
time="2024-04-23T11:13:28Z" level=info msg="getRepoObjs stats" application=argocd/sdk-builder build_options_ms=0 helm_ms=0 plugins_ms=0 repo_ms=0 time_ms=7 unmarshal_ms=7 version_ms=0 argocd-application-controller-0
time="2024-04-23T11:13:28Z" level=info msg="Reconciliation completed" application=argocd/sdk-builder dedup_ms=0 dest-name= dest-namespace=sdk-builder dest-server="https://kubernetes.default.svc" diff_ms=24 fields.level=1 git_ms=3635 health_ms=2 live_ms=6 settings_ms=0 sync_ms=0 time_ms=3848 argocd-application-controller-0
time="2024-04-23T11:13:28Z" level=info msg="Update successful" application=argocd/sdk-builder argocd-application-controller-0
time="2024-04-23T11:13:28Z" level=info msg="Tasks (dry-run)" application=argocd/sdk-builder syncId=05403-RNPms tasks="[Sync/0 resource /Namespace:/sdk-builder obj->nil (,,), Sync/0 resource policy/PodDisruptionBudget:sdk-builder/sdk-builder-pdb obj->nil (,,), Sync/0 resource /ServiceAccount:sdk-builder/sdk-builder obj->nil (,,), Sync/0 resource /Secret:sdk-builder/sdk-builder-env obj->nil (,,), Sync/0 resource rbac.authorization.k8s.io/Role:sdk-builder/sdk-builder-role obj->nil (,,), Sync/0 resource rbac.authorization.k8s.io/RoleBinding:sdk-builder/sdk-builder-rolebinding obj->nil (,,), Sync/0 resource /Service:sdk-builder/sdk-builder obj->nil (,,), Sync/0 resource apps/Deployment:sdk-builder/sdk-builder obj->nil (,,), Sync/0 resource autoscaling/HorizontalPodAutoscaler:sdk-builder/sdk-builder-autoscaler obj->nil (,,), Sync/0 resource monitoring.coreos.com/ServiceMonitor:sdk-builder/appmetrics-sdk-builder obj->nil (,,), Sync/0 resource security.istio.io/AuthorizationPolicy:sdk-builder/sdk-builder obj->nil (,,), Sync/0 resource networking.istio.io/DestinationRule:sdk-builder/sdk-builder obj->nil (,,), Sync/0 resource networking.istio.io/Gateway:sdk-builder/sdk-builder-cluster obj->nil (,,), Sync/0 resource cert-manager.io/Certificate:istio-system/sdk-builder-cluster-https obj->nil (,,), Sync/0 resource networking.istio.io/VirtualService:sdk-builder/sdk-builder-leaderelection obj->nil (,,), Sync/0 resource networking.istio.io/Gateway:sdk-builder/sdk-builder-leaderelection obj->nil (,,), Sync/0 resource networking.istio.io/Gateway:sdk-builder/sdk-builder-mrfio obj->nil (,,), Sync/0 resource cert-manager.io/Certificate:istio-system/sdk-builder-mrfio-https obj->nil (,,), Sync/0 resource networking.istio.io/VirtualService:sdk-builder/sdk-builder-vs obj->nil (,,), Sync/0 resource cert-manager.io/Certificate:istio-system/sdk-builder.leaderelection.mrf.io obj->nil (,,), Sync/0 resource networking.istio.io/ServiceEntry:sdk-builder/sdk-mrf-io-cloudfront obj->nil (,,), Sync/0 resource networking.istio.io/DestinationRule:sdk-builder/sdk-mrfio-dest obj->nil (,,), Sync/0 resource networking.istio.io/Gateway:sdk-builder/sdkmrfio-gw obj->nil (,,), Sync/0 resource cert-manager.io/Certificate:istio-system/sdkmrfio-https obj->nil (,,)]" argocd-application-controller-1
time="2024-04-23T11:13:28Z" level=info msg="Updating operation state. phase: Running -> Running, message: '' -> 'one or more tasks are running'" application=argocd/sdk-builder syncId=05403-RNPms argocd-application-controller-1
time="2024-04-23T11:13:28Z" level=info msg="Adding resource result, status: 'Pruned', phase: 'Succeeded', message: 'pruned'" application=argocd/sdk-builder kind=Role name=sdk-builder-role namespace=sdk-builder phase=Sync syncId=05403-RNPms argocd-application-controller-1
time="2024-04-23T11:13:28Z" level=info msg="Comparing app state (cluster: https://kubernetes.default.svc, namespace: sdk-builder)" application=argocd/sdk-builder argocd-application-controller-0
time="2024-04-23T11:13:28Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argocd/sdk-builder argocd-application-controller-0
time="2024-04-23T11:13:28Z" level=info msg="Adding resource result, status: 'Pruned', phase: 'Succeeded', message: 'pruned'" application=argocd/sdk-builder kind=DestinationRule name=sdk-builder namespace=sdk-builder phase=Sync syncId=05403-RNPms argocd-application-controller-1
time="2024-04-23T11:13:28Z" level=info msg="Adding resource result, status: 'Pruned', phase: 'Succeeded', message: 'pruned'" application=argocd/sdk-builder kind=ServiceAccount name=sdk-builder namespace=sdk-builder phase=Sync syncId=05403-RNPms argocd-application-controller-1
time="2024-04-23T11:13:28Z" level=info msg="manifest cache hit: &ApplicationSource{RepoURL:[email protected]:MyGitAccount/myNiceRepo.git,Path:./,TargetRevision:HEAD,Helm:nil,Kustomize:nil,Directory:nil,Plugin:&ApplicationSourcePlugin{Name:DecryptFromSopsForPreviewEnv,Env:[]*EnvEntry{},Parameters:[]ApplicationSourcePluginParameter{},},Chart:,Ref:,}/1e189c27189403969bd31706760e7a43c2fc833a" argocd-repo-server-667cd4fb45-89pb2
time="2024-04-23T11:13:28Z" level=info msg="Adding resource result, status: 'Pruned', phase: 'Succeeded', message: 'pruned'" application=argocd/sdk-builder kind=RoleBinding name=sdk-builder-rolebinding namespace=sdk-builder phase=Sync syncId=05403-RNPms argocd-application-controller-1
time="2024-04-23T11:13:28Z" level=info msg="getRepoObjs stats" application=argocd/sdk-builder build_options_ms=0 helm_ms=0 plugins_ms=0 repo_ms=0 time_ms=8 unmarshal_ms=7 version_ms=0 argocd-application-controller-0
time="2024-04-23T11:13:28Z" level=info msg="Adding resource result, status: 'Pruned', phase: 'Succeeded', message: 'pruned'" application=argocd/sdk-builder kind=Certificate name=sdk-builder-mrfio-https namespace=istio-system phase=Sync syncId=05403-RNPms argocd-application-controller-1
time="2024-04-23T11:13:28Z" level=info msg="Adding resource result, status: 'Pruned', phase: 'Succeeded', message: 'pruned'" application=argocd/sdk-builder kind=Certificate name=sdkmrfio-https namespace=istio-system phase=Sync syncId=05403-RNPms argocd-application-controller-1
time="2024-04-23T11:13:28Z" level=info msg="Adding resource result, status: 'Pruned', phase: 'Succeeded', message: 'pruned'" application=argocd/sdk-builder kind=Gateway name=sdkmrfio-gw namespace=sdk-builder phase=Sync syncId=05403-RNPms argocd-application-controller-1
time="2024-04-23T11:13:28Z" level=info msg=Syncing application=argocd/sdk-builder skipHooks=true started=false syncId=05399-kEkVM argocd-application-controller-0

After that the application enters an inestable status where argo is not able to prune all the resources but, at the same time, the deployment is pruned so, not able to respond to requests until we force a manual sync with Argo. We have some questions that we are not able to answer:

  • we have the same setup, same version and same application on 5 different k8s clusters and we never had the same problem simultaneously (or close in time) on 2 clusters.
  • we have timeout.reconciliation and timeout.reconciliation.jitter without setting so with the default values and we see lots of status checks. Is this normal? what triggers a status check?
  • we are seeing both application-controller checking for the status at the same time. Is that correct? Does it exist any way to not duplicate the effort without impacting HA?
  • We are using an argocd plugin. Could this have an impact? As far as we tested, we are unable to, from the same deployment repo, generate different outputs so, we assume that this couldn't impact on the problem. Nevertheless, there exists no way to log this in terms of argocd (we are thinking of putting a log on the plugin that guarantees that the result is always the same just to see edge cases).
  • does there exist any other information that we can add to help resolve this bug?

Finally, we are using a quite old ArgoCD version (2.6.6) and we are planning to update during the next weeks, even though, we haven't been able to find any bug fix that could address this problem. Does it exist? We have attached the full logs if it could help (argocd.csv).

ArgoCD version:

argocd: v2.6.6+6d4de2e
  BuildDate: 2023-03-16T22:25:45Z
  GitCommit: 6d4de2ec5d49fa2c6823f2b7d101607a839be3fa
  GitTreeState: clean
  GoVersion: go1.18.10
  Compiler: gc
  Platform: linux/amd64

Many thanks for your help

joantomas avatar Apr 25 '24 07:04 joantomas

Very sorry this happened, it was extremely destructive for us. I feel a lot less crazy now that someone else experienced it.

Similar to what you alluded to, the only semi-explanation I could come up with was redis returning empty cached manifests. Idk if it was caused by the repo server storing empty manifests because it swallowed an error while rendering or if something failed while storing the manifests in redis.

I also cannot explain how the "don't prune everything check" was bypassed.

rumstead avatar Apr 26 '24 15:04 rumstead

To add to this, we have seen this bug happen in 2.10.7. For our situation, Argo showed a task dry-run with all cluster scoped resources managed by the Argo App as having a target of nil, but the non-cluster scoped (namespaced) resources rendered just fine. So, Argo pruned the cluster scoped resources like crds, cluster-role bindings, cluster roles

Note, that we have progressive sync, aka rollingsync, enabled and this app is controlled by an appset that is using that feature. What was weird is that this synced an app that shouldn't have been in an active sync phase due to us having a manual gate to that phase. We had synced an earlier phase, so again the phase that synced and pruned shouldn't have even started. The app that synced also wasn't in an out of sync state before.

ericblackburn avatar May 07 '24 01:05 ericblackburn