argo-cd icon indicating copy to clipboard operation
argo-cd copied to clipboard

ArgoCD fails on hook-finalizer (after upgrade to 3.x)

Open ep4sh opened this issue 6 months ago • 61 comments

Checklist:

  • [X] I've searched in the docs and FAQ for my answer: https://bit.ly/argocd-faq.
  • [X] I've included steps to reproduce the bug.
  • [X] I've pasted the output of argocd version.

Describe the bug

I have a Job (db migrations) with hook-weight == "-5" and another resource, let's say ConfigMap with hook-weight == "-10". Job is long-running task, after the job finishes it seems that argocd can't properly finalize the sync process: it successfully removes CM but cant remove Job and eventually fails by timeout (context deadline).

This behavior is actual since I've upgraded to 3.x from 2.x

If I manually remove the finalizer from the Job, the synchronization process completes OK.

Image

To Reproduce

Create Job with hook-weight == "-5", CM hook-weight == "-10". Sync.

Expected behavior

ArgoCD should properly handle finalizer / helm-hooks

Screenshots

Version

v3.0.3+a14b012

ep4sh avatar Jun 02 '25 16:06 ep4sh

I have the same issue on 3.0.5. I have it on a ConfigMap that has helm.sh/hook-weight: "-100" but also on a Job that doesn't have hook-weight set but does have argocd.argoproj.io/hook-delete-policy: BeforeHookCreation

In both cases the finalizer is argocd.argoproj.io/hook-finalizer

lonnix avatar Jun 05 '25 16:06 lonnix

Any updates on this? I heavily rely on spinning up my environment with PreSync hooks to populate DB with data.

yunasc avatar Jun 14 '25 07:06 yunasc

I'll take a look at this

kingbj940429 avatar Jun 16 '25 12:06 kingbj940429

I'm encountering the same issue. I held off on upgrading to v3 for most clusters—only one of my clusters is currently running v3, and it's the only one affected. All clusters share the same configuration, so this seems specific to v3.

isarns avatar Jun 17 '25 12:06 isarns

I can't reproduce this issue

here yamls I used

apiVersion: batch/v1
kind: Job
metadata:
  name: test
  namespace: default
  annotations:
    helm.sh/hook: pre-install,pre-upgrade
    helm.sh/hook-weight: "-5"
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: main
          image: busybox
          command: ["sh", "-c", "echo 'hello' && sleep 10"]
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: test
  namespace: default
  annotations:
    helm.sh/hook: pre-install,pre-upgrade
    helm.sh/hook-weight: "-10"
data:
  MESSAGE: "hello-from-configmap"

finalizer was removed well after job is completed (ignore createTimestamp) Image

Image

kingbj940429 avatar Jun 17 '25 16:06 kingbj940429

Hi @kingbj940429, thanks for taking a look at this.

We're using the following Argo CD annotations:

argocd.argoproj.io/hook: PreSync
argocd.argoproj.io/hook-delete-policy: HookSucceeded
argocd.argoproj.io/sync-options: Force=true,Replace=true
argocd.argoproj.io/sync-wave: '-200'

I’ve been able to consistently reproduce the issue. It appears to occur specifically when a sync is attempted after a job failure.

To reproduce:

  1. Trigger a sync and let the job start.
  2. Terminate the sync while the job is still running.
  3. Attempt to sync again.

In v2.14, this would result in everything being deleted and re-synced. However, in the current behavior, the sync simply hangs and does not progress.

Let me know if I can provide more details!

isarns avatar Jun 18 '25 07:06 isarns

Hi @isarns thanks for your information

after pod by job is completed, job wasn't removed because of job's finalizer argocd.argoproj.io/hook-finalizer

so syncing status is continued

this Issue is right? Image

If right, this issue occurs when use replace/force together as sync option like below

apiVersion: batch/v1
kind: Job
metadata:
  name: test
  namespace: default
  annotations:
    helm.sh/hook: pre-upgrade
    argocd.argoproj.io/sync-options: Force=true,Replace=true #-- or check replace, force at Sync UI
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: main
          image: busybox
          command: ["sh", "-c", "echo 'hello' && sleep 3"]

kingbj940429 avatar Jun 18 '25 15:06 kingbj940429

Yes, that's correct. We use Force and Replace for the hook.

yunasc avatar Jun 18 '25 20:06 yunasc

@kingbj940429 yes this is the issue we encounter

isarns avatar Jun 19 '25 14:06 isarns

~I took a close look at the code, and this looks more like a bug in how Helm Template works rather than anything related to argocd.argoproj.io/sync-options: Force=true,Replace=true~

~the same issue happens even when those sync-options aren’t set~

~Looking at the code, it uses the helm template command to render the YAML, and of course, the rendered YAML doesn’t include any finalizers~ https://github.com/argoproj/argo-cd/blob/414d9eb5dba31697f8782edeb54c7f4410623609/util/helm/cmd.go#L433-L433

~created like this, there is no finalizer field because I didn't specify finalizer in helm~

apiVersion: batch/v1
kind: Job
metadata:
  annotations:
    argocd.argoproj.io/tracking-id: test:batch/Job:default/job-test
    helm.sh/hook: pre-upgrade
  labels:
    batch.kubernetes.io/controller-uid: 98cc82ed-86f8-4e9d-93b2-79de6cfcfd9c
    batch.kubernetes.io/job-name: job-test
    controller-uid: 98cc82ed-86f8-4e9d-93b2-79de6cfcfd9c
    job-name: job-test
  name: job-test
  namespace: default

~But when the code below runs, since the YAML rendered by helm template command doesn’t have the finalizer, it doesn’t remove it—instead, it ends up adding it.~

https://github.com/argoproj/gitops-engine/blob/093aef0dad58015619479509f0d2ac71cc9cefd7/pkg/sync/sync_context.go#L789-L788

~so a job created by helm is not deleted~

kingbj940429 avatar Jun 22 '25 09:06 kingbj940429

@kingbj940429 Thank you for checking this out!

But we are not using helm to deploy the jobs. and still get the same issue. as I mentioned before the job/secret are deployed with

argocd.argoproj.io/hook: PreSync
argocd.argoproj.io/hook-delete-policy: HookSucceeded
argocd.argoproj.io/sync-options: Force=true,Replace=true
argocd.argoproj.io/sync-wave: '-200' # depends on the resource 

isarns avatar Jun 22 '25 10:06 isarns

We are seeing this issue as well, we however are not using Force=true or Replace=true, but we do have argocd.argoproj.io/hook-delete-policy: BeforeHookCreation

ssoriche avatar Jun 25 '25 21:06 ssoriche

@crenshaw-dev

Could I get some help with this issue? I’ve tried a few things for fixing, but I’m not sure what else to do.

Many people are experiencing this issue.

kingbj940429 avatar Jun 29 '25 05:06 kingbj940429

Hi team,

Could we please get an update on this issue? It's currently impacting our workflow, and we’re evaluating the possibility of downgrading as a workaround.

Your guidance would be appreciated—thanks!

isarns avatar Jun 30 '25 15:06 isarns

Hi,

We are having the same problem! is there any solution available for this problem?

Thanks

sattar-sattari avatar Jul 01 '25 09:07 sattar-sattari

I haven't had time to drive into possible causes yet, but looks like we have enough information to write e2e tests producing at least two failure scenarios.

If anyone has time to set up those tests and open a draft PR to reproduce the issues, that would be greatly appreciated!

crenshaw-dev avatar Jul 01 '25 12:07 crenshaw-dev

@crenshaw-dev

On it! I’ll draft the e2e tests and open a draft PR, just point me to the test harness or docs and I’ll dive in. Thanks!

isarns avatar Jul 03 '25 13:07 isarns

Thanks! The docs here are reasonably up to date: https://github.com/argoproj/argo-cd/blob/master/test/e2e/hook_test.go

And you can probably copy/edit a test here: https://github.com/argoproj/argo-cd/blob/master/test/e2e/hook_test.go

crenshaw-dev avatar Jul 03 '25 14:07 crenshaw-dev

@crenshaw-dev

I’m running into an issue: the end-to-end test that requires a repository is failing with:

InvalidSpecError: repository not accessible

It looks like something’s broken on my side—could someone else take over this test, or point me toward any missing config steps I might’ve overlooked? Thanks!

isarns avatar Jul 08 '25 17:07 isarns

This consistently reproduces the issue: https://github.com/argoproj/argo-cd/pull/23697

The sync weight wasn't necessary, and the additional ConfigMap hook wasn't necessary.

We should fix the bug, but also... is Force=true,Replace=true actually useful here? The job gets deleted before a new sync starts, so there should never be an existing resource to replace.

crenshaw-dev avatar Jul 08 '25 18:07 crenshaw-dev

The developers in my org don't have delete permission. So when for example the migration job fails they patch the code and deploy it.

Before adding those annotations the sync would fail because a job's image is immutable.

And with the annotations its recreated.

isarns avatar Jul 09 '25 18:07 isarns

I don't understand how a PreSync hook with deletion policy BeforeHookCreation (the default) could ever encounter an update operation. Shouldn't the failed Job always be deleted and then the new configuration created?

crenshaw-dev avatar Jul 09 '25 18:07 crenshaw-dev

@crenshaw-dev in our case, it was a faulty Kyverno mutating policy that wasn't configured to look only at create events

PatTheSilent avatar Jul 11 '25 13:07 PatTheSilent

@PatTheSilent the faulty policy was causing the finalizer to not be removed, or it was causing you to need Force=true on the hook resource?

crenshaw-dev avatar Jul 11 '25 14:07 crenshaw-dev

@crenshaw-dev I couldn’t reproduce this at first until I realized the team only checked the migrator job after the sync failed. They bumped the image tag, pushed a new release, and then ArgoCD hit this:

error when patching "/dev/shm/3141082373": Job.batch "migrator" is invalid: spec.template: Invalid value: core.PodTemplateSpec{…}: field is immutable

Reproduction steps:

  1. Apply the migrator setup below
  2. Let ArgoCD run the PreSync hook and fail
  3. Change the tag in job/kustomization.yaml and push
  4. Watch the sync error about the immutable field

File Structure

migrator/
├── hello-world.yaml          # Test job - Alpine container
├── job/
│   ├── job.yaml              # Main migrator job 
│   └── kustomization.yaml    # Uses alpine:3.22 image
├── kustomization.yaml        # Combines both jobs
└── argocd-application-config.json

File Contents

hello-world.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: hello-world
spec:
  template:
    spec:
      containers:
        - name: hello-world
          image: alpine:latest
      restartPolicy: Never
job/job.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: migrator
spec:
  backoffLimit: 1
  template:
    spec:
      containers:
        - name: migrator
          image: CHANGEME
          command: 
            - /bin/sh
            - -c
            - echo "Bacon pancakes Making Bacon Pancakes Take some bacon and ..." && sleep 1 && exit 1
      restartPolicy: Never
job/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

resources:
  - job.yaml

images:
  - name: CHANGEME
    newName: alpine
    newTag: "3.22" # change this to 3.21 and vice versa

commonAnnotations:
  argocd.argoproj.io/hook: PreSync
  argocd.argoproj.io/hook-delete-policy: HookSucceeded
kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

resources:
  - hello-world.yaml
  - job
argocd-application-config.json
{
  "app_name": "migrator",
  "namespace": "application",
  "revision": "HEAD"
}

Since a dev without delete permission can’t clear the old job, we added:

argocd.argoproj.io/sync-options: Force=true,Replace=true

isarns avatar Jul 13 '25 22:07 isarns

@PatTheSilent the faulty policy was causing the finalizer to not be removed, or it was causing you to need Force=true on the hook resource?

@crenshaw-dev It was causing the finalizer to not be removed. Each patch was going through Kyverno, which wanted to modify the job's spec, and it got rejected as the spec is immutable, so the patch that wanted to remove the finalizer was getting rejected.

PatTheSilent avatar Jul 16 '25 12:07 PatTheSilent

@isarns it looks to me as if the more important bug is that Argo CD isn't deleting the hook resource before creating a new one. What happens if you use argocd.argoproj.io/hook-delete-policy: BeforeCreation instead of HookSucceeded?

@PatTheSilent do you have details about what fields Argo CD was trying to patch besides the finalizer? That sounds like a separate bug. The patch should be literally just the finalizer field.

Edit: It looks like the finalizer is being removed via an Update operation instead of a Patch operation. Is the Kyverno hook just rejecting all Updates?

crenshaw-dev avatar Jul 16 '25 15:07 crenshaw-dev

@crenshaw-dev It was Kyverno that was trying to modify the Job's spec rather than ArgoCD. ArgoCD merely wanted to remove the hook but the hook removal was being rejected because Kyverno was modifying it with a mutation hook

mutate: {
  patchStrategicMerge: {
    spec: {
      template: {
        metadata: {
          annotations: {
            'cluster-autoscaler.kubernetes.io/safe-to-evict': 'false',
          },
        },
      },
    },
  },
}

It was completely our fault, not blaming ArgoCD here. Just leaving this as someone's problem here might be caused by the same thing. In our case, we changed the policy to mutate Pods rather than Jobs.

PatTheSilent avatar Jul 17 '25 09:07 PatTheSilent

Gotcha, thanks for clarifying!

So setting aside the problematic Kyverno hook, I think we're left with a hook whose finalizer removal is broken by Replace=true;Force=true, but it's not clear that we should even need force-replace in the first place.

crenshaw-dev avatar Jul 17 '25 14:07 crenshaw-dev

I share my same experience, job deploy by helm. With Replace=true;Force=true because without, we got the "immutable error" when the job content change.

I also have kyverno policy, I will check it but this was working before updating to v3.

mysiki avatar Jul 17 '25 19:07 mysiki