ArgoCD fails on hook-finalizer (after upgrade to 3.x)
Checklist:
- [X] I've searched in the docs and FAQ for my answer: https://bit.ly/argocd-faq.
- [X] I've included steps to reproduce the bug.
- [X] I've pasted the output of
argocd version.
Describe the bug
I have a Job (db migrations) with hook-weight == "-5" and another resource, let's say ConfigMap with hook-weight == "-10". Job is long-running task, after the job finishes it seems that argocd can't properly finalize the sync process: it successfully removes CM but cant remove Job and eventually fails by timeout (context deadline).
This behavior is actual since I've upgraded to 3.x from 2.x
If I manually remove the finalizer from the Job, the synchronization process completes OK.
To Reproduce
Create Job with hook-weight == "-5", CM hook-weight == "-10". Sync.
Expected behavior
ArgoCD should properly handle finalizer / helm-hooks
Screenshots
Version
v3.0.3+a14b012
I have the same issue on 3.0.5. I have it on a ConfigMap that has helm.sh/hook-weight: "-100" but also on a Job that doesn't have hook-weight set but does have argocd.argoproj.io/hook-delete-policy: BeforeHookCreation
In both cases the finalizer is argocd.argoproj.io/hook-finalizer
Any updates on this? I heavily rely on spinning up my environment with PreSync hooks to populate DB with data.
I'll take a look at this
I'm encountering the same issue. I held off on upgrading to v3 for most clusters—only one of my clusters is currently running v3, and it's the only one affected. All clusters share the same configuration, so this seems specific to v3.
I can't reproduce this issue
here yamls I used
apiVersion: batch/v1
kind: Job
metadata:
name: test
namespace: default
annotations:
helm.sh/hook: pre-install,pre-upgrade
helm.sh/hook-weight: "-5"
spec:
template:
spec:
restartPolicy: Never
containers:
- name: main
image: busybox
command: ["sh", "-c", "echo 'hello' && sleep 10"]
---
apiVersion: v1
kind: ConfigMap
metadata:
name: test
namespace: default
annotations:
helm.sh/hook: pre-install,pre-upgrade
helm.sh/hook-weight: "-10"
data:
MESSAGE: "hello-from-configmap"
finalizer was removed well after job is completed
(ignore createTimestamp)
Hi @kingbj940429, thanks for taking a look at this.
We're using the following Argo CD annotations:
argocd.argoproj.io/hook: PreSync
argocd.argoproj.io/hook-delete-policy: HookSucceeded
argocd.argoproj.io/sync-options: Force=true,Replace=true
argocd.argoproj.io/sync-wave: '-200'
I’ve been able to consistently reproduce the issue. It appears to occur specifically when a sync is attempted after a job failure.
To reproduce:
- Trigger a sync and let the job start.
- Terminate the sync while the job is still running.
- Attempt to sync again.
In v2.14, this would result in everything being deleted and re-synced. However, in the current behavior, the sync simply hangs and does not progress.
Let me know if I can provide more details!
Hi @isarns thanks for your information
after pod by job is completed, job wasn't removed because of job's finalizer argocd.argoproj.io/hook-finalizer
so syncing status is continued
this Issue is right?
If right, this issue occurs when use replace/force together as sync option like below
apiVersion: batch/v1
kind: Job
metadata:
name: test
namespace: default
annotations:
helm.sh/hook: pre-upgrade
argocd.argoproj.io/sync-options: Force=true,Replace=true #-- or check replace, force at Sync UI
spec:
template:
spec:
restartPolicy: Never
containers:
- name: main
image: busybox
command: ["sh", "-c", "echo 'hello' && sleep 3"]
Yes, that's correct. We use Force and Replace for the hook.
@kingbj940429 yes this is the issue we encounter
~I took a close look at the code, and this looks more like a bug in how Helm Template works rather than anything related to argocd.argoproj.io/sync-options: Force=true,Replace=true~
~the same issue happens even when those sync-options aren’t set~
~Looking at the code, it uses the helm template command to render the YAML, and of course, the rendered YAML doesn’t include any finalizers~
https://github.com/argoproj/argo-cd/blob/414d9eb5dba31697f8782edeb54c7f4410623609/util/helm/cmd.go#L433-L433
~created like this, there is no finalizer field because I didn't specify finalizer in helm~
apiVersion: batch/v1
kind: Job
metadata:
annotations:
argocd.argoproj.io/tracking-id: test:batch/Job:default/job-test
helm.sh/hook: pre-upgrade
labels:
batch.kubernetes.io/controller-uid: 98cc82ed-86f8-4e9d-93b2-79de6cfcfd9c
batch.kubernetes.io/job-name: job-test
controller-uid: 98cc82ed-86f8-4e9d-93b2-79de6cfcfd9c
job-name: job-test
name: job-test
namespace: default
~But when the code below runs, since the YAML rendered by helm template command doesn’t have the finalizer, it doesn’t remove it—instead, it ends up adding it.~
https://github.com/argoproj/gitops-engine/blob/093aef0dad58015619479509f0d2ac71cc9cefd7/pkg/sync/sync_context.go#L789-L788
~so a job created by helm is not deleted~
@kingbj940429 Thank you for checking this out!
But we are not using helm to deploy the jobs. and still get the same issue. as I mentioned before the job/secret are deployed with
argocd.argoproj.io/hook: PreSync
argocd.argoproj.io/hook-delete-policy: HookSucceeded
argocd.argoproj.io/sync-options: Force=true,Replace=true
argocd.argoproj.io/sync-wave: '-200' # depends on the resource
We are seeing this issue as well, we however are not using Force=true or Replace=true, but we do have argocd.argoproj.io/hook-delete-policy: BeforeHookCreation
@crenshaw-dev
Could I get some help with this issue? I’ve tried a few things for fixing, but I’m not sure what else to do.
Many people are experiencing this issue.
Hi team,
Could we please get an update on this issue? It's currently impacting our workflow, and we’re evaluating the possibility of downgrading as a workaround.
Your guidance would be appreciated—thanks!
Hi,
We are having the same problem! is there any solution available for this problem?
Thanks
I haven't had time to drive into possible causes yet, but looks like we have enough information to write e2e tests producing at least two failure scenarios.
If anyone has time to set up those tests and open a draft PR to reproduce the issues, that would be greatly appreciated!
@crenshaw-dev
On it! I’ll draft the e2e tests and open a draft PR, just point me to the test harness or docs and I’ll dive in. Thanks!
Thanks! The docs here are reasonably up to date: https://github.com/argoproj/argo-cd/blob/master/test/e2e/hook_test.go
And you can probably copy/edit a test here: https://github.com/argoproj/argo-cd/blob/master/test/e2e/hook_test.go
@crenshaw-dev
I’m running into an issue: the end-to-end test that requires a repository is failing with:
InvalidSpecError: repository not accessible
It looks like something’s broken on my side—could someone else take over this test, or point me toward any missing config steps I might’ve overlooked? Thanks!
This consistently reproduces the issue: https://github.com/argoproj/argo-cd/pull/23697
The sync weight wasn't necessary, and the additional ConfigMap hook wasn't necessary.
We should fix the bug, but also... is Force=true,Replace=true actually useful here? The job gets deleted before a new sync starts, so there should never be an existing resource to replace.
The developers in my org don't have delete permission. So when for example the migration job fails they patch the code and deploy it.
Before adding those annotations the sync would fail because a job's image is immutable.
And with the annotations its recreated.
I don't understand how a PreSync hook with deletion policy BeforeHookCreation (the default) could ever encounter an update operation. Shouldn't the failed Job always be deleted and then the new configuration created?
@crenshaw-dev in our case, it was a faulty Kyverno mutating policy that wasn't configured to look only at create events
@PatTheSilent the faulty policy was causing the finalizer to not be removed, or it was causing you to need Force=true on the hook resource?
@crenshaw-dev I couldn’t reproduce this at first until I realized the team only checked the migrator job after the sync failed. They bumped the image tag, pushed a new release, and then ArgoCD hit this:
error when patching "/dev/shm/3141082373": Job.batch "migrator" is invalid: spec.template: Invalid value: core.PodTemplateSpec{…}: field is immutable
Reproduction steps:
- Apply the migrator setup below
- Let ArgoCD run the PreSync hook and fail
- Change the tag in
job/kustomization.yamland push - Watch the sync error about the immutable field
File Structure
migrator/
├── hello-world.yaml # Test job - Alpine container
├── job/
│ ├── job.yaml # Main migrator job
│ └── kustomization.yaml # Uses alpine:3.22 image
├── kustomization.yaml # Combines both jobs
└── argocd-application-config.json
File Contents
hello-world.yaml
apiVersion: batch/v1
kind: Job
metadata:
name: hello-world
spec:
template:
spec:
containers:
- name: hello-world
image: alpine:latest
restartPolicy: Never
job/job.yaml
apiVersion: batch/v1
kind: Job
metadata:
name: migrator
spec:
backoffLimit: 1
template:
spec:
containers:
- name: migrator
image: CHANGEME
command:
- /bin/sh
- -c
- echo "Bacon pancakes Making Bacon Pancakes Take some bacon and ..." && sleep 1 && exit 1
restartPolicy: Never
job/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- job.yaml
images:
- name: CHANGEME
newName: alpine
newTag: "3.22" # change this to 3.21 and vice versa
commonAnnotations:
argocd.argoproj.io/hook: PreSync
argocd.argoproj.io/hook-delete-policy: HookSucceeded
kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- hello-world.yaml
- job
argocd-application-config.json
{
"app_name": "migrator",
"namespace": "application",
"revision": "HEAD"
}
Since a dev without delete permission can’t clear the old job, we added:
argocd.argoproj.io/sync-options: Force=true,Replace=true
@PatTheSilent the faulty policy was causing the finalizer to not be removed, or it was causing you to need
Force=trueon the hook resource?
@crenshaw-dev It was causing the finalizer to not be removed. Each patch was going through Kyverno, which wanted to modify the job's spec, and it got rejected as the spec is immutable, so the patch that wanted to remove the finalizer was getting rejected.
@isarns it looks to me as if the more important bug is that Argo CD isn't deleting the hook resource before creating a new one. What happens if you use argocd.argoproj.io/hook-delete-policy: BeforeCreation instead of HookSucceeded?
@PatTheSilent do you have details about what fields Argo CD was trying to patch besides the finalizer? That sounds like a separate bug. The patch should be literally just the finalizer field.
Edit: It looks like the finalizer is being removed via an Update operation instead of a Patch operation. Is the Kyverno hook just rejecting all Updates?
@crenshaw-dev It was Kyverno that was trying to modify the Job's spec rather than ArgoCD. ArgoCD merely wanted to remove the hook but the hook removal was being rejected because Kyverno was modifying it with a mutation hook
mutate: {
patchStrategicMerge: {
spec: {
template: {
metadata: {
annotations: {
'cluster-autoscaler.kubernetes.io/safe-to-evict': 'false',
},
},
},
},
},
}
It was completely our fault, not blaming ArgoCD here. Just leaving this as someone's problem here might be caused by the same thing. In our case, we changed the policy to mutate Pods rather than Jobs.
Gotcha, thanks for clarifying!
So setting aside the problematic Kyverno hook, I think we're left with a hook whose finalizer removal is broken by Replace=true;Force=true, but it's not clear that we should even need force-replace in the first place.
I share my same experience, job deploy by helm. With Replace=true;Force=true because without, we got the "immutable error" when the job content change.
I also have kyverno policy, I will check it but this was working before updating to v3.