Argo CD considers `PreSync` phase finished even though the Job was just created
Checklist:
- [x] I've searched in the docs and FAQ for my answer: https://bit.ly/argocd-faq.
- [ ] I've included steps to reproduce the bug.
- [x] I've pasted the output of `argocd version`.
Describe the bug
I've noticed a weird behavior blocking our deployments every now and then (a couple of times a day), where Argo CD considers the `PreSync` hook finished even though the `Job` was just created. The `Job`'s `Pod` then hangs waiting for a `Secret` (created through an `ExternalSecret`) that was deleted because `PreSync` had supposedly finished.
I've asked about it on Slack at https://cloud-native.slack.com/archives/C01TSERG0KZ/p1658409808237339, but didn't find the cause.
Example timeline
- 10:36:54 Argo CD created the ExternalSecret, ESO created a Secret
- 10:37:02 Argo CD deleted the previous Job
- 10:37:04 Argo CD created a new Job
- 10:37:07 Argo CD deleted the ExternalSecret
- the Secret was garbage-collected, then ESO recreated it for a split second
- 14:12:05 Argo CD cleaned up the Job (I might have terminated the Sync manually)
Extra info:
- the Pod (and Job) got stuck indefinitely waiting for the Secret
- I'm on Argo CD 2.3.3
- the ExternalSecret is created in wave -2, the Job at 20 (hook wiring sketched below)
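For context, the hook wiring on the `ExternalSecret` looks roughly like this (a simplified sketch, not our real manifest: the API version, store reference, key, and delete policy are placeholder values; only the hook mapping and weight match our setup):

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: fa-app-hooks
  annotations:
    # Same Helm hook mapping as the Job below, but at weight -2 so the
    # Secret exists before the migration Job starts.
    helm.sh/hook: pre-install,pre-upgrade
    helm.sh/hook-weight: "-2"
    # Placeholder: deleted once the hook succeeds, which would match the
    # observed deletion right after PreSync was (wrongly) marked finished.
    helm.sh/hook-delete-policy: hook-succeeded
spec:
  refreshInterval: 1m
  secretStoreRef:
    name: cluster-store          # placeholder store name
    kind: ClusterSecretStore
  target:
    name: fa-app-hooks           # the Secret the Job consumes via secretRef
  dataFrom:
    - extract:
        key: fa-app/hooks        # placeholder provider key
```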
Seems more or less related to:
- https://github.com/argoproj/argo-cd/issues/9861
- https://github.com/argoproj/argo-cd/issues/4384#issuecomment-696896836
To Reproduce
No idea; it happens randomly, in the ballpark of once every 10 deployments.
Expected behavior
Argo CD does not consider the `PreSync` phase finished until the `Job` has started and completed (by either success or failure).
Version
`image: quay.io/argoproj/argocd:v2.3.3` (not sure why `argocd version` reports 2.4.4; that's the CLI's version).
Logs
A Google Sheet gathering the following information:
- Kubernetes API server logs:
  - all logs related to the application
  - filtered down to creations & modifications
  - filtered down to create/delete
  - the extracted suspicious part of the logs
- an example ExternalSecret manifest
- Application Controller logs
Job manifest
```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: fa-app-migrate-kt
  labels:
    app.kubernetes.io/instance: fa-app
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/version: "git-22eREDACTED0d5"
    helm.sh/chart: app-REDACTED
    tags.datadoghq.com/env: "pr-prnum-ns"
    tags.datadoghq.com/fa-app-migrate-kt.env: "pr-prnum-ns"
    tags.datadoghq.com/service: "fa-app-migrate-kt"
    tags.datadoghq.com/version: "git-22eREDACTED0d5"
    tags.datadoghq.com/fa-app-migrate-kt.service: "fa-app-migrate-kt"
    tags.datadoghq.com/fa-app-migrate-kt.version: "git-22eREDACTED0d5"
    image-tag: "git-22eREDACTED0d5"
    job-type: init
  annotations:
    helm.sh/hook: pre-install,pre-upgrade
    helm.sh/hook-delete-policy: before-hook-creation
    argocd.argoproj.io/hook-delete-policy: BeforeHookCreation
    helm.sh/hook-weight: "20"
spec:
  backoffLimit: 5
  template:
    metadata:
      labels:
        job-type: init
        app.kubernetes.io/instance: fa-app
        app.kubernetes.io/managed-by: Helm
        app.kubernetes.io/version: "git-22eREDACTED0d5"
        helm.sh/chart: app-REDACTED
        tags.datadoghq.com/env: "pr-prnum-ns"
        tags.datadoghq.com/fa-app-migrate-kt.env: "pr-prnum-ns"
        tags.datadoghq.com/service: "fa-app-migrate-kt"
        tags.datadoghq.com/version: "git-22eREDACTED0d5"
        tags.datadoghq.com/fa-app-migrate-kt.service: "fa-app-migrate-kt"
        tags.datadoghq.com/fa-app-migrate-kt.version: "git-22eREDACTED0d5"
      annotations:
        cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
        ad.datadoghq.com/fa-app-migrate-kt-copy.logs: "[{\n \"source\": \"km-jobs\"\n}]\n"
        ad.datadoghq.com/fa-app-migrate-ktm.logs: "[{\n \"source\": \"km-jobs\"\n}]\n"
    spec:
      restartPolicy: Never
      initContainers:
        - name: fa-app-migrate-kt-copy
          image: REDACTED/fa-app:git-22eREDACTED0d5
          command: [REDACTED]
          env:
            - name: APP_NAME
              value: fa-app-migrate-kt
          envFrom:
            - configMapRef:
                name: fa-app-hooks
            - secretRef:
                name: fa-app-hooks
          securityContext:
            allowPrivilegeEscalation: false
          volumeMounts:
            - name: kc
              mountPath: /kc
      containers:
        - name: fa-app-migrate-ktm
          image: "REDACTED/km:git-522c04babb111b298d6a897cf12960eb35868082"
          command: [REDACTED]
          env:
            - name: APP_NAME
              value: fa-app-migrate-kt
            - name: KT_FILE
              value: kc/kt.yaml
          envFrom:
            - configMapRef:
                name: fa-app-hooks
            - secretRef:
                name: fa-app-hooks
          resources:
            limits:
              cpu: 1
              memory: 512Mi
            requests:
              cpu: 0.5
              memory: 512Mi
          securityContext:
            allowPrivilegeEscalation: false
          volumeMounts:
            - name: kc
              mountPath: /usr/src/app/kc
      nodeSelector:
      tolerations:
      volumes:
        - name: kc
          emptyDir: {}
```
I've tracked down some of the responsible code, but found nothing wrong with it:
- https://github.com/argoproj/gitops-engine/blob/v0.6.2/pkg/sync/sync_context.go#L611-L621
- https://github.com/argoproj/gitops-engine/blob/v0.6.2/pkg/sync/sync_context.go#L277-L300
- https://github.com/argoproj/gitops-engine/blob/v0.6.2/pkg/health/health_job.go
There is also the Kubernetes-side code for our version (EKS 1.20.5):
- Pod Phase logic https://github.com/kubernetes/kubernetes/blob/6b1d87acf3c8253c123756b9e61dac642678305f/pkg/kubelet/kubelet_pods.go#L1378-L1378
- Job Completed logic https://github.com/kubernetes/kubernetes/blob/6b1d87acf3c8253c123756b9e61dac642678305f/pkg/controller/job/job_controller.go#L537-L572
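To make the above concrete: gitops-engine derives `Job` health from `status.conditions` (written by the job controller linked above), so a `Job` only counts as completed once Kubernetes has recorded something like the following (an illustrative sketch with made-up timestamps, not our actual object; a freshly created `Job` has none of these fields and should therefore be `Progressing`):

```yaml
status:
  startTime: "2022-07-25T10:37:04Z"        # illustrative timestamps
  completionTime: "2022-07-25T10:41:30Z"
  succeeded: 1
  conditions:
    - type: Complete    # health_job.go maps this condition to Healthy
      status: "True"
```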
Generally it seems like the only way this could happen is if one of these expectations did not hold:
- Argo CD would not delete the `ExternalSecret` (and by proxy the `Secret`) unless `PreSync` was finished
- Argo CD would consider `PreSync` completed only if all `Job`s were completed
  - the `Job` was just created, so it could not have been completed
- Kubernetes would consider the `Job` completed only if there was at least 1 `Pod` pinned to it that has finished
- Kubernetes would consider the `Pod` finished when all containers have both started and finished (roughly the status sketched below)
  - this could not have happened, because the `Pod` was just created
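On the Kubernetes side, a "finished" `Pod` carries a terminal phase with all containers terminated, roughly like this (again illustrative values; a just-created `Pod` sits in `Pending` with no terminated containers):

```yaml
status:
  phase: Succeeded               # terminal phase set by the kubelet
  containerStatuses:
    - name: fa-app-migrate-ktm
      state:
        terminated:
          exitCode: 0
          reason: Completed
```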
Hi, can you please add the definition (YAML) of the PreSync job?
Done, you can find it collapsed at the bottom of the description.
Looks like the issue happens less frequently after putting Argo CD on half the number of twice-as-large instances and adjusting resource reservations and limits:
I reconfigured the cluster around 13:00-15:00 on Monday the 25th; there were some trailing issues until 18:00 ~~and almost nothing after that~~.
Might it be related to some race condition when performance is sub-optimal?
Could it be that events are received/processed out of order?
Another case of a hiccup. I suspect the Argo CD application controller might not be handling its cache properly under load? It seems to think it still has the Job from 3 seconds earlier (from before the deletion due to the hook deletion policy).
Seems like this is correlated with higher CPU usage on the application controller.
Continuing on the previous one: there is a `patch` on the `Application` just before the deletion of the Secret. I don't have the content of the patch, but I'm pretty sure it could be the status update marking `PreSync` completed.
maybe we could clear the cache in here? https://github.com/argoproj/gitops-engine/blob/2bc3fef13e0712cf177ba6cbcfb52283f3d9ca73/pkg/sync/sync_context.go#L1085-L1112
this might be caused by the hardcoded (low) number of processors; see https://github.com/argoproj/argo-cd/pull/10458 for a fix
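In the meantime, on versions where the worker counts are exposed, they can be raised through `argocd-cmd-params-cm` (a sketch with arbitrary values; check your version's docs for whether these keys exist):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cmd-params-cm
  namespace: argocd
data:
  # Application controller worker counts (defaults are 20 and 10).
  controller.status.processors: "50"
  controller.operation.processors: "25"
```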
this is heavily related to:
- https://github.com/argoproj/argo-cd/blob/9fac0f6ae6e52d6f4978a1eaaf51fbffb9c0958a/controller/sync.go#L465-L485 - a fix is suggested there
- https://github.com/argoproj/argo-cd/issues/4669 - the original issue
- https://github.com/argoproj/argo-cd/pull/4715 - a "good-enough" fix for the issue
note: the issue seems to have stopped happening after we switched to rendering Helm templates ahead of time (committing the manifests to Git and using the `directory` source).
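i.e. something along these lines (a sketch; the repo URL, paths, and Application name are placeholders):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: fa-app
  namespace: argocd
spec:
  project: default
  source:
    # Manifests are rendered with `helm template` in CI and committed,
    # so Argo CD just applies plain YAML from a directory.
    repoURL: https://github.com/REDACTED/rendered-manifests.git
    targetRevision: main
    path: fa-app/pr-prnum-ns
    directory:
      recurse: true
  destination:
    server: https://kubernetes.default.svc
    namespace: pr-prnum-ns
```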