
Argo CD considers `PreSync` phase finished even though the Job was just created

Open · nazarewk opened this issue 2 years ago • 13 comments

Checklist:

  • [x] I've searched in the docs and FAQ for my answer: https://bit.ly/argocd-faq.
  • [ ] I've included steps to reproduce the bug.
  • [X] I've pasted the output of argocd version.

Describe the bug

I've noticed weird behavior that blocks our deployments every now and then (a couple of times a day): Argo CD considers the PreSync hook finished even though the Job was just created. The Job's Pod then hangs indefinitely waiting for a Secret (provided through an ExternalSecret) that was deleted because the PreSync phase had supposedly finished.

I've asked about it on Slack at https://cloud-native.slack.com/archives/C01TSERG0KZ/p1658409808237339 , but didn't find the cause.

Example timeline

  1. 10:36:54 Argo CD created the ExternalSecret, ESO created the Secret
  2. 10:37:02 Argo CD deleted the Job
  3. 10:37:04 Argo CD created a new Job
  4. 10:37:07 Argo CD deleted the ExternalSecret
    1. the Secret was garbage collected, then recreated from the ExternalSecret for a split second
  5. 14:12:05 Argo CD cleaned up the Job (I might have terminated the sync manually)

Extra info:

  • the Pod (and Job) got stuck indefinitely waiting for the Secret
  • I'm on Argo CD 2.3.3
  • the ExternalSecret is created in wave -2, the Job is at hook weight 20 (see the sketch below)
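
For illustration, roughly how the two resources are ordered (the ExternalSecret manifest is only in the linked sheet, so its apiVersion, name, and annotation here are assumptions; the Job annotations are copied from the manifest at the bottom of this description):

```yaml
# Assumed shape of the ExternalSecret: synced early (wave -2) so the Secret
# exists before the migration Job runs; name/apiVersion are placeholders.
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: fa-app-hooks
  annotations:
    argocd.argoproj.io/sync-wave: "-2"
---
# PreSync hook Job at weight 20 (annotations as in the full manifest below);
# Argo CD maps pre-install/pre-upgrade Helm hooks to its PreSync phase.
apiVersion: batch/v1
kind: Job
metadata:
  name: fa-app-migrate-kt
  annotations:
    helm.sh/hook: pre-install,pre-upgrade
    helm.sh/hook-weight: "20"
    argocd.argoproj.io/hook-delete-policy: BeforeHookCreation
```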

Seems more or less related to:

  • https://github.com/argoproj/argo-cd/issues/9861
  • https://github.com/argoproj/argo-cd/issues/4384#issuecomment-696896836

To Reproduce

No reliable way to reproduce; it happens randomly, in the ballpark of once every 10 deployments.

Expected behavior

Argo CD does not consider the PreSync phase finished until the Job has both started and completed (either successfully or with failure).

Version: image quay.io/argoproj/argocd:v2.3.3; not sure why `argocd version` reports 2.4.4 (that's the CLI's version).

Logs

Google Sheet gathering the following information:

  • Kubernetes APIServer logs:
    • all logs related to the application
    • filtered down to creations & modifications
    • filtered down to create/delete
    • extracted suspicious part of logs
  • example ExternalSecret manifest
  • Application Controller logs
Job manifest
apiVersion: batch/v1
kind: Job
metadata:
  name: fa-app-migrate-kt
  labels:
    app.kubernetes.io/instance: fa-app
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/version: "git-22eREDACTED0d5"
    helm.sh/chart: app-REDACTED
    tags.datadoghq.com/env: "pr-prnum-ns"
    tags.datadoghq.com/fa-app-migrate-kt.env: "pr-prnum-ns"
    tags.datadoghq.com/service: "fa-app-migrate-kt"
    tags.datadoghq.com/version: "git-22eREDACTED0d5"
    tags.datadoghq.com/fa-app-migrate-kt.service: "fa-app-migrate-kt"
    tags.datadoghq.com/fa-app-migrate-kt.version: "git-22eREDACTED0d5"
    image-tag: "git-22eREDACTED0d5"
    job-type: init
  annotations:
    helm.sh/hook: pre-install,pre-upgrade
    helm.sh/hook-delete-policy: before-hook-creation
    argocd.argoproj.io/hook-delete-policy: BeforeHookCreation
    helm.sh/hook-weight: "20"
spec:
  backoffLimit: 5
  template:
    metadata:
      labels:
        job-type: init
        app.kubernetes.io/instance: fa-app
        app.kubernetes.io/managed-by: Helm
        app.kubernetes.io/version: "git-22eREDACTED0d5"
        helm.sh/chart: app-REDACTED
        
        tags.datadoghq.com/env: "pr-prnum-ns"
        tags.datadoghq.com/fa-app-migrate-kt.env: "pr-prnum-ns"
        tags.datadoghq.com/service: "fa-app-migrate-kt"
        tags.datadoghq.com/version: "git-22eREDACTED0d5"
        tags.datadoghq.com/fa-app-migrate-kt.service: "fa-app-migrate-kt"
        tags.datadoghq.com/fa-app-migrate-kt.version: "git-22eREDACTED0d5"
      annotations:
        cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
        ad.datadoghq.com/fa-app-migrate-kt-copy.logs: "[{\n  \"source\": \"km-jobs\"\n}]\n"
        ad.datadoghq.com/fa-app-migrate-ktm.logs: "[{\n  \"source\": \"km-jobs\"\n}]\n"
    spec:
      restartPolicy: Never
      initContainers:
        - name: fa-app-migrate-kt-copy
          image: REDACTED/fa-app:git-22eREDACTED0d5
          command: [REDACTED]
          env:
            - name: APP_NAME
              value: fa-app-migrate-kt
          envFrom:
            - configMapRef:
                name: fa-app-hooks
            - secretRef:
                name: fa-app-hooks
          securityContext:
            allowPrivilegeEscalation: false
          volumeMounts:
            - name: kc
              mountPath: /kc
      containers:
        - name: fa-app-migrate-ktm
          image: "REDACTED/km:git-522c04babb111b298d6a897cf12960eb35868082"
          command: [REDACTED]
          env:
            - name: APP_NAME
              value: fa-app-migrate-kt
            - name: KT_FILE
              value: kc/kt.yaml
          
          envFrom:
            - configMapRef:
                name: fa-app-hooks
            - secretRef:
                name: fa-app-hooks
          resources:
            limits:
              cpu: 1
              memory: 512Mi
            requests:
              cpu: 0.5
              memory: 512Mi
          securityContext:
            allowPrivilegeEscalation: false
          volumeMounts:
            - name: kc
              mountPath: /usr/src/app/kc
      nodeSelector:
      
      tolerations:
      
      volumes:
        - name: kc
          emptyDir: {}

nazarewk avatar Jul 22 '22 08:07 nazarewk

I've tracked down some of the responsible code, but found nothing wrong with it:

  • https://github.com/argoproj/gitops-engine/blob/v0.6.2/pkg/sync/sync_context.go#L611-L621
  • https://github.com/argoproj/gitops-engine/blob/v0.6.2/pkg/sync/sync_context.go#L277-L300
  • https://github.com/argoproj/gitops-engine/blob/v0.6.2/pkg/health/health_job.go

nazarewk avatar Jul 22 '22 08:07 nazarewk

There is also the Kubernetes-side code for our version (EKS 1.20.5):

  • Pod Phase logic https://github.com/kubernetes/kubernetes/blob/6b1d87acf3c8253c123756b9e61dac642678305f/pkg/kubelet/kubelet_pods.go#L1378-L1378
  • Job Completed logic https://github.com/kubernetes/kubernetes/blob/6b1d87acf3c8253c123756b9e61dac642678305f/pkg/controller/job/job_controller.go#L537-L572

nazarewk avatar Jul 22 '22 08:07 nazarewk

Generally, it seems like the only way this could be correct is if all of the following held (an illustrative completed-Job status follows the list):

  1. Argo CD would not delete the ExternalSecret (and, by proxy, the Secret) unless the PreSync phase was finished
  2. Argo CD would consider PreSync completed only if all hook Jobs were completed
    • the Job was just created, so it could not have been completed
  3. Kubernetes would consider the Job completed only if at least one finished Pod was pinned to it
  4. Kubernetes would consider the Pod finished only when all of its containers had both started and finished
    • this could not have happened, because the Pod was just created
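
For reference, a completed Job's status looks roughly like this (timestamps and counts below are illustrative); a Job created only a few seconds earlier cannot carry the Complete condition yet:

```yaml
# Illustrative status of a genuinely finished Job; a just-created Job has
# none of these fields and should still be considered in progress.
status:
  startTime: "2022-07-22T10:37:04Z"
  completionTime: "2022-07-22T10:40:00Z"
  succeeded: 1
  conditions:
    - type: Complete          # or Failed, once backoffLimit is exhausted
      status: "True"
      lastTransitionTime: "2022-07-22T10:40:00Z"
```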

nazarewk avatar Jul 22 '22 08:07 nazarewk

Hi, can you please add the definition (YAML) of the PreSync job?

OmerKahani avatar Jul 28 '22 04:07 OmerKahani

Hi, can you please add the definition (YAML) of the PreSync job?

Done, you can find it collapsed at the bottom of the description.

nazarewk avatar Jul 28 '22 06:07 nazarewk

Looks like the issue happens less frequently after putting Argo CD on half as many, twice as large instances and adjusting resource reservations and limits: screenshot-2022-07-28_09-39-54 I reconfigured the cluster around 13:00-15:00 on Monday the 25th, then there were some trailing issues until 18:00 ~~and almost nothing after that~~. Might it be related to some race condition when performance is sub-optimal?

nazarewk avatar Jul 28 '22 07:07 nazarewk

Could it be that events are received/processed out of order?

nazarewk avatar Aug 09 '22 12:08 nazarewk

Another case of a hiccup. I suspect the Argo CD application controller might not be handling its cache properly under load? It seems to think it still has the Job from 3 seconds earlier (from before the deletion due to the hook deletion policy)?

screenshot-2022-08-12_12-42-15

This seems to be correlated with higher CPU usage on the Application Controller: screenshot-2022-08-12_12-46-51

nazarewk avatar Aug 12 '22 10:08 nazarewk

Continuing on the previous one: there is a patch on the Application just before the deletion of the Secret. I don't have the content of the patch, but I'm pretty sure it could be the status update marking PreSync as completed. screenshot-2022-08-12_12-56-43

nazarewk avatar Aug 12 '22 10:08 nazarewk

maybe we could clear cache in here? https://github.com/argoproj/gitops-engine/blob/2bc3fef13e0712cf177ba6cbcfb52283f3d9ca73/pkg/sync/sync_context.go#L1085-L1112

nazarewk avatar Aug 12 '22 12:08 nazarewk

This might be caused by the hardcoded (low) number of processors; see https://github.com/argoproj/argo-cd/pull/10458 for a fix.

nazarewk avatar Aug 31 '22 08:08 nazarewk

This is heavily related to:

  1. https://github.com/argoproj/argo-cd/blob/9fac0f6ae6e52d6f4978a1eaaf51fbffb9c0958a/controller/sync.go#L465-L485 - there is a fix suggested
  2. https://github.com/argoproj/argo-cd/issues/4669 - original issue
  3. https://github.com/argoproj/argo-cd/pull/4715 - "good-enough" fix for the issue

nazarewk avatar Sep 13 '22 08:09 nazarewk

Note: it seems the issue stopped happening after we switched to rendering Helm templates ahead of time (committing the manifests to git and using a directory source).
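
For anyone wanting to try the same workaround, a minimal sketch of such an Application (repo URL, path, and names are placeholders; the manifests in that path are the `helm template` output committed by CI):

```yaml
# Sketch of the workaround: point Argo CD at pre-rendered plain YAML instead
# of letting it render the Helm chart itself.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: fa-app
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://example.com/rendered-manifests.git   # placeholder
    targetRevision: main
    path: rendered/fa-app                                  # placeholder
    directory:
      recurse: true
  destination:
    server: https://kubernetes.default.svc
    namespace: pr-prnum-ns                                 # placeholder namespace
  syncPolicy:
    automated: {}
```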

nazarewk avatar Sep 13 '22 09:09 nazarewk