
Argo CD considers `PreSync` phase finished even though the Job was just created

Open · nazarewk opened this issue 2 years ago • 13 comments

Checklist:

  • [x] I've searched in the docs and FAQ for my answer: https://bit.ly/argocd-faq.
  • [ ] I've included steps to reproduce the bug.
  • [X] I've pasted the output of argocd version.

Describe the bug

I've noticed weird behavior that blocks our deployments every now and then (a couple of times a day): Argo CD considers the PreSync hook finished even though the Job was just created. The Job's Pod then hangs indefinitely waiting for a Secret (provided through an ExternalSecret) that was deleted because the PreSync phase had supposedly finished.

I've asked about it on Slack at https://cloud-native.slack.com/archives/C01TSERG0KZ/p1658409808237339 , but didn't find the cause.

Example timeline

  1. 10:36:54 Argo CD created the ExternalSecret, ESO created the Secret
  2. 10:37:02 Argo CD deleted the Job
  3. 10:37:04 Argo CD created a new Job
  4. 10:37:07 Argo CD deleted the ExternalSecret
    1. the Secret was garbage collected, then recreated from the ExternalSecret for a split second
  5. 14:12:05 Argo CD cleaned up the Job (I might have terminated the sync manually)

Extra info:

  • the Pod (and Job) got stuck indefinitely waiting for the Secret
  • I'm on Argo CD 2.3.3
  • the ExternalSecret is created in wave -2, the Job is at hook weight 20 (see the sketch below)
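
For illustration, roughly how the two resources are ordered (the ExternalSecret manifest is only in the linked sheet, so its apiVersion, name, and annotation here are assumptions; the Job annotations are copied from the manifest at the bottom of this description):

```yaml
# Assumed shape of the ExternalSecret: synced early (wave -2) so the Secret
# exists before the migration Job runs; name/apiVersion are placeholders.
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: fa-app-hooks
  annotations:
    argocd.argoproj.io/sync-wave: "-2"
---
# PreSync hook Job at weight 20 (annotations as in the full manifest below);
# Argo CD maps pre-install/pre-upgrade Helm hooks to its PreSync phase.
apiVersion: batch/v1
kind: Job
metadata:
  name: fa-app-migrate-kt
  annotations:
    helm.sh/hook: pre-install,pre-upgrade
    helm.sh/hook-weight: "20"
    argocd.argoproj.io/hook-delete-policy: BeforeHookCreation
```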

Seems more or less related to:

  • https://github.com/argoproj/argo-cd/issues/9861
  • https://github.com/argoproj/argo-cd/issues/4384#issuecomment-696896836

To Reproduce

No reliable way to reproduce; it happens randomly, in the ballpark of once every 10 deployments.

Expected behavior

Argo CD does not consider the PreSync phase finished until the Job has both started and completed (either successfully or with failure).

Version: image quay.io/argoproj/argocd:v2.3.3; not sure why `argocd version` reports 2.4.4 (that's the CLI's version).

Logs

Google Sheet gathering the following information:

  • Kubernetes APIServer logs:
    • all logs related to the application
    • filtered down to creations & modifications
    • filtered down to create/delete
    • extracted suspicious part of logs
  • example ExternalSecret manifest
  • Application Controller logs
Job manifest
apiVersion: batch/v1
kind: Job
metadata:
  name: fa-app-migrate-kt
  labels:
    app.kubernetes.io/instance: fa-app
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/version: "git-22eREDACTED0d5"
    helm.sh/chart: app-REDACTED
    tags.datadoghq.com/env: "pr-prnum-ns"
    tags.datadoghq.com/fa-app-migrate-kt.env: "pr-prnum-ns"
    tags.datadoghq.com/service: "fa-app-migrate-kt"
    tags.datadoghq.com/version: "git-22eREDACTED0d5"
    tags.datadoghq.com/fa-app-migrate-kt.service: "fa-app-migrate-kt"
    tags.datadoghq.com/fa-app-migrate-kt.version: "git-22eREDACTED0d5"
    image-tag: "git-22eREDACTED0d5"
    job-type: init
  annotations:
    helm.sh/hook: pre-install,pre-upgrade
    helm.sh/hook-delete-policy: before-hook-creation
    argocd.argoproj.io/hook-delete-policy: BeforeHookCreation
    helm.sh/hook-weight: "20"
spec:
  backoffLimit: 5
  template:
    metadata:
      labels:
        job-type: init
        app.kubernetes.io/instance: fa-app
        app.kubernetes.io/managed-by: Helm
        app.kubernetes.io/version: "git-22eREDACTED0d5"
        helm.sh/chart: app-REDACTED
        
        tags.datadoghq.com/env: "pr-prnum-ns"
        tags.datadoghq.com/fa-app-migrate-kt.env: "pr-prnum-ns"
        tags.datadoghq.com/service: "fa-app-migrate-kt"
        tags.datadoghq.com/version: "git-22eREDACTED0d5"
        tags.datadoghq.com/fa-app-migrate-kt.service: "fa-app-migrate-kt"
        tags.datadoghq.com/fa-app-migrate-kt.version: "git-22eREDACTED0d5"
      annotations:
        cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
        ad.datadoghq.com/fa-app-migrate-kt-copy.logs: "[{\n  \"source\": \"km-jobs\"\n}]\n"
        ad.datadoghq.com/fa-app-migrate-ktm.logs: "[{\n  \"source\": \"km-jobs\"\n}]\n"
    spec:
      restartPolicy: Never
      initContainers:
        - name: fa-app-migrate-kt-copy
          image: REDACTED/fa-app:git-22eREDACTED0d5
          command: [REDACTED]
          env:
            - name: APP_NAME
              value: fa-app-migrate-kt
          envFrom:
            - configMapRef:
                name: fa-app-hooks
            - secretRef:
                name: fa-app-hooks
          securityContext:
            allowPrivilegeEscalation: false
          volumeMounts:
            - name: kc
              mountPath: /kc
      containers:
        - name: fa-app-migrate-ktm
          image: "REDACTED/km:git-522c04babb111b298d6a897cf12960eb35868082"
          command: [REDACTED]
          env:
            - name: APP_NAME
              value: fa-app-migrate-kt
            - name: KT_FILE
              value: kc/kt.yaml
          
          envFrom:
            - configMapRef:
                name: fa-app-hooks
            - secretRef:
                name: fa-app-hooks
          resources:
            limits:
              cpu: 1
              memory: 512Mi
            requests:
              cpu: 0.5
              memory: 512Mi
          securityContext:
            allowPrivilegeEscalation: false
          volumeMounts:
            - name: kc
              mountPath: /usr/src/app/kc
      nodeSelector:
      
      tolerations:
      
      volumes:
        - name: kc
          emptyDir: {}

nazarewk avatar Jul 22 '22 08:07 nazarewk

I've tracked down some of the responsible code, but found nothing wrong with it:

  • https://github.com/argoproj/gitops-engine/blob/v0.6.2/pkg/sync/sync_context.go#L611-L621
  • https://github.com/argoproj/gitops-engine/blob/v0.6.2/pkg/sync/sync_context.go#L277-L300
  • https://github.com/argoproj/gitops-engine/blob/v0.6.2/pkg/health/health_job.go

nazarewk avatar Jul 22 '22 08:07 nazarewk

There is also the Kubernetes-side code for our version (EKS 1.20.5):

  • Pod Phase logic https://github.com/kubernetes/kubernetes/blob/6b1d87acf3c8253c123756b9e61dac642678305f/pkg/kubelet/kubelet_pods.go#L1378-L1378
  • Job Completed logic https://github.com/kubernetes/kubernetes/blob/6b1d87acf3c8253c123756b9e61dac642678305f/pkg/controller/job/job_controller.go#L537-L572

nazarewk avatar Jul 22 '22 08:07 nazarewk

Generally, it seems like the only way this could be correct is if all of the following held (an illustrative completed-Job status follows the list):

  1. Argo CD would not delete the ExternalSecret (and, by proxy, the Secret) unless the PreSync phase was finished
  2. Argo CD would consider PreSync completed only if all hook Jobs were completed
    • the Job was just created, so it could not have been completed
  3. Kubernetes would consider the Job completed only if at least one finished Pod was pinned to it
  4. Kubernetes would consider the Pod finished only when all of its containers had both started and finished
    • this could not have happened, because the Pod was just created
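
For reference, a completed Job's status looks roughly like this (timestamps and counts below are illustrative); a Job created only a few seconds earlier cannot carry the Complete condition yet:

```yaml
# Illustrative status of a genuinely finished Job; a just-created Job has
# none of these fields and should still be considered in progress.
status:
  startTime: "2022-07-22T10:37:04Z"
  completionTime: "2022-07-22T10:40:00Z"
  succeeded: 1
  conditions:
    - type: Complete          # or Failed, once backoffLimit is exhausted
      status: "True"
      lastTransitionTime: "2022-07-22T10:40:00Z"
```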

nazarewk avatar Jul 22 '22 08:07 nazarewk

Hi, can you please add the definition (YAML) of the PreSync job?

OmerKahani avatar Jul 28 '22 04:07 OmerKahani

Hi, can you please add the definition (YAML) of the PreSync job?

Done, you can find it collapsed at the bottom of the description.

nazarewk avatar Jul 28 '22 06:07 nazarewk

Looks like the issue happens less frequently after putting Argo CD on half as many, twice as large instances and adjusting resource reservations and limits: screenshot-2022-07-28_09-39-54 I reconfigured the cluster around 13:00-15:00 on Monday the 25th, then there were some trailing issues until 18:00 ~~and almost nothing after that~~. Might it be related to some race condition when performance is sub-optimal?

nazarewk avatar Jul 28 '22 07:07 nazarewk

Could it be that events are received/processed out of order?

nazarewk avatar Aug 09 '22 12:08 nazarewk

Another case of a hiccup. I suspect the Argo CD application controller might not be handling its cache properly under load? It seems to think it still has the Job from 3 seconds earlier (from before the deletion due to the hook deletion policy)?

screenshot-2022-08-12_12-42-15

This seems to be correlated with higher CPU usage on the Application Controller: screenshot-2022-08-12_12-46-51

nazarewk avatar Aug 12 '22 10:08 nazarewk

Continuing on the previous one: there is a patch on the Application just before the deletion of the Secret. I don't have the content of the patch, but I'm pretty sure it could be the status update marking PreSync as completed. screenshot-2022-08-12_12-56-43

nazarewk avatar Aug 12 '22 10:08 nazarewk

maybe we could clear cache in here? https://github.com/argoproj/gitops-engine/blob/2bc3fef13e0712cf177ba6cbcfb52283f3d9ca73/pkg/sync/sync_context.go#L1085-L1112

nazarewk avatar Aug 12 '22 12:08 nazarewk

This might be caused by the hardcoded (low) number of processors; see https://github.com/argoproj/argo-cd/pull/10458 for a fix.

nazarewk avatar Aug 31 '22 08:08 nazarewk

This is heavily related to:

  1. https://github.com/argoproj/argo-cd/blob/9fac0f6ae6e52d6f4978a1eaaf51fbffb9c0958a/controller/sync.go#L465-L485 - there is a fix suggested
  2. https://github.com/argoproj/argo-cd/issues/4669 - original issue
  3. https://github.com/argoproj/argo-cd/pull/4715 - "good-enough" fix for the issue

nazarewk avatar Sep 13 '22 08:09 nazarewk

Note: it seems the issue stopped happening after we switched to rendering Helm templates ahead of time (committing the manifests to git and using a directory source).
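
For anyone wanting to try the same workaround, a minimal sketch of such an Application (repo URL, path, and names are placeholders; the manifests in that path are the `helm template` output committed by CI):

```yaml
# Sketch of the workaround: point Argo CD at pre-rendered plain YAML instead
# of letting it render the Helm chart itself.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: fa-app
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://example.com/rendered-manifests.git   # placeholder
    targetRevision: main
    path: rendered/fa-app                                  # placeholder
    directory:
      recurse: true
  destination:
    server: https://kubernetes.default.svc
    namespace: pr-prnum-ns                                 # placeholder namespace
  syncPolicy:
    automated: {}
```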

nazarewk avatar Sep 13 '22 09:09 nazarewk