[kube-prometheus-stack] deployment via argocd stuck in "pending deletion" for kube-prometheus-stack-admission-create job

Open jkleinlercher opened this issue 9 months ago • 1 comments

Describe the bug

When deploying kube-prometheus-stack with Argo CD, the sync sometimes gets stuck in "pending deletion" for the Job kube-prometheus-stack-admission-create. We then have to terminate the Argo CD sync and resync manually to recover, which rules out full automation.

According to https://github.com/argoproj/argo-cd/issues/6880, this is a well-known problem when a PreHook Job has

ttlSecondsAfterFinished: 0

defined: Kubernetes deletes the Job immediately after it finishes while Argo CD also tries to delete it, which results in a race condition.
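
For illustration, a Job of the kind described in that Argo CD issue might look like the following. This is only a sketch: the name, image, and command are placeholders, and the annotations are Argo CD's generic hook annotations, not something copied from this chart.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: example-pre-hook              # placeholder name, not from the chart
  annotations:
    argocd.argoproj.io/hook: PreSync
    argocd.argoproj.io/hook-delete-policy: BeforeHookCreation
spec:
  # With a TTL of 0, Kubernetes may delete the Job as soon as it finishes,
  # while Argo CD also wants to delete the same hook resource.
  ttlSecondsAfterFinished: 0
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: main
          image: busybox              # placeholder image
          command: ["sh", "-c", "echo done"]
```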

I propose making the value of ttlSecondsAfterFinished in https://github.com/prometheus-community/helm-charts/blob/9c41858ac9714483638d78fb560577dc37e55875/charts/kube-prometheus-stack/templates/prometheus-operator/admission-webhooks/job-patch/job-createSecret.yaml#L19 configurable via a Helm value.
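
A rough sketch of what this could look like in the Job template. The value name `prometheusOperator.admissionWebhooks.patch.ttlSecondsAfterFinished` is an assumption made for illustration, not an existing key in chart version 58.2.2.

```yaml
# Template fragment (sketch). Assumed values.yaml entry:
#   prometheusOperator:
#     admissionWebhooks:
#       patch:
#         ttlSecondsAfterFinished: 60
spec:
  {{- $ttl := .Values.prometheusOperator.admissionWebhooks.patch.ttlSecondsAfterFinished }}
  {{- /* Render the field only when the value is set; an explicit 0 is still rendered. */}}
  {{- if not (kindIs "invalid" $ttl) }}
  ttlSecondsAfterFinished: {{ $ttl }}
  {{- end }}
```

The `kindIs "invalid"` check is used instead of a plain `if` so that a configured value of 0 is not treated as "unset".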

What's your helm version?

3.14.3

What's your kubectl version?

v1.29.2

Which chart?

kube-prometheus-stack

What's the chart version?

58.2.2

What happened?

When deploying kube-prometheus-stack with Argo CD, the sync sometimes gets stuck in "pending deletion" for the Job kube-prometheus-stack-admission-create. We then have to terminate the Argo CD sync and resync manually to recover, which rules out full automation.

What you expected to happen?

The Argo CD sync of this chart completes without failures and without getting stuck.

How to reproduce it?

Set up an Argo CD Application with:

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: kube-prometheus-stack-jokl
spec:
  destination:
    name: ''
    namespace: jokl-test
    server: 'https://kubernetes.default.svc'
  source:
    path: ''
    repoURL: 'https://prometheus-community.github.io/helm-charts'
    targetRevision: 58.2.2
    chart: kube-prometheus-stack
  sources: []
  project: default
  syncPolicy:
    syncOptions:
      - CreateNamespace=true
      - ServerSideApply=true

Since it is a race condition, it does not happen every time, but it does happen very often.

Enter the changed values of values.yaml?

NONE

Enter the command that you execute and failing/misfunctioning.

n.a.

Anything else we need to know?

No response

jkleinlercher, Apr 30 '24 14:04

Right after opening this issue and a PR for it, I realized that ttlSecondsAfterFinished is not set at all, because the attribute is wrapped in a "batch/v1alpha1" condition that is not met in current clusters:

https://github.com/prometheus-community/helm-charts/blob/9c41858ac9714483638d78fb560577dc37e55875/charts/kube-prometheus-stack/templates/prometheus-operator/admission-webhooks/job-patch/job-createSecret.yaml#L17-L20
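
As a sketch of the pattern (not a verbatim copy of the linked lines, which remain authoritative), such a guard would have roughly this shape, assuming it uses Helm's built-in `.Capabilities.APIVersions.Has` check:

```yaml
spec:
  {{- /* The condition is not met in current clusters, so the field below is never rendered. */}}
  {{- if .Capabilities.APIVersions.Has "batch/v1alpha1" }}
  ttlSecondsAfterFinished: 0
  {{- end }}
```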

... so I need to investigate again. I will leave this issue open, but please consider it still under investigation.

jkleinlercher, Apr 30 '24 14:04

Funnily enough, we never ran into this issue before this change, but now we need to set ttlSecondsAfterFinished because we consistently get this error 😂

rouke-broersma, May 01 '24 09:05