
[kube-prometheus-stack] prometheus pod goes terminated - completed after some time

Open anthonyrfarias opened this issue 1 year ago • 6 comments

Describe the bug

I'm having issues with this chart. After installation everything works fine, but after a short time (about 1 hour) the Prometheus/Grafana pod goes to status Terminated - Completed and stops gathering metrics:

image

This is how I installed the helm chart:

helm install -f /home/anthony/proyectos/multivende/main_to_publish/k8s/miscellaneous/production/values.yml prom-grafana prometheus-community/kube-prometheus-stack

These are my values:

prometheus:
  server:
    persistentVolume:
      enabled: true
  prometheusSpec:
    storageSpec:
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 600Gi

grafana:
  enabled: true
  persistence:
    enabled: true
    type: pvc
    storageClassName: gp2
    accessModes:
      - ReadWriteOnce
    size: 600Gi
    finalizers:
      - kubernetes.io/aws-ebs

What's your helm version?

version.BuildInfo{Version:"v3.5.2", GitCommit:"167aac70832d3a384f65f9745335e9fb40169dc2", GitTreeState:"dirty", GoVersion:"go1.15.7"}

What's your kubectl version?

Client Version: version.Info{Major:"1", Minor:"18+", GitVersion:"v1.18.9-eks-d1db3c", GitCommit:"d1db3c46e55f95d6a7d3e5578689371318f95ff9", GitTreeState:"clean", BuildDate:"2020-10-20T22:21:03Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"27+", GitVersion:"v1.27.8-eks-8cb36c9", GitCommit:"fca3a8722c88c4dba573a903712a6feaf3c40a51", GitTreeState:"clean", BuildDate:"2023-11-22T21:52:13Z", GoVersion:"go1.20.11", Compiler:"gc", Platform:"linux/amd64"}

Which chart?

prometheus-community/kube-prometheus-stack

What's the chart version?

latest

What happened?

No response

What you expected to happen?

I expected the pod to be running all the time.

How to reproduce it?

No response

Enter the changed values of values.yaml?

No response

Enter the command that you execute and failing/misfunctioning.

prometheus:
  server:
    persistentVolume:
      enabled: true
  prometheusSpec:
    storageSpec:
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 600Gi

grafana:
  enabled: true
  persistence:
    enabled: true
    type: pvc
    storageClassName: gp2
    accessModes:
      - ReadWriteOnce
    size: 600Gi
    finalizers:
      - kubernetes.io/aws-ebs


Anything else we need to know?

No response

anthonyrfarias • Jan 22 '24 13:01

Hi, I just want to let you know that I am fighting nearly the same issue. In my case the operator throws errors saying that a serviceAccount is missing. After that the operator gets killed and the Prometheus and Grafana deployments disappear.

mladBlum • Feb 07 '24 09:02

Hi, I just want to let you know that I am fighting nearly the same issue. In my case the operator throws errors saying that a serviceAccount is missing. After that the operator gets killed and the Prometheus and Grafana deployments disappear.

This seems to be different from my problem. I kind of "fixed" mine by creating a CronJob that removes completed pods; they are then recreated and continue to gather metrics. You seem to have some sort of permission issue. Check that your service accounts exist and have the right permissions.
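The CronJob itself was not shared in this thread. As a rough sketch only, the two suggestions above (removing completed pods and checking a service account) might look like the following shell helpers. The function names, the default namespace, and the `list pods` permission being tested are all hypothetical, and the cleanup assumes the affected pods report phase `Succeeded`, which `kubectl` normally displays as `Completed`.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; the reporter's actual CronJob was not posted.

# Delete pods whose phase is Succeeded (shown as "Completed" by kubectl).
# $1 = namespace (assumed default: "default")
# Note: this removes every Succeeded pod in the namespace, including
# finished Job pods, not only Prometheus or Grafana pods.
delete_completed_pods() {
  kubectl delete pods \
    --namespace "${1:-default}" \
    --field-selector=status.phase=Succeeded
}

# Check that a service account exists and whether it may list pods.
# $1 = namespace, $2 = service account name
# "list pods" is only an example permission; substitute whatever the
# operator's error message says is missing.
check_service_account() {
  kubectl get serviceaccount "${2}" --namespace "${1}" &&
    kubectl auth can-i list pods \
      --as="system:serviceaccount:${1}:${2}" \
      --namespace "${1}"
}
```

For example, `delete_completed_pods monitoring` would clean up the `monitoring` namespace, and `check_service_account monitoring my-operator-sa` would test a placeholder account name. Deleting pods this way only works around the symptom; it does not explain why the pods terminate.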

What is the exact Helm chart version?

aabouzaid • Feb 21 '24 15:02

What is the exact Helm chart version?

It's kube-prometheus-stack-54.1.0
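For anyone who needs to answer the same question, one way to look up the installed chart version is `helm list`, whose CHART column has the form `<chart-name>-<version>`, as in the answer above. This is a small sketch: the function name is hypothetical, and `prom-grafana` is the release name taken from the install command in the issue.

```shell
#!/usr/bin/env bash
# Hypothetical helper: show the release whose name matches exactly.
# $1 = release name (assumed default: "prom-grafana", from the issue)
show_chart_version() {
  # --filter takes a regular expression, so anchor it to match one release.
  helm list --all-namespaces --filter "^${1:-prom-grafana}\$"
}
```

Calling `show_chart_version` with no argument looks up `prom-grafana` across all namespaces.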

I'm seeing the same with 56.21.3, deployed with these values:

grafana:
  additionalDataSources:
  - access: proxy
    jsonData:
      maxLines: 1000
      tlsSkipVerify: true
    name: Loki
    type: loki
    url: http://loki.loki.svc.cluster.local:3100
  defaultDashboardsEnabled: true
  persistence:
    enabled: true
    size: 500Mi
  sidecar:
    datasources:
      enabled: true
      label: grafana_datasource
prometheus:
  prometheusSpec:
    additionalScrapeConfigs:
    - job_name: gpu-metrics
      kubernetes_sd_configs:
      - namespaces:
          names:
          - gpu-operator
        role: endpoints
      metrics_path: /metrics
      relabel_configs:
      - action: replace
        source_labels:
        - __meta_kubernetes_pod_node_name
        target_label: kubernetes_node
      scheme: http
      scrape_interval: 1s
    podMonitorSelectorNilUsesHelmValues: false
    probeSelectorNilUsesHelmValues: false
    ruleSelectorNilUsesHelmValues: false
    serviceMonitorSelectorNilUsesHelmValues: false

artgillespie • Mar 18 '24 15:03

I've upgraded the chart to v57.0.2 (latest) and it's been working fine for some time now.

aabouzaid • Mar 19 '24 19:03
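The thread does not show the upgrade command that was used. As a sketch under that caveat, upgrading the `prom-grafana` release from the original install command to a pinned chart version might look like this. The function name and the values file argument are hypothetical, and any extra steps the chart's README lists for major-version upgrades (such as CRD updates) are not shown here.

```shell
#!/usr/bin/env bash
# Hypothetical helper: upgrade the release to a pinned chart version.
# $1 = chart version, $2 = path to the values file
upgrade_stack() {
  helm repo update &&
    helm upgrade prom-grafana prometheus-community/kube-prometheus-stack \
      --version "${1}" \
      --values "${2}"
}
```

For example, `upgrade_stack 57.0.2 values.yml` pins the version mentioned above instead of relying on whatever "latest" resolves to at install time.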