[kube-prometheus-stack] grafana: Readiness probe failed: connect: connection refused

Open AndreasMurk opened this issue 1 year ago • 3 comments

Describe the bug

Hi!

I have deployed kube-prometheus-stack using FluxCD with the latest version, 56.6.2.

Prometheus along with Loki works fine. However, Grafana has some problems after a while.

It took approximately 60 minutes to start up fully, until all migrations had completed. Then, whenever I make changes in the dashboard (e.g. adding a new data source), the pod fails. After inspecting the logs, I found these error messages:

{"time": "2024-02-14T15:50:37.062173+00:00", "taskName": null, "msg": "Writing /tmp/dashboards/apiserver.json (ascii)", "level": "INFO"}
{"time": "2024-02-14T15:50:37.065761+00:00", "taskName": null, "msg": "Retrying (Retry(total=4, connect=9, read=5, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7ffaff8f8f80>: Failed to establish a new connection: [Errno 111] Connection refused')': /api/admin/provisioning/dashboards/reload", "level": "WARNING"}
{"time": "2024-02-14T15:50:39.266982+00:00", "taskName": null, "msg": "Retrying (Retry(total=3, connect=8, read=5, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7ffaff8f90a0>: Failed to establish a new connection: [Errno 111] Connection refused')': /api/admin/provisioning/dashboards/reload", "level": "WARNING"}
{"time": "2024-02-14T15:50:43.669076+00:00", "taskName": null, "msg": "Retrying (Retry(total=2, connect=7, read=5, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7ffaff8f9340>: Failed to establish a new connection: [Errno 111] Connection refused')': /api/admin/provisioning/dashboards/reload", "level": "WARNING"}
{"time": "2024-02-14T15:50:52.471752+00:00", "taskName": null, "msg": "Retrying (Retry(total=1, connect=6, read=5, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7ffaff8f96a0>: Failed to establish a new connection: [Errno 111] Connection refused')': /api/admin/provisioning/dashboards/reload", "level": "WARNING"}
{"time": "2024-02-14T15:51:10.074029+00:00", "taskName": null, "msg": "Retrying (Retry(total=0, connect=5, read=5, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7ffaff8f9820>: Failed to establish a new connection: [Errno 111] Connection refused')': /api/admin/provisioning/dashboards/reload", "level": "WARNING"}
{"time": "2024-02-14T15:51:10.076283+00:00", "taskName": null, "msg": "Received unknown exception: HTTPConnectionPool(host='localhost', port=3000): Max retries exceeded with url: /api/admin/provisioning/dashboards/reload (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7ffaff8f9a90>: Failed to establish a new connection: [Errno 111] Connection refused'))\n", "level": "ERROR"}
Traceback (most recent call last):
  File "/app/.venv/lib/python3.12/site-packages/urllib3/connection.py", line 203, in _new_conn
    sock = connection.create_connection(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/.venv/lib/python3.12/site-packages/urllib3/util/connection.py", line 85, in create_connection
    raise err
  File "/app/.venv/lib/python3.12/site-packages/urllib3/util/connection.py", line 73, in create_connection
    sock.connect(sa)
ConnectionRefusedError: [Errno 111] Connection refused
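
For reference, the spacing of the "Retrying" warnings above can be checked with a short sketch. It only parses the timestamps copied from the log; the helper name `retry_gaps` is made up for illustration. The gaps roughly double each time, which is consistent with exponential backoff, and the whole sequence spans about 33 seconds during which nothing accepted connections on localhost:3000.

```python
from datetime import datetime

# Timestamps copied from the five "Retrying" warnings in the log above.
RETRY_TIMESTAMPS = [
    "2024-02-14T15:50:37.065761+00:00",
    "2024-02-14T15:50:39.266982+00:00",
    "2024-02-14T15:50:43.669076+00:00",
    "2024-02-14T15:50:52.471752+00:00",
    "2024-02-14T15:51:10.074029+00:00",
]


def retry_gaps(timestamps):
    """Return the seconds elapsed between consecutive timestamps (1 decimal)."""
    parsed = [datetime.fromisoformat(ts) for ts in timestamps]
    return [
        round((later - earlier).total_seconds(), 1)
        for earlier, later in zip(parsed, parsed[1:])
    ]


if __name__ == "__main__":
    print(retry_gaps(RETRY_TIMESTAMPS))  # → [2.2, 4.4, 8.8, 17.6]
```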

The pod tries to restart but fails with the aforementioned error. Lens always reports: Readiness probe failed: Get "http://192.168.1.247:3000/api/health": dial tcp 192.168.1.247:3000: connect: connection refused
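
Both the sidecar error and the readiness probe failure say the same thing: nothing is accepting TCP connections on port 3000. A minimal sketch of that check is below; the function name `port_accepts_connections` is hypothetical, and it only tests whether the port is open, not whether /api/health returns a healthy response.

```python
import socket


def port_accepts_connections(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers ConnectionRefusedError ([Errno 111]) as well as timeouts
        # and name-resolution failures.
        return False


if __name__ == "__main__":
    # localhost:3000 is the address the sidecar in the log was trying to reach.
    print(port_accepts_connections("localhost", 3000))
```

If this returns False from inside the pod, the Grafana process itself is probably not listening yet (or has exited), so the Grafana container's own logs are likely a better place to look than the sidecar's.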

What's your helm version?

3.14.0

What's your kubectl version?

1.29.1

Which chart?

kube-prometheus-stack

What's the chart version?

56.6.2

What happened?

Making changes in the dashboard (e.g. adding new data sources such as Loki) fails with the Python error shown above.

I have also noticed that, since the newest release, the dashboard seems slower than in previous releases.

What you expected to happen?

The dashboard should set the data source correctly.

How to reproduce it?

  1. Enable Grafana and Loki in values.yaml
  2. Deploy using FluxCD or helm
  3. Add new Loki Datasource
  4. Check if Dashboard / Pod is still running
  5. Additionally check logs

Enter the changed values of values.yaml?

prometheus:
  ingress:
    enabled: true
    annotations:
      cert-manager.io/cluster-issuer: "letsencrypt-issuer"
      kubernetes.io/ingressClassName: nginx
      nginx.ingress.kubernetes.io/service-upstream: "true"

      # nginx-http-auth config:
      nginx.ingress.kubernetes.io/auth-type: basic
      # the name of the secret that contains the htpasswd hash (has to exist beforehand)
      nginx.ingress.kubernetes.io/auth-secret: prometheus-htpasswd
      # message to display on auth missing:
      nginx.ingress.kubernetes.io/auth-realm: 'Authentication Required - Prometheus'

    hosts:
      - prometheus.xxx

    path: /
    service:
      name: prometheus-prometheus-kube-prometheus-prometheus
      port: 9090
    tls:
      - secretName: prometheus-prod-secret
        hosts:
          - prometheus.xxx
  prometheusSpec:
    replicas: 1
    retention: 168h

    walCompression: true
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: "myBlock"
          resources:
            requests:
              storage: 50Gi

    # scrape all service monitorings without correct labeling
    podMonitorSelectorNilUsesHelmValues: false
    serviceMonitorSelectorNilUsesHelmValues: false
grafana:
  admin:
    existingSecret: grafana-admin-secret
    userKey: admin-user
    passwordKey: admin-password
  ingress:
    enabled: true
    annotations:
      cert-manager.io/cluster-issuer: "letsencrypt-issuer"
      kubernetes.io/ingress.class: nginx
      nginx.ingress.kubernetes.io/service-upstream: "true"
      # nginx-http-auth config:
      nginx.ingress.kubernetes.io/auth-type: basic
      # the name of the secret that contains the htpasswd hash (has to exist beforehand)
      nginx.ingress.kubernetes.io/auth-secret: prometheus-htpasswd
      # message to display on auth missing:
      nginx.ingress.kubernetes.io/auth-realm: 'Authentication Required - Grafana'
    hosts:
      - grafana.xxx
    path: /
    service:
      name: prometheus-grafana
      port: 3000
    tls:
      - secretName: grafana-xxx
        hosts:
          - grafana.xxx
  persistence:
    enabled: true
    type: pvc
    size: 10Gi
    storageClassName: "myStorageClass"

Enter the command that you execute and failing/misfunctioning.

helm install prometheus prometheus-community/kube-prometheus-stack --values values.yaml

Anything else we need to know?

No response

AndreasMurk • Feb 15 '24 08:02

I got this error because the pod couldn't write to the persistent storage location.
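
A rough sketch of how such a write check could look is below. It assumes the Grafana data directory is /var/lib/grafana and that it is the path backed by the persistent volume; both are assumptions, not something confirmed in this thread, and the helper name `is_writable_dir` is made up. Python may not be available in the Grafana container, so treat this as an illustration of the check rather than a ready-made command.

```python
import os
import tempfile


def is_writable_dir(path):
    """Return True if a file can actually be created inside path."""
    if not os.path.isdir(path):
        return False
    try:
        # Creating a real file catches read-only mounts and ownership
        # problems that a permission-bit check alone can miss.
        with tempfile.TemporaryFile(dir=path):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Assumption: /var/lib/grafana is the data directory on the PVC.
    print(is_writable_dir("/var/lib/grafana"))
```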

mschaefer-gresham • Feb 29 '24 15:02

Same issue here, with chart version 56.20.0:

repositories:
- name: prometheus-community 
  url: https://prometheus-community.github.io/helm-charts 


releases:
- name: kube-prometheus-stack
  namespace: monitoring
  chart: prometheus-community/kube-prometheus-stack
  version: 56.20.0
  installed: true
  values:
    - values.yaml

martinbe1io • Mar 10 '24 13:03