[kube-prometheus-stack] prometheus pod not starting due to endless WAL recovery loop
Describe the bug
Issue is similar to https://github.com/prometheus-operator/prometheus-operator/issues/3391
It's in an endless WAL recovery loop and failing on the startup probe.
We have tried adding the following to values.yaml, but it does not take effect:
livenessProbe:
  failureThreshold: 1000
readinessProbe:
  failureThreshold: 1000
startupProbe:
  failureThreshold: 1000
What's your helm version?
v3.11.2
What's your kubectl version?
v1.25.2
Which chart?
kube-prometheus-stack
What's the chart version?
0.57.0
What happened?
Prometheus pod terminated and is unable to start up
What did you expect to happen?
No response
How to reproduce it?
No response
Enter the changed values of values.yaml?
No response
Enter the command that you executed that is failing/misfunctioning.
kubectl -n
Anything else we need to know?
No response
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Any further update will cause the issue/pull request to no longer be considered stale. Thank you for your contributions.
I think you are running into the following issue: https://github.com/prometheus/prometheus/issues/6934. As a workaround, allocate slightly higher memory for Prometheus during startup by expanding the limits. Something along the lines of:
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: kube-prometheus-stack-prometheus
spec:
  ...
  resources:
    limits:
      cpu: 3072m
      memory: 18000Mi  # dramatically higher to allow it some breathing room, or remove it entirely
    requests:
      cpu: 2048m
      memory: 4096Mi
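If you are setting this through the kube-prometheus-stack chart rather than editing the Prometheus resource directly, the equivalent lives under prometheus.prometheusSpec in values.yaml. A minimal sketch (the numbers are illustrative; size them for your own WAL):

prometheus:
  prometheusSpec:
    resources:
      limits:
        cpu: 3072m
        memory: 18000Mi  # oversized to give WAL replay headroom, or omit the limit entirely
      requests:
        cpu: 2048m
        memory: 4096Mi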
This is still an issue. Had the same problem: the Prometheus container gets SIGTERM because it takes too long to start while reading the WAL files. The Helm chart does not provide any setting to modify the startupProbe, which seems to default to 15 minutes.
Deleting the WAL files "fixed it", but it's not a solution.
Same issue here. We increased the memory limit to 32GB and still couldn't recover Prometheus; we had to delete the previous data.
We still face this issue. What can we do to fix it?
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Any further update will cause the issue/pull request to no longer be considered stale. Thank you for your contributions.
The issue is still here
I have the same issue. Chart name: prometheus, chart version: 25.1.0.
Same issue. The pod's limits are in the mid-20Gi range. I can't even shell into the pod to delete the WAL.
I'm experiencing the same issue: in the case of a redeployment (i.e. a config change) the pod will often go to CrashLoopBackOff with the error:
level=error err="opening storage failed: get segment range: segments are not sequential"
and, reliably, the WAL folder will have files starting from 0.
Our setup is Prometheus deployed on AKS with azureblob-fuse-premium persistent storage, and quite high resources:
limits:
  cpu: 2
  memory: 18Gi
requests:
  cpu: 100m
  memory: 4Gi
Prometheus deployed on AKS with azureblob-fuse-premium
That storage type is not supported by Prometheus.
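If you want to stay on AKS, a disk-backed PVC through the chart's storageSpec avoids fuse-mounted blob storage entirely. A minimal sketch, assuming the managed-csi StorageClass that AKS ships by default (swap in your own class and size):

prometheus:
  prometheusSpec:
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: managed-csi  # assumption: the AKS default disk class
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 100Gi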
You can increase the startupProbe timeout by setting maximumStartupDurationSeconds if you are using the Helm chart: https://github.com/prometheus-community/helm-charts/blob/main/charts/kube-prometheus-stack/values.yaml#L3993C5-L3993C34
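For example, in values.yaml (3600 is an arbitrary ceiling; pick a value that covers your longest observed WAL replay):

prometheus:
  prometheusSpec:
    # The operator derives the startup probe from this value;
    # the default works out to roughly 15 minutes, as noted above.
    maximumStartupDurationSeconds: 3600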