[kube-prometheus-stack] prometheus pod not starting due to endless WAL recovery loop
Describe the bug
Issue is similar to https://github.com/prometheus-operator/prometheus-operator/issues/3391
It's in an endless WAL recovery loop and failing on the startup probe.
We have tried adding the following to values.yaml, but it does not take effect:
livenessProbe:
  failureThreshold: 1000
readinessProbe:
  failureThreshold: 1000
startupProbe:
  failureThreshold: 1000
What's your helm version?
v3.11.2
What's your kubectl version?
v1.25.2
Which chart?
kube-prometheus-stack
What's the chart version?
0.57.0
What happened?
Prometheus pod terminated and is unable to start up
What did you expect to happen?
No response
How to reproduce it?
No response
Enter the changed values of values.yaml?
No response
Enter the command that you executed that is failing/misfunctioning.
kubectl -n
Anything else we need to know?
No response
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Any further update will cause the issue/pull request to no longer be considered stale. Thank you for your contributions.
I think you are running into the following issue: https://github.com/prometheus/prometheus/issues/6934. As a workaround, allocate slightly higher memory for Prometheus during startup by expanding the limits. Something along the lines of:
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: kube-prometheus-stack-prometheus
spec:
  ...
  resources:
    limits:
      cpu: 3072m
      memory: 18000Mi  # dramatically higher to allow it some breathing room, or remove it entirely
    requests:
      cpu: 2048m
      memory: 4096Mi
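If you are setting this through the kube-prometheus-stack chart rather than editing the Prometheus resource directly, the equivalent lives under prometheus.prometheusSpec in values.yaml. A minimal sketch (the numbers are illustrative; size them for your own WAL):

prometheus:
  prometheusSpec:
    resources:
      limits:
        cpu: 3072m
        memory: 18000Mi  # oversized to give WAL replay headroom, or omit the limit entirely
      requests:
        cpu: 2048m
        memory: 4096Mi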
This is still an issue. Had the same problem: the Prometheus container gets SIGTERM because it takes too long to start while reading the WAL files. The Helm chart does not provide any setting to modify the startupProbe, which seems to default to 15 minutes.
Deleting the WAL files "fixed it", but it's not a solution.
Same issue here. We increased the memory limit to 32GB and still couldn't recover Prometheus; we had to delete the previous data.
We still face this issue. What can we do to fix it?
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Any further update will cause the issue/pull request to no longer be considered stale. Thank you for your contributions.
The issue is still here
I have the same issue. Chart name: prometheus, chart version: 25.1.0.
Same issue. The pod's limits are in the mid-20Gi range. I can't even shell into the pod to delete the WAL.
I'm experiencing the same issue: in the case of a redeployment (i.e. a config change) the pod will often go to CrashLoopBackOff with the error:
level=error err="opening storage failed: get segment range: segments are not sequential"
and, reliably, the WAL folder will have files starting from 0.
Our setup is Prometheus deployed on AKS with azureblob-fuse-premium persistent storage, and quite high resources:
limits:
  cpu: 2
  memory: 18Gi
requests:
  cpu: 100m
  memory: 4Gi
Prometheus deployed on AKS with azureblob-fuse-premium
That storage type is not supported by Prometheus.
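If you want to stay on AKS, a disk-backed PVC through the chart's storageSpec avoids fuse-mounted blob storage entirely. A minimal sketch, assuming the managed-csi StorageClass that AKS ships by default (swap in your own class and size):

prometheus:
  prometheusSpec:
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: managed-csi  # assumption: the AKS default disk class
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 100Gi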
You can increase the startupProbe timeout by setting maximumStartupDurationSeconds if you are using the Helm chart: https://github.com/prometheus-community/helm-charts/blob/main/charts/kube-prometheus-stack/values.yaml#L3993C5-L3993C34
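For example, in values.yaml (3600 is an arbitrary ceiling; pick a value that covers your longest observed WAL replay):

prometheus:
  prometheusSpec:
    # The operator derives the startup probe from this value;
    # the default works out to roughly 15 minutes, as noted above.
    maximumStartupDurationSeconds: 3600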