helm-charts
[prometheus] Crashloop in Prometheus Server when upgrading due to file lock
Describe the bug
The default configuration of the Prometheus chart results in a crashloop in the Prometheus server when upgrading. This is due to two factors in the default configuration:
- the Prometheus server runs as a Deployment (can be changed using server.statefulSet.enabled=true)
- Prometheus uses a file lock in its storage (didn't test, but it looks like this can be disabled by adding storage.tsdb.no-lockfile to server.extraFlags)
The combination of these 2 causes an upgrade to fail because the old pod (which holds the lock) is only terminated when the new pod (waiting for the lock to be released) is ready.
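For illustration, either workaround can be set at install/upgrade time. A minimal sketch only; the no-lockfile route is untested as noted above, and --set on a list value replaces any extraFlags already defined in your values, so merge it with flags you already use:

# Option 1: run the server as a StatefulSet instead of a Deployment
helm upgrade --install prometheus prometheus-community/prometheus --set server.statefulSet.enabled=true

# Option 2 (untested): disable the TSDB lock file via server.extraFlags
# (keep any flags already present in your values, e.g. web.enable-lifecycle)
helm upgrade --install prometheus prometheus-community/prometheus --set server.extraFlags="{web.enable-lifecycle,storage.tsdb.no-lockfile}"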
I don't see any reason why server.statefulSet.enabled=true is not the default behavior.
What's your helm version?
version.BuildInfo{Version:"v3.9.0", GitCommit:"7ceeda6c585217a19a1131663d8cd1f7d641b2a7", GitTreeState:"clean", GoVersion:"go1.17.5"}
What's your kubectl version?
WARNING: This version information is deprecated and will be replaced with the output from kubectl version --short. Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"1", Minor:"24", GitVersion:"v1.24.1", GitCommit:"3ddd0f45aa91e2f30c70734b175631bec5b5825a", GitTreeState:"clean", BuildDate:"2022-05-24T12:26:19Z", GoVersion:"go1.18.2", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v4.5.4
Server Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.3", GitCommit:"816c97ab8cff8a1c72eccca1026f7820e93e0d25", GitTreeState:"clean", BuildDate:"2022-01-25T21:19:12Z", GoVersion:"go1.17.6", Compiler:"gc", Platform:"linux/amd64"}
Which chart?
prometheus
What's the chart version?
15.10.1
What happened?
A crashloop occurs:
NAME READY STATUS RESTARTS AGE
prometheus-alertmanager-67bf5f77bb-rd6tj 2/2 Running 0 17m
prometheus-kube-state-metrics-748fc7f64-m4mgb 1/1 Running 0 17m
prometheus-node-exporter-cnvl2 1/1 Running 0 17m
prometheus-pushgateway-b6c9dc7db-scgjz 1/1 Running 0 17m
prometheus-server-6bbf87b66f-rs2jr 2/2 Running 0 17m
prometheus-server-855d6fdfd9-tn9ht 1/2 CrashLoopBackOff 8 (50s ago) 16m
due to the following error in the updated pod:
ts=2022-06-17T12:40:51.887Z caller=main.go:516 level=info msg="Starting Prometheus" version="(version=2.34.0, branch=HEAD, revision=881111fec4332c33094a6fb2680c71fffc427275)"
ts=2022-06-17T12:40:51.887Z caller=main.go:521 level=info build_context="(go=go1.17.8, user=root@121ad7ea5487, date=20220315-15:18:00)"
ts=2022-06-17T12:40:51.888Z caller=main.go:522 level=info host_details="(Linux 5.13.0-48-generic #54~20.04.1-Ubuntu SMP Thu Jun 2 23:37:17 UTC 2022 x86_64 prometheus-server-855d6fdfd9-tn9ht (none))"
ts=2022-06-17T12:40:51.888Z caller=main.go:523 level=info fd_limits="(soft=1048576, hard=1048576)"
ts=2022-06-17T12:40:51.888Z caller=main.go:524 level=info vm_limits="(soft=unlimited, hard=unlimited)"
ts=2022-06-17T12:40:51.889Z caller=web.go:540 level=info component=web msg="Start listening for connections" address=0.0.0.0:9090
ts=2022-06-17T12:40:51.889Z caller=main.go:937 level=info msg="Starting TSDB ..."
ts=2022-06-17T12:40:51.889Z caller=dir_locker.go:77 level=warn component=tsdb msg="A lockfile from a previous execution already existed. It was replaced" file=/data/lock
ts=2022-06-17T12:40:51.889Z caller=main.go:799 level=info msg="Stopping scrape discovery manager..."
ts=2022-06-17T12:40:51.889Z caller=main.go:813 level=info msg="Stopping notify discovery manager..."
ts=2022-06-17T12:40:51.889Z caller=main.go:835 level=info msg="Stopping scrape manager..."
ts=2022-06-17T12:40:51.889Z caller=main.go:809 level=info msg="Notify discovery manager stopped"
ts=2022-06-17T12:40:51.889Z caller=main.go:795 level=info msg="Scrape discovery manager stopped"
ts=2022-06-17T12:40:51.889Z caller=manager.go:946 level=info component="rule manager" msg="Stopping rule manager..."
ts=2022-06-17T12:40:51.889Z caller=manager.go:956 level=info component="rule manager" msg="Rule manager stopped"
ts=2022-06-17T12:40:51.889Z caller=notifier.go:600 level=info component=notifier msg="Stopping notification manager..."
ts=2022-06-17T12:40:51.889Z caller=main.go:1068 level=info msg="Notifier manager stopped"
ts=2022-06-17T12:40:51.889Z caller=main.go:829 level=info msg="Scrape manager stopped"
ts=2022-06-17T12:40:51.889Z caller=main.go:1077 level=error err="opening storage failed: lock DB directory: resource temporarily unavailable"
What you expected to happen?
I expect the default configuration to work out of the box.
How to reproduce it?
helm install prometheus prometheus-community/prometheus
kubectl set env deployment/prometheus-server test=xxx
Enter the changed values of values.yaml?
NONE
Enter the command that you execute and failing/misfunctioning.
helm install prometheus prometheus-community/prometheus
kubectl set env deployment/prometheus-server test=xxx
Anything else we need to know?
No response
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Any further update will cause the issue/pull request to no longer be considered stale. Thank you for your contributions.
Though I didn't see this issue when upgrading Prometheus, it suddenly occurred yesterday, out of nowhere, on a running Prometheus (which is already a StatefulSet)!
Message: ="Found healthy block" mint=1659859200000 maxt=1659866400000 ulid=01G9VZCCWFY7GZAQ3AS5G59MWH
ts=2022-08-07T18:50:49.999Z caller=repair.go:57 level=info component=tsdb msg="Found healthy block" mint=1659866400000 maxt=1659873600000 ulid=01G9W684526MMF69JZJKN15ZGX
ts=2022-08-07T18:50:49.999Z caller=repair.go:57 level=info component=tsdb msg="Found healthy block" mint=1659873600000 maxt=1659880800000 ulid=01G9WD3VCC7KEK6QKTFCFA5BN2
ts=2022-08-07T18:50:49.999Z caller=repair.go:57 level=info component=tsdb msg="Found healthy block" mint=1659880800000 maxt=1659888000000 ulid=01G9WKZJM4H6R8VR1B5MN75BAW
ts=2022-08-07T18:50:49.999Z caller=tls_config.go:231 level=info component=web msg="TLS is disabled." http2=false
ts=2022-08-07T18:50:50.001Z caller=dir_locker.go:77 level=warn component=tsdb msg="A lockfile from a previous execution already existed. It was replaced" file=/prometheus/lock
ts=2022-08-07T18:50:50.001Z caller=main.go:798 level=info msg="Stopping scrape discovery manager..."
ts=2022-08-07T18:50:50.001Z caller=main.go:812 level=info msg="Stopping notify discovery manager..."
ts=2022-08-07T18:50:50.001Z caller=main.go:834 level=info msg="Stopping scrape manager..."
ts=2022-08-07T18:50:50.001Z caller=main.go:808 level=info msg="Notify discovery manager stopped"
ts=2022-08-07T18:50:50.001Z caller=main.go:794 level=info msg="Scrape discovery manager stopped"
ts=2022-08-07T18:50:50.001Z caller=main.go:828 level=info msg="Scrape manager stopped"
ts=2022-08-07T18:50:50.001Z caller=manager.go:945 level=info component="rule manager" msg="Stopping rule manager..."
ts=2022-08-07T18:50:50.001Z caller=manager.go:955 level=info component="rule manager" msg="Rule manager stopped"
ts=2022-08-07T18:50:50.001Z caller=notifier.go:600 level=info component=notifier msg="Stopping notification manager..."
ts=2022-08-07T18:50:50.001Z caller=main.go:1054 level=info msg="Notifier manager stopped"
ts=2022-08-07T18:50:50.001Z caller=main.go:1063 level=error err="opening storage failed: lock DB directory: resource temporarily unavailable"
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Any further update will cause the issue/pull request to no longer be considered stale. Thank you for your contributions.
This issue is being automatically closed due to inactivity.
Had the same issue
Similar issue, but it doesn't seem like anyone is helping.
having the same issue with prometheus v2.36.2
also experiencing this on 2.36.1
having the same issue with prometheus v2.39.1
We also faced this issue (a StatefulSet is in use). We were able to mitigate it the following way (a rough command sketch follows the list):
- downscaling the Prometheus operator and the Prometheus StatefulSet
- mounting the PVCs on temporary pods
- removing the lock files (located at the top level) on all Prometheus PVCs
- rescaling the operator and the StatefulSet
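A rough kubectl sketch of these steps; the prometheus-operator, prometheus-server, and PVC claim names below are placeholders, so substitute your own release names and namespace:

# stop the operator from reconciling, then take the server down so nothing holds the volume
kubectl scale deployment prometheus-operator --replicas=0
kubectl scale statefulset prometheus-server --replicas=0

# mount the PVC in a throwaway pod and delete the top-level lock file
# (the lock path may be /data/lock or /prometheus/lock depending on the chart; both appear in the logs above)
kubectl run lock-cleanup --restart=Never --image=busybox --overrides='{"apiVersion":"v1","spec":{"containers":[{"name":"lock-cleanup","image":"busybox","command":["rm","-f","/prometheus/lock"],"volumeMounts":[{"name":"data","mountPath":"/prometheus"}]}],"volumes":[{"name":"data","persistentVolumeClaim":{"claimName":"prometheus-server"}}]}}'
# wait for the pod to complete, then clean it up
kubectl delete pod lock-cleanup

# bring everything back
kubectl scale statefulset prometheus-server --replicas=1
kubectl scale deployment prometheus-operator --replicas=1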
This was due to rpc-statd not running on the node that the pod was scheduled on, so NFS locking didn't work.
Check whether that applies to you; if so, just run sudo systemctl start rpc-statd.
You might want to add some alerts to ensure that service stays up as well.
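A quick check on the affected node (assumes a systemd host with NFS-backed storage; using enable --now so the service also comes back after a reboot is an addition beyond the plain start above):

systemctl is-active rpc-statd || sudo systemctl enable --now rpc-statd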
How do I do this in Docker? Prometheus version 2.42.
How to reproduce? You can easily do so by running the Prometheus Helm upgrade command after making changes that touch the deployment template.
How to resolve? Simply scale your prometheus-server deployment down so the old pod releases the lock, then scale it back up.
Bug as per my findings? The existing Prometheus pod is already occupying its Persistent Volume (PV) and running smoothly. However, when the Helm chart is upgraded with changes to the deployment, a new pod is created. Since the previous pod is still occupying the PV, the new pod crashloops because it cannot take over the same PV, and the rollout never completes. This situation can lead to a deadlock condition.
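A minimal sketch of that workaround, assuming the default prometheus-server deployment name from the thread above:

kubectl scale deployment prometheus-server --replicas=0
# wait until the old pod is gone so it releases the lock on the volume, then:
kubectl scale deployment prometheus-server --replicas=1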