helm-charts
[prometheus] Crashloop in Prometheus Server when upgrading due to file lock
Describe the bug
The default configuration of the Prometheus chart results in a crashloop in the Prometheus server when upgrading. This is due to two factors in the default configuration:
- the Prometheus server runs as a Deployment (can be changed using server.statefulSet.enabled=true)
- Prometheus uses a file lock in its storage (didn't test, but it looks like this can be disabled by adding storage.tsdb.no-lockfile to server.extraFlags)
The combination of these 2 causes an upgrade to fail because the old pod (which holds the lock) is only terminated when the new pod (waiting for the lock to be released) is ready.
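For illustration, either workaround can be set at install/upgrade time. A minimal sketch only; the no-lockfile route is untested as noted above, and --set on a list value replaces any extraFlags already defined in your values, so merge it with flags you already use:

# Option 1: run the server as a StatefulSet instead of a Deployment
helm upgrade --install prometheus prometheus-community/prometheus --set server.statefulSet.enabled=true

# Option 2 (untested): disable the TSDB lock file via server.extraFlags
# (keep any flags already present in your values, e.g. web.enable-lifecycle)
helm upgrade --install prometheus prometheus-community/prometheus --set server.extraFlags="{web.enable-lifecycle,storage.tsdb.no-lockfile}"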
I don't see any reason why server.statefulSet.enabled=true is not the default behavior.
What's your helm version?
version.BuildInfo{Version:"v3.9.0", GitCommit:"7ceeda6c585217a19a1131663d8cd1f7d641b2a7", GitTreeState:"clean", GoVersion:"go1.17.5"}
What's your kubectl version?
WARNING: This version information is deprecated and will be replaced with the output from kubectl version --short. Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"1", Minor:"24", GitVersion:"v1.24.1", GitCommit:"3ddd0f45aa91e2f30c70734b175631bec5b5825a", GitTreeState:"clean", BuildDate:"2022-05-24T12:26:19Z", GoVersion:"go1.18.2", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v4.5.4
Server Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.3", GitCommit:"816c97ab8cff8a1c72eccca1026f7820e93e0d25", GitTreeState:"clean", BuildDate:"2022-01-25T21:19:12Z", GoVersion:"go1.17.6", Compiler:"gc", Platform:"linux/amd64"}
Which chart?
prometheus
What's the chart version?
15.10.1
What happened?
A crashloop occurs:
NAME READY STATUS RESTARTS AGE
prometheus-alertmanager-67bf5f77bb-rd6tj 2/2 Running 0 17m
prometheus-kube-state-metrics-748fc7f64-m4mgb 1/1 Running 0 17m
prometheus-node-exporter-cnvl2 1/1 Running 0 17m
prometheus-pushgateway-b6c9dc7db-scgjz 1/1 Running 0 17m
prometheus-server-6bbf87b66f-rs2jr 2/2 Running 0 17m
prometheus-server-855d6fdfd9-tn9ht 1/2 CrashLoopBackOff 8 (50s ago) 16m
due to the following error in the updated pod:
ts=2022-06-17T12:40:51.887Z caller=main.go:516 level=info msg="Starting Prometheus" version="(version=2.34.0, branch=HEAD, revision=881111fec4332c33094a6fb2680c71fffc427275)"
ts=2022-06-17T12:40:51.887Z caller=main.go:521 level=info build_context="(go=go1.17.8, user=root@121ad7ea5487, date=20220315-15:18:00)"
ts=2022-06-17T12:40:51.888Z caller=main.go:522 level=info host_details="(Linux 5.13.0-48-generic #54~20.04.1-Ubuntu SMP Thu Jun 2 23:37:17 UTC 2022 x86_64 prometheus-server-855d6fdfd9-tn9ht (none))"
ts=2022-06-17T12:40:51.888Z caller=main.go:523 level=info fd_limits="(soft=1048576, hard=1048576)"
ts=2022-06-17T12:40:51.888Z caller=main.go:524 level=info vm_limits="(soft=unlimited, hard=unlimited)"
ts=2022-06-17T12:40:51.889Z caller=web.go:540 level=info component=web msg="Start listening for connections" address=0.0.0.0:9090
ts=2022-06-17T12:40:51.889Z caller=main.go:937 level=info msg="Starting TSDB ..."
ts=2022-06-17T12:40:51.889Z caller=dir_locker.go:77 level=warn component=tsdb msg="A lockfile from a previous execution already existed. It was replaced" file=/data/lock
ts=2022-06-17T12:40:51.889Z caller=main.go:799 level=info msg="Stopping scrape discovery manager..."
ts=2022-06-17T12:40:51.889Z caller=main.go:813 level=info msg="Stopping notify discovery manager..."
ts=2022-06-17T12:40:51.889Z caller=main.go:835 level=info msg="Stopping scrape manager..."
ts=2022-06-17T12:40:51.889Z caller=main.go:809 level=info msg="Notify discovery manager stopped"
ts=2022-06-17T12:40:51.889Z caller=main.go:795 level=info msg="Scrape discovery manager stopped"
ts=2022-06-17T12:40:51.889Z caller=manager.go:946 level=info component="rule manager" msg="Stopping rule manager..."
ts=2022-06-17T12:40:51.889Z caller=manager.go:956 level=info component="rule manager" msg="Rule manager stopped"
ts=2022-06-17T12:40:51.889Z caller=notifier.go:600 level=info component=notifier msg="Stopping notification manager..."
ts=2022-06-17T12:40:51.889Z caller=main.go:1068 level=info msg="Notifier manager stopped"
ts=2022-06-17T12:40:51.889Z caller=main.go:829 level=info msg="Scrape manager stopped"
ts=2022-06-17T12:40:51.889Z caller=main.go:1077 level=error err="opening storage failed: lock DB directory: resource temporarily unavailable"
What you expected to happen?
I expect the default configuration to work out of the box.
How to reproduce it?
helm install prometheus prometheus-community/prometheus
kubectl set env deployment/prometheus-server test=xxx
Enter the changed values of values.yaml?
NONE
Enter the command that you execute and failing/misfunctioning.
helm install prometheus prometheus-community/prometheus
kubectl set env deployment/prometheus-server test=xxx
Anything else we need to know?
No response
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Any further update will cause the issue/pull request to no longer be considered stale. Thank you for your contributions.
Though I didn't see this issue when upgrading Prometheus, it suddenly occurred yesterday, out of nowhere, on a running Prometheus (which is already a StatefulSet)!
Message: ="Found healthy block" mint=1659859200000 maxt=1659866400000 ulid=01G9VZCCWFY7GZAQ3AS5G59MWH
ts=2022-08-07T18:50:49.999Z caller=repair.go:57 level=info component=tsdb msg="Found healthy block" mint=1659866400000 maxt=1659873600000 ulid=01G9W684526MMF69JZJKN15ZGX
ts=2022-08-07T18:50:49.999Z caller=repair.go:57 level=info component=tsdb msg="Found healthy block" mint=1659873600000 maxt=1659880800000 ulid=01G9WD3VCC7KEK6QKTFCFA5BN2
ts=2022-08-07T18:50:49.999Z caller=repair.go:57 level=info component=tsdb msg="Found healthy block" mint=1659880800000 maxt=1659888000000 ulid=01G9WKZJM4H6R8VR1B5MN75BAW
ts=2022-08-07T18:50:49.999Z caller=tls_config.go:231 level=info component=web msg="TLS is disabled." http2=false
ts=2022-08-07T18:50:50.001Z caller=dir_locker.go:77 level=warn component=tsdb msg="A lockfile from a previous execution already existed. It was replaced" file=/prometheus/lock
ts=2022-08-07T18:50:50.001Z caller=main.go:798 level=info msg="Stopping scrape discovery manager..."
ts=2022-08-07T18:50:50.001Z caller=main.go:812 level=info msg="Stopping notify discovery manager..."
ts=2022-08-07T18:50:50.001Z caller=main.go:834 level=info msg="Stopping scrape manager..."
ts=2022-08-07T18:50:50.001Z caller=main.go:808 level=info msg="Notify discovery manager stopped"
ts=2022-08-07T18:50:50.001Z caller=main.go:794 level=info msg="Scrape discovery manager stopped"
ts=2022-08-07T18:50:50.001Z caller=main.go:828 level=info msg="Scrape manager stopped"
ts=2022-08-07T18:50:50.001Z caller=manager.go:945 level=info component="rule manager" msg="Stopping rule manager..."
ts=2022-08-07T18:50:50.001Z caller=manager.go:955 level=info component="rule manager" msg="Rule manager stopped"
ts=2022-08-07T18:50:50.001Z caller=notifier.go:600 level=info component=notifier msg="Stopping notification manager..."
ts=2022-08-07T18:50:50.001Z caller=main.go:1054 level=info msg="Notifier manager stopped"
ts=2022-08-07T18:50:50.001Z caller=main.go:1063 level=error err="opening storage failed: lock DB directory: resource temporarily unavailable"
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Any further update will cause the issue/pull request to no longer be considered stale. Thank you for your contributions.
This issue is being automatically closed due to inactivity.
Had the same issue
Similar issue, but it doesn't seem like anyone is helping.
having the same issue with prometheus v2.36.2
also experiencing this on 2.36.1
having the same issue with prometheus v2.39.1
We also faced this issue (a StatefulSet is in use). We were able to mitigate it the following way (a rough command sketch follows the list):
- downscaling the Prometheus operator and the Prometheus StatefulSet
- mounting the PVCs on temporary pods
- removing the lock files (located at the top level) on all Prometheus PVCs
- rescaling the operator and the StatefulSet
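A rough kubectl sketch of these steps; the prometheus-operator, prometheus-server, and PVC claim names below are placeholders, so substitute your own release names and namespace:

# stop the operator from reconciling, then take the server down so nothing holds the volume
kubectl scale deployment prometheus-operator --replicas=0
kubectl scale statefulset prometheus-server --replicas=0

# mount the PVC in a throwaway pod and delete the top-level lock file
# (the lock path may be /data/lock or /prometheus/lock depending on the chart; both appear in the logs above)
kubectl run lock-cleanup --restart=Never --image=busybox --overrides='{"apiVersion":"v1","spec":{"containers":[{"name":"lock-cleanup","image":"busybox","command":["rm","-f","/prometheus/lock"],"volumeMounts":[{"name":"data","mountPath":"/prometheus"}]}],"volumes":[{"name":"data","persistentVolumeClaim":{"claimName":"prometheus-server"}}]}}'
# wait for the pod to complete, then clean it up
kubectl delete pod lock-cleanup

# bring everything back
kubectl scale statefulset prometheus-server --replicas=1
kubectl scale deployment prometheus-operator --replicas=1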
This was due to rpc-statd not running on the node that the pod was scheduled on, so NFS locking didn't work.
Check whether that applies to you; if so, just run sudo systemctl start rpc-statd.
You might want to add some alerts to ensure that service stays up as well.
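A quick check on the affected node (assumes a systemd host with NFS-backed storage; using enable --now so the service also comes back after a reboot is an addition beyond the plain start above):

systemctl is-active rpc-statd || sudo systemctl enable --now rpc-statd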
How do I do this in Docker? Prometheus version 2.42.
How to reproduce? You can easily do so by running the Prometheus Helm upgrade command after making changes that touch the deployment template.
How to resolve? Simply scale your prometheus-server deployment down so the old pod releases the lock, then scale it back up.
Bug as per my findings? The existing Prometheus pod is already occupying its Persistent Volume (PV) and running smoothly. However, when the Helm chart is upgraded with changes to the deployment, a new pod is created. Since the previous pod is still occupying the PV, the new pod crashloops because it cannot take over the same PV, and the rollout never completes. This situation can lead to a deadlock condition.
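A minimal sketch of that workaround, assuming the default prometheus-server deployment name from the thread above:

kubectl scale deployment prometheus-server --replicas=0
# wait until the old pod is gone so it releases the lock on the volume, then:
kubectl scale deployment prometheus-server --replicas=1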