PMM Server 2.36.0 cannot restart successfully because PostgreSQL fails to start.
Description
I installed pxc-operator and pmm-server using helm chart 1.12.1. When PMM was first deployed, it started correctly, but after the pod restarted, the pg service kept failing.
2023-04-11 10:18:28,393 INFO exited: postgresql (exit status 1; not expected)
2023-04-11 10:18:28,546 INFO exited: qan-api2 (exit status 1; not expected)
2023-04-11 10:18:29,261 INFO success: pmm-update-perform-init entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2023-04-11 10:18:29,261 INFO success: clickhouse entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2023-04-11 10:18:29,261 INFO success: grafana entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2023-04-11 10:18:29,261 INFO success: nginx entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2023-04-11 10:18:29,292 INFO success: victoriametrics entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2023-04-11 10:18:29,308 INFO success: vmalert entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2023-04-11 10:18:29,308 INFO success: alertmanager entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2023-04-11 10:18:29,308 INFO success: vmproxy entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2023-04-11 10:18:29,308 INFO success: pmm-managed entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2023-04-11 10:18:29,308 INFO success: pmm-agent entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2023-04-11 10:18:29,422 INFO spawned: 'postgresql' with pid 153
2023-04-11 10:18:29,450 INFO exited: postgresql (exit status 1; not expected)
2023-04-11 10:18:29,568 INFO spawned: 'qan-api2' with pid 155
2023-04-11 10:18:30,561 INFO success: qan-api2 entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2023-04-11 10:18:31,570 INFO spawned: 'postgresql' with pid 185
2023-04-11 10:18:31,942 INFO exited: postgresql (exit status 1; not expected)
2023-04-11 10:18:35,111 INFO spawned: 'postgresql' with pid 231
2023-04-11 10:18:35,344 INFO exited: postgresql (exit status 1; not expected)
2023-04-11 10:18:39,669 INFO spawned: 'postgresql' with pid 260
2023-04-11 10:18:39,833 INFO exited: postgresql (exit status 1; not expected)
2023-04-11 10:18:44,966 INFO spawned: 'postgresql' with pid 344
2023-04-11 10:18:45,090 INFO exited: postgresql (exit status 1; not expected)
2023-04-11 10:18:46,956 INFO exited: pmm-update-perform-init (exit status 0; expected)
2023-04-11 10:18:52,051 INFO spawned: 'postgresql' with pid 396
2023-04-11 10:18:52,090 INFO exited: postgresql (exit status 1; not expected)
2023-04-11 10:18:59,145 INFO spawned: 'postgresql' with pid 397
2023-04-11 10:18:59,183 INFO exited: postgresql (exit status 1; not expected)
2023-04-11 10:19:07,246 INFO spawned: 'postgresql' with pid 399
2023-04-11 10:19:07,269 INFO exited: postgresql (exit status 1; not expected)
2023-04-11 10:19:16,478 INFO spawned: 'postgresql' with pid 402
2023-04-11 10:19:16,497 INFO exited: postgresql (exit status 1; not expected)
2023-04-11 10:19:26,712 INFO spawned: 'postgresql' with pid 404
2023-04-11 10:19:26,734 INFO exited: postgresql (exit status 1; not expected)
2023-04-11 10:19:27,713 INFO gave up: postgresql entered FATAL state, too many start retries too quickly
I checked the pg logs in /srv/logs and found that the pg data directory permissions are not correct:
2023-04-11 10:18:52.087 UTC [396] FATAL: data directory "/srv/postgres14" has invalid permissions
2023-04-11 10:18:52.087 UTC [396] DETAIL: Permissions should be u=rwx (0700) or u=rwx,g=rx (0750).
2023-04-11 10:18:59.179 UTC [397] FATAL: data directory "/srv/postgres14" has invalid permissions
2023-04-11 10:18:59.179 UTC [397] DETAIL: Permissions should be u=rwx (0700) or u=rwx,g=rx (0750).
2023-04-11 10:19:07.267 UTC [399] FATAL: data directory "/srv/postgres14" has invalid permissions
2023-04-11 10:19:07.267 UTC [399] DETAIL: Permissions should be u=rwx (0700) or u=rwx,g=rx (0750).
2023-04-11 10:19:16.495 UTC [402] FATAL: data directory "/srv/postgres14" has invalid permissions
2023-04-11 10:19:16.495 UTC [402] DETAIL: Permissions should be u=rwx (0700) or u=rwx,g=rx (0750).
2023-04-11 10:19:26.731 UTC [404] FATAL: data directory "/srv/postgres14" has invalid permissions
2023-04-11 10:19:26.731 UTC [404] DETAIL: Permissions should be u=rwx (0700) or u=rwx,g=rx (0750).
I used the following commands to change the pg directory permissions and start pg. Pg started after the first change, but after I restarted the pod, the directory permissions were changed again by an unknown script or program, which repeatedly caused the exception above.
chmod 700 -R /srv/postgres14
su postgres -c "/usr/pgsql-14/bin/pg_ctl start -D /srv/postgres14"
Expected Results
The postgres data directory permissions should not change across pod restarts, since 0700/0750 is a mandatory requirement for pg startup.
Actual Results
The pg directory permissions are changed on every pod restart, so postgresql fails to start.
Version
pmm-server and pmm-client 2.36, OKD 4.11
Steps to reproduce
No response
Relevant logs
I checked the /srv permissions and found that:
drwxrwsr-x. 13 root pmm 4096 Apr 6 03:30 .
dr-xr-xr-x. 1 root root 62 Apr 12 08:28 ..
drwxrwsr-x. 3 root pmm 4096 Apr 6 03:29 alerting
drwxrwsr-x. 4 pmm pmm 4096 Apr 6 03:29 alertmanager
drwxrwsr-x. 2 root pmm 4096 Apr 6 03:30 backup
drwxrwsr-x. 13 root pmm 4096 Apr 12 08:28 clickhouse
drwxrwsr-x. 6 grafana pmm 4096 Apr 12 08:28 grafana
drwxrwsr-x. 2 pmm pmm 4096 Apr 12 08:23 logs
drwxrws---. 2 root pmm 16384 Apr 6 03:29 lost+found
drwxrwsr-x. 2 root pmm 4096 Apr 6 03:29 nginx
-rw-rw-r--. 1 root pmm 7 Apr 6 03:29 pmm-distribution
drwxrws---. 20 postgres pmm 4096 Apr 12 00:00 postgres14
drwxrwsr-x. 3 pmm pmm 4096 Apr 6 03:29 prometheus
drwxrwsr-x. 3 pmm pmm 4096 Apr 6 03:29 victoriametrics
Code of Conduct
- [X] I agree to follow Percona Community Code of Conduct
I also tried changing the pg directory permissions and renaming the directory, and found that the permissions were still changed after restarting the pod. Is a script or program forcing the folder permissions to be updated?
Before restart:
drwxrwsr-x. 3 root pmm 4096 Apr 6 03:29 alerting
drwxrwsr-x. 4 pmm pmm 4096 Apr 6 03:29 alertmanager
drwxrwsr-x. 2 root pmm 4096 Apr 6 03:30 backup
drwxrwsr-x. 13 root pmm 4096 Apr 12 08:57 clickhouse
drwxrwsr-x. 6 grafana pmm 4096 Apr 12 08:57 grafana
drwxrwsr-x. 2 pmm pmm 4096 Apr 12 08:23 logs
drwxrws---. 2 root pmm 16384 Apr 6 03:29 lost+found
drwxrwsr-x. 2 root pmm 4096 Apr 6 03:29 nginx
-rw-rw-r--. 1 root pmm 7 Apr 6 03:29 pmm-distribution
drwx--S---. 20 postgres pmm 4096 Apr 12 00:00 postgres14-bak
drwxrwsr-x. 3 pmm pmm 4096 Apr 6 03:29 prometheus
drwxrwsr-x. 3 pmm pmm 4096 Apr 6 03:29 victoriametrics
After restart pod:
drwxrwsr-x. 3 root pmm 4096 Apr 6 03:29 alerting
drwxrwsr-x. 4 pmm pmm 4096 Apr 6 03:29 alertmanager
drwxrwsr-x. 2 root pmm 4096 Apr 6 03:30 backup
drwxrwsr-x. 13 root pmm 4096 Apr 12 09:05 clickhouse
drwxrwsr-x. 6 grafana pmm 4096 Apr 12 09:05 grafana
drwxrwsr-x. 2 pmm pmm 4096 Apr 12 08:23 logs
drwxrws---. 2 root pmm 16384 Apr 6 03:29 lost+found
drwxrwsr-x. 2 root pmm 4096 Apr 6 03:29 nginx
-rw-rw-r--. 1 root pmm 7 Apr 6 03:29 pmm-distribution
drwxrws---. 20 postgres pmm 4096 Apr 12 00:00 postgres14-bak
drwxrwsr-x. 3 pmm pmm 4096 Apr 6 03:29 prometheus
drwxrwsr-x. 3 pmm pmm 4096 Apr 6 03:29 victoriametrics
Hi @cdmikechen, what version of the helm chart (pmm chart version) and which repo do you use for PMM?
There are a couple of things that could change those permissions - an init container, the storage provisioner, or some update procedure.
As you said you use OKD - we don't officially support OpenShift yet, as PMM requires root in the container.
Why was the pod restarted? Did you run some update procedure?
Thanks, Denys
@denisok
The reason for killing the pod was that I wanted to test whether pmm-server would work after a restart.
I have solved this issue: the problem occurred because I had added an fsGroup to the container. After removing it, pmm-server started normally.
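For reference, this is roughly the kind of pod-level securityContext that triggered it. With fsGroup set and fsGroupChangePolicy left at its default (Always), the kubelet recursively changes the group ownership of the mounted /srv volume and adds group rw plus the setgid bit on every mount, which produces the drwxrws--- mode on /srv/postgres14 shown above and breaks PostgreSQL's 0700/0750 requirement. A minimal sketch (the group id 1001 is illustrative only, not the value from my deployment):
securityContext:
  # with the default fsGroupChangePolicy ("Always"), the whole volume is
  # re-chgrp'ed and re-chmod'ed (g+rw, setgid on directories) on every pod start
  fsGroup: 1001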
However, there is another problem: pmm-client fails several times after every Percona pod restart, and the pod only starts working after a few error restarts. I don't understand the reason for this.
@cdmikechen
what version of the helm chart (pmm chart version) and which repo do you use for PMM?
What logs and events show for that pod and all the containers in it?
@denisok The helm chart version is 1.2.1. Here are the pmm-client logs:
INFO[2023-04-21T17:37:15.410+08:00] Run setup: true Sidecar mode: true component=entrypoint
INFO[2023-04-21T17:37:15.410+08:00] Starting pmm-agent for liveness probe... component=entrypoint
INFO[2023-04-21T17:37:15.410+08:00] Starting 'pmm-admin setup'... component=entrypoint
INFO[2023-04-21T17:37:15.552+08:00] Loading configuration file /usr/local/percona/pmm2/config/pmm-agent.yaml. component=main
INFO[2023-04-21T17:37:15.553+08:00] Using /usr/local/percona/pmm2/exporters/node_exporter component=main
INFO[2023-04-21T17:37:15.553+08:00] Using /usr/local/percona/pmm2/exporters/mysqld_exporter component=main
INFO[2023-04-21T17:37:15.553+08:00] Using /usr/local/percona/pmm2/exporters/mongodb_exporter component=main
INFO[2023-04-21T17:37:15.553+08:00] Using /usr/local/percona/pmm2/exporters/postgres_exporter component=main
INFO[2023-04-21T17:37:15.553+08:00] Using /usr/local/percona/pmm2/exporters/proxysql_exporter component=main
INFO[2023-04-21T17:37:15.553+08:00] Using /usr/local/percona/pmm2/exporters/rds_exporter component=main
INFO[2023-04-21T17:37:15.553+08:00] Using /usr/local/percona/pmm2/exporters/azure_exporter component=main
INFO[2023-04-21T17:37:15.553+08:00] Using /usr/local/percona/pmm2/exporters/vmagent component=main
INFO[2023-04-21T17:37:15.553+08:00] Runner capacity set to 32. component=runner
INFO[2023-04-21T17:37:15.553+08:00] Loading configuration file /usr/local/percona/pmm2/config/pmm-agent.yaml. component=main
INFO[2023-04-21T17:37:15.553+08:00] Using /usr/local/percona/pmm2/exporters/node_exporter component=main
INFO[2023-04-21T17:37:15.553+08:00] Using /usr/local/percona/pmm2/exporters/mysqld_exporter component=main
INFO[2023-04-21T17:37:15.553+08:00] Using /usr/local/percona/pmm2/exporters/mongodb_exporter component=main
INFO[2023-04-21T17:37:15.553+08:00] Using /usr/local/percona/pmm2/exporters/postgres_exporter component=main
INFO[2023-04-21T17:37:15.553+08:00] Using /usr/local/percona/pmm2/exporters/proxysql_exporter component=main
INFO[2023-04-21T17:37:15.553+08:00] Using /usr/local/percona/pmm2/exporters/rds_exporter component=main
INFO[2023-04-21T17:37:15.553+08:00] Using /usr/local/percona/pmm2/exporters/azure_exporter component=main
INFO[2023-04-21T17:37:15.553+08:00] Using /usr/local/percona/pmm2/exporters/vmagent component=main
INFO[2023-04-21T17:37:15.554+08:00] Window check connection time is 1.00 hour(s)
INFO[2023-04-21T17:37:15.554+08:00] Starting... component=client
ERRO[2023-04-21T17:37:15.554+08:00] Agent ID is not provided, halting. component=client
INFO[2023-04-21T17:37:15.554+08:00] Starting local API server on http://0.0.0.0:7777/ ... component=local-server/JSON
INFO[2023-04-21T17:37:15.556+08:00] Started. component=local-server/JSON
INFO[2023-04-21T17:37:15.559+08:00] Loading configuration file /usr/local/percona/pmm2/config/pmm-agent.yaml. component=setup
INFO[2023-04-21T17:37:15.559+08:00] Using /usr/local/percona/pmm2/exporters/node_exporter component=setup
INFO[2023-04-21T17:37:15.559+08:00] Using /usr/local/percona/pmm2/exporters/mysqld_exporter component=setup
INFO[2023-04-21T17:37:15.559+08:00] Using /usr/local/percona/pmm2/exporters/mongodb_exporter component=setup
INFO[2023-04-21T17:37:15.559+08:00] Using /usr/local/percona/pmm2/exporters/postgres_exporter component=setup
INFO[2023-04-21T17:37:15.559+08:00] Using /usr/local/percona/pmm2/exporters/proxysql_exporter component=setup
INFO[2023-04-21T17:37:15.559+08:00] Using /usr/local/percona/pmm2/exporters/rds_exporter component=setup
INFO[2023-04-21T17:37:15.559+08:00] Using /usr/local/percona/pmm2/exporters/azure_exporter component=setup
INFO[2023-04-21T17:37:15.559+08:00] Using /usr/local/percona/pmm2/exporters/vmagent component=setup
Checking local pmm-agent status...
pmm-agent is running.
Registering pmm-agent on PMM Server...
Registered.
Configuration file /usr/local/percona/pmm2/config/pmm-agent.yaml updated.
Reloading pmm-agent configuration...
INFO[2023-04-21T17:37:15.887+08:00] Loading configuration file /usr/local/percona/pmm2/config/pmm-agent.yaml. component=local-server
INFO[2023-04-21T17:37:15.888+08:00] Using /usr/local/percona/pmm2/exporters/node_exporter component=local-server
INFO[2023-04-21T17:37:15.888+08:00] Using /usr/local/percona/pmm2/exporters/mysqld_exporter component=local-server
INFO[2023-04-21T17:37:15.888+08:00] Using /usr/local/percona/pmm2/exporters/mongodb_exporter component=local-server
INFO[2023-04-21T17:37:15.888+08:00] Using /usr/local/percona/pmm2/exporters/postgres_exporter component=local-server
INFO[2023-04-21T17:37:15.888+08:00] Using /usr/local/percona/pmm2/exporters/proxysql_exporter component=local-server
INFO[2023-04-21T17:37:15.888+08:00] Using /usr/local/percona/pmm2/exporters/rds_exporter component=local-server
INFO[2023-04-21T17:37:15.888+08:00] Using /usr/local/percona/pmm2/exporters/azure_exporter component=local-server
INFO[2023-04-21T17:37:15.888+08:00] Using /usr/local/percona/pmm2/exporters/vmagent component=local-server
INFO[2023-04-21T17:37:15.888+08:00] Stopped. component=local-server/JSON
INFO[2023-04-21T17:37:15.890+08:00] Done. component=local-server
INFO[2023-04-21T17:37:15.890+08:00] Done. component=supervisor
INFO[2023-04-21T17:37:15.890+08:00] Done. component=main
Checking local pmm-agent status...
pmm-agent is not running.
INFO[2023-04-21T17:37:20.901+08:00] 'pmm-admin setup' exited with 0 component=entrypoint
INFO[2023-04-21T17:37:20.901+08:00] Stopping pmm-agent... component=entrypoint
FATA[2023-04-21T17:37:20.901+08:00] Failed to kill pmm-agent: os: process already finished component=entrypoint
Hi. I think the pmm-client failure is very similar to this issue that I created: https://jira.percona.com/browse/PMM-11893
I ran into the same issue with pmm-server using helm chart version 1.2.5 and pmm-server 2.39.0. I did not set any security context in the helm chart values, and the deployed StatefulSet had them empty.
I then learned that our k8s cluster applies a default security context at both the pod and container level; here is the pod security context:
securityContext:
  fsGroup: 1
  seccompProfile:
    type: RuntimeDefault
  supplementalGroups:
  - 1
After a restart, this is what the /srv permissions would look like:
[root@ads-pmm-stage-0-0 opt] # ls -alh /srv
total 72K
drwxrwsr-x. 13 root bin 4.0K Aug 22 04:47 .
dr-xr-xr-x. 1 root root 4.0K Aug 22 04:54 ..
drwxrwsr-x. 3 root bin 4.0K Aug 22 04:47 alerting
drwxrwsr-x. 4 pmm bin 4.0K Aug 22 04:47 alertmanager
drwxrwsr-x. 2 root bin 4.0K Aug 22 04:47 backup
drwxrwsr-x. 13 root bin 4.0K Aug 22 04:54 clickhouse
drwxrwsr-x. 6 grafana bin 4.0K Aug 22 04:54 grafana
drwxrwsr-x. 2 pmm bin 4.0K Aug 22 04:46 logs
drwxrws---. 2 root bin 16K Aug 22 04:46 lost+found
drwxrwsr-x. 2 root bin 4.0K Aug 22 04:46 nginx
-rw-rw-r--. 1 root bin 7 Aug 22 04:46 pmm-distribution
drwxrws---. 20 postgres bin 4.0K Aug 22 04:52 postgres14
drwxrwsr-x. 3 pmm bin 4.0K Aug 22 04:46 prometheus
drwxrwsr-x. 3 pmm bin 4.0K Aug 22 04:46 victoriametrics
After some trial and error, I found that this helm chart value allowed PMM to survive restarts:
podSecurityContext:
  fsGroupChangePolicy: OnRootMismatch
The effective pod security context:
securityContext:
  fsGroup: 1
  fsGroupChangePolicy: OnRootMismatch
  seccompProfile:
    type: RuntimeDefault
  supplementalGroups:
  - 1
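For context, fsGroupChangePolicy: OnRootMismatch tells the kubelet to skip the recursive ownership/permission change when the root of the volume already matches the expected fsGroup, so /srv/postgres14 keeps its 0700/0750 mode across restarts. A minimal values override sketch (assuming the chart's podSecurityContext value is passed through to the pod spec, as the snippet above suggests):
# values override (sketch); the only change from the defaults is the policy
podSecurityContext:
  # skip the recursive chown/chmod when the volume root already has the expected group
  fsGroupChangePolicy: OnRootMismatch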
Starting fresh, this is what /srv looked like on first boot:
[root@ads-pmm-stage-0-0 opt] # ls -alh /srv
total 72K
drwxrwsr-x. 13 root bin 4.0K Aug 22 19:48 .
dr-xr-xr-x. 1 root root 4.0K Aug 22 19:47 ..
drwxr-sr-x. 3 root bin 4.0K Aug 22 19:47 alerting
drwxrwxr-x. 4 pmm pmm 4.0K Aug 22 19:47 alertmanager
drwxr-sr-x. 2 root bin 4.0K Aug 22 19:48 backup
drwxr-sr-x. 13 root bin 4.0K Aug 22 19:47 clickhouse
drwxr-sr-x. 6 grafana render 4.0K Aug 22 19:48 grafana
drwxr-sr-x. 2 pmm pmm 4.0K Aug 22 19:47 logs
drwxrws---. 2 root bin 16K Aug 22 19:47 lost+found
drwxr-sr-x. 2 root bin 4.0K Aug 22 19:47 nginx
-rw-r--r--. 1 root bin 7 Aug 22 19:47 pmm-distribution
drwx------. 20 postgres postgres 4.0K Aug 22 19:47 postgres14
drwxr-sr-x. 3 pmm pmm 4.0K Aug 22 19:47 prometheus
drwxrwxr-x. 3 pmm pmm 4.0K Aug 22 19:47 victoriametrics
and after a reboot:
[root@ads-pmm-stage-0-0 opt] # ls -alh /srv
total 72K
drwxrwsr-x. 13 root bin 4.0K Aug 22 19:48 .
dr-xr-xr-x. 1 root root 4.0K Aug 22 19:53 ..
drwxr-sr-x. 3 root bin 4.0K Aug 22 19:47 alerting
drwxrwxr-x. 4 pmm pmm 4.0K Aug 22 19:47 alertmanager
drwxr-sr-x. 2 root bin 4.0K Aug 22 19:48 backup
drwxr-sr-x. 13 root bin 4.0K Aug 22 19:54 clickhouse
drwxr-sr-x. 6 grafana render 4.0K Aug 22 19:53 grafana
drwxr-sr-x. 2 pmm pmm 4.0K Aug 22 19:47 logs
drwxrws---. 2 root bin 16K Aug 22 19:47 lost+found
drwxr-sr-x. 2 root bin 4.0K Aug 22 19:47 nginx
-rw-r--r--. 1 root bin 7 Aug 22 19:47 pmm-distribution
drwx------. 20 postgres postgres 4.0K Aug 22 19:53 postgres14
drwxr-sr-x. 3 pmm pmm 4.0K Aug 22 19:47 prometheus
drwxrwxr-x. 3 pmm pmm 4.0K Aug 22 19:47 victoriametrics
I hope there are plans to support running without root.