postgres-operator
Postgres operator does not recover instances if the K8S API is temporarily unavailable
Please, answer some short questions which should help us to understand your problem / question better?
- Which image of the operator are you using? registry.opensource.zalan.do/acid/postgres-operator:v1.7.1
- Where do you run it - cloud or metal? Kubernetes or OpenShift? Syseleven Metakube (Openshift)
- Are you running Postgres Operator in production? yes
- Type of issue? Bug report
We're using spilo image registry.opensource.zalan.do/acid/spilo-13:2.1-p1 and postgres version 13.
Unfortunately, our managed Kubernetes provider does not have the best availability for the K8s API.
If the API is temporarily unavailable, one, several, or all Patroni cluster members fail and can't recover.
If the cluster status is checked with patronictl list, the affected members report "start failed".
The reason they can't (re)start seems to be that the PVC permissions somehow get messed up when the API problem occurs:
2021-11-24 03:18:18 UTC [159287]: [1-1] 619daefa.26e37 0 FATAL: data directory "/home/postgres/pgdata/pgroot/data" has invalid permissions
2021-11-24 03:18:18 UTC [159287]: [2-1] 619daefa.26e37 0 DETAIL: Permissions should be u=rwx (0700) or u=rwx,g=rx (0750).
If the instance is a replica, you can fix the problem with patronictl reinit or by restarting the pod.
Unfortunately, this can also happen to the leader. That, too, can be fixed by restarting the pod.
When all 3 nodes become unready, only a kubectl -n <namespace> delete po -l application=spilo --force --grace-period=0 can recover the cluster.
The main problem here is that the Patroni cluster is not able to recover from such an issue by itself. If the leader is in "start failed" status, I would expect one of the available replicas to become the new master.
I also don't understand why an unavailable K8S API leads to changes in the file permissions of a cluster member at all.
This is the complete Postgres log of such a case:
2021-11-24 03:17:00.804 31 LOG {ticks: 0, maint: 0, retry: 0}
2021-11-24 03:17:07,516 INFO: no action. I am (auth-postgres-db-1) the leader with the lock
2021-11-24 03:17:17,518 INFO: no action. I am (auth-postgres-db-1) the leader with the lock
2021-11-24 03:17:27,512 INFO: no action. I am (auth-postgres-db-1) the leader with the lock
2021-11-24 03:17:30.820 31 LOG {ticks: 0, maint: 0, retry: 0}
2021-11-24 03:17:38,183 INFO: no action. I am (auth-postgres-db-1) the leader with the lock
2021-11-24 03:17:47,515 INFO: no action. I am (auth-postgres-db-1) the leader with the lock
2021-11-24 03:17:57,519 INFO: no action. I am (auth-postgres-db-1) the leader with the lock
2021-11-24 03:17:59,854 ERROR: ObjectCache.run ProtocolError("Connection broken: TimeoutError(110, 'Connection timed out')", TimeoutError(110, 'Connection timed out'))
2021-11-24 03:18:00.820 31 LOG {ticks: 0, maint: 0, retry: 0}
2021-11-24 03:18:02,357 WARNING: Retrying (Retry(total=0, connect=None, read=None, redirect=0, status=None)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.VerifiedHTTPSConnection object at 0x7f0416e544e0>, 'Connection to 10.10.10.1 timed out. (connect timeout=2.5)')': /api/v1/namespaces/auth/endpoints?labelSelector=application%3Dspilo%2Ccluster-name%3Dauth-postgres-db
2021-11-24 03:18:04,862 ERROR: Request to server https://10.10.10.1:443 failed: MaxRetryError("HTTPSConnectionPool(host='10.10.10.1', port=443): Max retries exceeded with url: /api/v1/namespaces/auth/endpoints?labelSelector=application%3Dspilo%2Ccluster-name%3Dauth-postgres-db (Caused by ConnectTimeoutError(<urllib3.connection.VerifiedHTTPSConnection object at 0x7f0414abd518>, 'Connection to 10.10.10.1 timed out. (connect timeout=2.5)'))",)
2021-11-24 03:18:05,863 ERROR: ObjectCache.run TypeError("unsupported operand type(s) for -: 'NoneType' and 'float'",)
2021-11-24 03:18:09,003 WARNING: Retrying (Retry(total=0, connect=None, read=None, redirect=0, status=None)) after connection broken by 'ReadTimeoutError("HTTPSConnectionPool(host='10.10.10.1', port=443): Read timed out. (read timeout=2.5)",)': /api/v1/namespaces/auth/endpoints?labelSelector=application%3Dspilo%2Ccluster-name%3Dauth-postgres-db
2021-11-24 03:18:11,505 ERROR: Request to server https://10.10.10.1:443 failed: MaxRetryError("HTTPSConnectionPool(host='10.10.10.1', port=443): Max retries exceeded with url: /api/v1/namespaces/auth/endpoints?labelSelector=application%3Dspilo%2Ccluster-name%3Dauth-postgres-db (Caused by ConnectTimeoutError(<urllib3.connection.VerifiedHTTPSConnection object at 0x7f0416e546d8>, 'Connection to 10.10.10.1 timed out. (connect timeout=2.5)'))",)
2021-11-24 03:18:12,507 ERROR: ObjectCache.run TypeError("unsupported operand type(s) for -: 'NoneType' and 'float'",)
2021-11-24 03:18:15,011 WARNING: Retrying (Retry(total=0, connect=None, read=None, redirect=0, status=None)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.VerifiedHTTPSConnection object at 0x7f0416e54358>, 'Connection to 10.10.10.1 timed out. (connect timeout=2.5)')': /api/v1/namespaces/auth/endpoints?labelSelector=application%3Dspilo%2Ccluster-name%3Dauth-postgres-db
2021-11-24 03:18:17,498 ERROR: get_cluster
2021-11-24 03:18:17,500 ERROR: Error communicating with DCS
2021-11-24 03:18:17,515 ERROR: Request to server https://10.10.10.1:443 failed: MaxRetryError("HTTPSConnectionPool(host='10.10.10.1', port=443): Max retries exceeded with url: /api/v1/namespaces/auth/endpoints?labelSelector=application%3Dspilo%2Ccluster-name%3Dauth-postgres-db (Caused by ConnectTimeoutError(<urllib3.connection.VerifiedHTTPSConnection object at 0x7f0416e545c0>, 'Connection to 10.10.10.1 timed out. (connect timeout=2.5)'))",)
2021-11-24 03:18:18,046 INFO: closed patroni connection to the postgresql cluster
2021-11-24 03:18:18,318 INFO: postmaster pid=159287
2021-11-24 03:18:18 UTC [159287]: [1-1] 619daefa.26e37 0 FATAL: data directory "/home/postgres/pgdata/pgroot/data" has invalid permissions
2021-11-24 03:18:18 UTC [159287]: [2-1] 619daefa.26e37 0 DETAIL: Permissions should be u=rwx (0700) or u=rwx,g=rx (0750).
2021-11-24 03:18:18,517 ERROR: ObjectCache.run TypeError("unsupported operand type(s) for -: 'NoneType' and 'float'",)
2021-11-24 03:18:19,347 ERROR: postmaster is not running
2021-11-24 03:18:19,348 INFO: demoted self because DCS is not accessible and i was a leader
2021-11-24 03:18:19,351 WARNING: Loop time exceeded, rescheduling immediately.
2021-11-24 03:18:21,022 WARNING: Retrying (Retry(total=0, connect=None, read=None, redirect=0, status=None)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.VerifiedHTTPSConnection object at 0x7f0414a27fd0>, 'Connection to 10.10.10.1 timed out. (connect timeout=2.5)')': /api/v1/namespaces/auth/endpoints?labelSelector=application%3Dspilo%2Ccluster-name%3Dauth-postgres-db
2021-11-24 03:18:23,526 ERROR: Request to server https://10.10.10.1:443 failed: MaxRetryError("HTTPSConnectionPool(host='10.10.10.1', port=443): Max retries exceeded with url: /api/v1/namespaces/auth/endpoints?labelSelector=application%3Dspilo%2Ccluster-name%3Dauth-postgres-db (Caused by ConnectTimeoutError(<urllib3.connection.VerifiedHTTPSConnection object at 0x7f0414a27f60>, 'Connection to 10.10.10.1 timed out. (connect timeout=2.5)'))",)
2021-11-24 03:18:24,527 ERROR: ObjectCache.run TypeError("unsupported operand type(s) for -: 'NoneType' and 'float'",)
2021-11-24 03:18:27,031 WARNING: Retrying (Retry(total=0, connect=None, read=None, redirect=0, status=None)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.VerifiedHTTPSConnection object at 0x7f0414a9bf28>, 'Connection to 10.10.10.1 timed out. (connect timeout=2.5)')': /api/v1/namespaces/auth/endpoints?labelSelector=application%3Dspilo%2Ccluster-name%3Dauth-postgres-db
2021-11-24 03:18:29,351 ERROR: get_cluster
2021-11-24 03:18:29,352 ERROR: Error communicating with DCS
2021-11-24 03:18:29,353 INFO: DCS is not accessible
2021-11-24 03:18:29,354 WARNING: Loop time exceeded, rescheduling immediately.
2021-11-24 03:18:29,535 ERROR: Request to server https://10.10.10.1:443 failed: MaxRetryError("HTTPSConnectionPool(host='10.10.10.1', port=443): Max retries exceeded with url: /api/v1/namespaces/auth/endpoints?labelSelector=application%3Dspilo%2Ccluster-name%3Dauth-postgres-db (Caused by ConnectTimeoutError(<urllib3.connection.VerifiedHTTPSConnection object at 0x7f0414a9bac8>, 'Connection to 10.10.10.1 timed out. (connect timeout=2.5)'))",)
2021-11-24 03:18:30,537 ERROR: ObjectCache.run TypeError("unsupported operand type(s) for -: 'NoneType' and 'float'",)
2021-11-24 03:18:30.843 31 LOG {ticks: 0, maint: 0, retry: 0}
2021-11-24 03:18:33,041 WARNING: Retrying (Retry(total=0, connect=None, read=None, redirect=0, status=None)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.VerifiedHTTPSConnection object at 0x7f0416e1d5f8>, 'Connection to 10.10.10.1 timed out. (connect timeout=2.5)')': /api/v1/namespaces/auth/endpoints?labelSelector=application%3Dspilo%2Ccluster-name%3Dauth-postgres-db
2021-11-24 03:18:36,551 ERROR: Request to server https://10.10.10.1:443 failed: MaxRetryError('HTTPSConnectionPool(host=\'10.10.10.1\', port=443): Max retries exceeded with url: /api/v1/namespaces/auth/endpoints?labelSelector=application%3Dspilo%2Ccluster-name%3Dauth-postgres-db (Caused by ReadTimeoutError("HTTPSConnectionPool(host=\'10.10.10.1\', port=443): Read timed out. (read timeout=2.5)",))',)
2021-11-24 03:18:37,552 ERROR: ObjectCache.run TypeError("unsupported operand type(s) for -: 'NoneType' and 'float'",)
2021-11-24 03:18:39,354 ERROR: get_cluster
2021-11-24 03:18:39,355 ERROR: Error communicating with DCS
2021-11-24 03:18:39,355 INFO: DCS is not accessible
2021-11-24 03:18:39,356 WARNING: Loop time exceeded, rescheduling immediately.
2021-11-24 03:18:40,055 WARNING: Retrying (Retry(total=0, connect=None, read=None, redirect=0, status=None)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.VerifiedHTTPSConnection object at 0x7f0416e1d4a8>, 'Connection to 10.10.10.1 timed out. (connect timeout=2.5)')': /api/v1/namespaces/auth/endpoints?labelSelector=application%3Dspilo%2Ccluster-name%3Dauth-postgres-db
2021-11-24 03:18:42,559 ERROR: Request to server https://10.10.10.1:443 failed: MaxRetryError("HTTPSConnectionPool(host='10.10.10.1', port=443): Max retries exceeded with url: /api/v1/namespaces/auth/endpoints?labelSelector=application%3Dspilo%2Ccluster-name%3Dauth-postgres-db (Caused by ConnectTimeoutError(<urllib3.connection.VerifiedHTTPSConnection object at 0x7f04149f5128>, 'Connection to 10.10.10.1 timed out. (connect timeout=2.5)'))",)
2021-11-24 03:18:43,561 ERROR: ObjectCache.run TypeError("unsupported operand type(s) for -: 'NoneType' and 'float'",)
2021-11-24 03:18:46,996 WARNING: Retrying (Retry(total=0, connect=None, read=None, redirect=0, status=None)) after connection broken by 'ReadTimeoutError("HTTPSConnectionPool(host='10.10.10.1', port=443): Read timed out. (read timeout=2.5)",)': /api/v1/namespaces/auth/endpoints?labelSelector=application%3Dspilo%2Ccluster-name%3Dauth-postgres-db
2021-11-24 03:18:49,356 ERROR: get_cluster
2021-11-24 03:18:49,357 ERROR: Error communicating with DCS
2021-11-24 03:18:49,357 INFO: DCS is not accessible
2021-11-24 03:18:49,359 WARNING: Loop time exceeded, rescheduling immediately.
2021-11-24 03:18:49,500 ERROR: Request to server https://10.10.10.1:443 failed: MaxRetryError("HTTPSConnectionPool(host='10.10.10.1', port=443): Max retries exceeded with url: /api/v1/namespaces/auth/endpoints?labelSelector=application%3Dspilo%2Ccluster-name%3Dauth-postgres-db (Caused by ConnectTimeoutError(<urllib3.connection.VerifiedHTTPSConnection object at 0x7f0414a68668>, 'Connection to 10.10.10.1 timed out. (connect timeout=2.5)'))",)
2021-11-24 03:18:50,501 ERROR: ObjectCache.run TypeError("unsupported operand type(s) for -: 'NoneType' and 'float'",)
2021-11-24 03:18:53,005 WARNING: Retrying (Retry(total=0, connect=None, read=None, redirect=0, status=None)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.VerifiedHTTPSConnection object at 0x7f0414a68048>, 'Connection to 10.10.10.1 timed out. (connect timeout=2.5)')': /api/v1/namespaces/auth/endpoints?labelSelector=application%3Dspilo%2Ccluster-name%3Dauth-postgres-db
2021-11-24 03:18:55,506 ERROR: Request to server https://10.10.10.1:443 failed: MaxRetryError("HTTPSConnectionPool(host='10.10.10.1', port=443): Max retries exceeded with url: /api/v1/namespaces/auth/endpoints?labelSelector=application%3Dspilo%2Ccluster-name%3Dauth-postgres-db (Caused by ConnectTimeoutError(<urllib3.connection.VerifiedHTTPSConnection object at 0x7f0414a68e10>, 'Connection to 10.10.10.1 timed out. (connect timeout=2.5)'))",)
2021-11-24 03:18:56,508 ERROR: ObjectCache.run TypeError("unsupported operand type(s) for -: 'NoneType' and 'float'",)
2021-11-24 03:18:59,010 WARNING: Retrying (Retry(total=0, connect=None, read=None, redirect=0, status=None)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.VerifiedHTTPSConnection object at 0x7f0414a68080>, 'Connection to 10.10.10.1 timed out. (connect timeout=2.5)')': /api/v1/namespaces/auth/endpoints?labelSelector=application%3Dspilo%2Ccluster-name%3Dauth-postgres-db
2021-11-24 03:18:59,359 ERROR: get_cluster
2021-11-24 03:18:59,359 ERROR: Error communicating with DCS
2021-11-24 03:18:59,359 INFO: DCS is not accessible
2021-11-24 03:18:59,361 WARNING: Loop time exceeded, rescheduling immediately.
2021-11-24 03:19:00.834 31 ERROR connection error: PQconnectStart
2021-11-24 03:19:00.834 31 ERROR
2021-11-24 03:19:00.834 31 ERROR connection error: PQconnectStart
2021-11-24 03:19:00.834 31 ERROR
2021-11-24 03:19:00.834 31 WARNING postgres: default timeout
2021-11-24 03:19:00.834 31 ERROR connection error: PQconnectStart
2021-11-24 03:19:00.834 31 ERROR
2021-11-24 03:19:00.834 31 WARNING hydra: default timeout
2021-11-24 03:19:00.846 31 LOG {ticks: 0, maint: 0, retry: 0}
2021-11-24 03:19:01,513 ERROR: Request to server https://10.10.10.1:443 failed: MaxRetryError("HTTPSConnectionPool(host='10.10.10.1', port=443): Max retries exceeded with url: /api/v1/namespaces/auth/endpoints?labelSelector=application%3Dspilo%2Ccluster-name%3Dauth-postgres-db (Caused by ConnectTimeoutError(<urllib3.connection.VerifiedHTTPSConnection object at 0x7f0414a68d30>, 'Connection to 10.10.10.1 timed out. (connect timeout=2.5)'))",)
2021-11-24 03:19:02,515 ERROR: ObjectCache.run TypeError("unsupported operand type(s) for -: 'NoneType' and 'float'",)
2021-11-24 03:19:05,332 WARNING: Postgresql is not running.
2021-11-24 03:19:05,332 INFO: Lock owner: auth-postgres-db-0; I am auth-postgres-db-1
2021-11-24 03:19:05,337 INFO: pg_controldata:
2021-11-24 03:19:05,337 INFO: Lock owner: auth-postgres-db-0; I am auth-postgres-db-1
2021-11-24 03:19:05,357 INFO: Local timeline=1 lsn=3/27000028
2021-11-24 03:19:05,372 INFO: master_timeline=2
2021-11-24 03:19:05,377 INFO: master: history=1 3/270000A0 no recovery target specified
2021-11-24 03:19:05,378 INFO: Lock owner: auth-postgres-db-0; I am auth-postgres-db-1
2021-11-24 03:19:07,881 WARNING: Retrying (Retry(total=0, connect=None, read=None, redirect=0, status=None)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.VerifiedHTTPSConnection object at 0x7f04155377f0>, 'Connection to 10.10.10.1 timed out. (connect timeout=2.5)')': /api/v1/namespaces/auth/pods/auth-postgres-db-1
2021-11-24 03:19:10,385 INFO: starting as a secondary
2021-11-24 03:19:10,388 WARNING: Loop time exceeded, rescheduling immediately.
2021-11-24 03:19:10 UTC [159304]: [1-1] 619daf2e.26e48 0 FATAL: data directory "/home/postgres/pgdata/pgroot/data" has invalid permissions
2021-11-24 03:19:10 UTC [159304]: [2-1] 619daf2e.26e48 0 DETAIL: Permissions should be u=rwx (0700) or u=rwx,g=rx (0750).
2021-11-24 03:19:10,708 INFO: postmaster pid=159304
2021-11-24 03:19:10,390 INFO: Lock owner: auth-postgres-db-0; I am auth-postgres-db-1
2021-11-24 03:19:12,896 WARNING: Retrying (Retry(total=0, connect=None, read=None, redirect=0, status=None)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.VerifiedHTTPSConnection object at 0x7f0416e07a58>, 'Connection to 10.10.10.1 timed out. (connect timeout=2.5)')': /api/v1/namespaces/auth/pods/auth-postgres-db-1
2021-11-24 03:19:14,594 WARNING: Retry got exception: 'connection problems'
2021-11-24 03:19:15,399 INFO: restarting after failure in progress
2021-11-24 03:19:15,416 INFO: Lock owner: auth-postgres-db-0; I am auth-postgres-db-1
2021-11-24 03:19:17,921 WARNING: Retrying (Retry(total=0, connect=None, read=None, redirect=0, status=None)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.VerifiedHTTPSConnection object at 0x7f0414abda58>, 'Connection to 10.10.10.1 timed out. (connect timeout=2.5)')': /api/v1/namespaces/auth/pods/auth-postgres-db-1
2021-11-24 03:19:20,425 INFO: failed to start postgres
2021-11-24 03:19:20,426 WARNING: Postgresql is not running.
2021-11-24 03:19:20,426 INFO: Lock owner: auth-postgres-db-0; I am auth-postgres-db-1
2021-11-24 03:19:20,430 INFO: pg_controldata:
2021-11-24 03:19:20,431 INFO: Lock owner: auth-postgres-db-0; I am auth-postgres-db-1
2021-11-24 03:19:20,449 INFO: Local timeline=1 lsn=3/27000028
2021-11-24 03:19:20,461 INFO: master_timeline=2
2021-11-24 03:19:20,466 INFO: master: history=1 3/270000A0 no recovery target specified
2021-11-24 03:19:20,467 INFO: Lock owner: auth-postgres-db-0; I am auth-postgres-db-1
2021-11-24 03:19:22,971 WARNING: Retrying (Retry(total=0, connect=None, read=None, redirect=0, status=None)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.VerifiedHTTPSConnection object at 0x7f0414a2ae10>, 'Connection to 10.10.10.1 timed out. (connect timeout=2.5)')': /api/v1/namespaces/auth/pods/auth-postgres-db-1
2021-11-24 03:19:25,472 INFO: starting as a secondary
2021-11-24 03:19:25 UTC [159318]: [1-1] 619daf3d.26e56 0 FATAL: data directory "/home/postgres/pgdata/pgroot/data" has invalid permissions
2021-11-24 03:19:25 UTC [159318]: [2-1] 619daf3d.26e56 0 DETAIL: Permissions should be u=rwx (0700) or u=rwx,g=rx (0750).
2021-11-24 03:19:25,714 INFO: postmaster pid=159318
2021-11-24 03:19:25,743 INFO: Lock owner: auth-postgres-db-0; I am auth-postgres-db-1
2021-11-24 03:19:28,246 WARNING: Retrying (Retry(total=0, connect=None, read=None, redirect=0, status=None)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.VerifiedHTTPSConnection object at 0x7f0414ada7f0>, 'Connection to 10.10.10.1 timed out. (connect timeout=2.5)')': /api/v1/namespaces/auth/pods/auth-postgres-db-1
2021-11-24 03:19:30,750 INFO: failed to start postgres
2021-11-24 03:19:30.857 31 LOG {ticks: 0, maint: 0, retry: 0}
2021-11-24 03:19:31,224 WARNING: Postgresql is not running.
2021-11-24 03:19:31,224 INFO: Lock owner: auth-postgres-db-0; I am auth-postgres-db-1
2021-11-24 03:19:31,229 INFO: pg_controldata:
2021-11-24 03:19:31,229 INFO: Lock owner: auth-postgres-db-0; I am auth-postgres-db-1
2021-11-24 03:19:31,246 INFO: Local timeline=1 lsn=3/27000028
2021-11-24 03:19:31,256 INFO: master_timeline=2
2021-11-24 03:19:31,260 INFO: master: history=1 3/270000A0 no recovery target specified
2021-11-24 03:19:31,260 INFO: Lock owner: auth-postgres-db-0; I am auth-postgres-db-1
2021-11-24 03:19:33,764 WARNING: Retrying (Retry(total=0, connect=None, read=None, redirect=0, status=None)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.VerifiedHTTPSConnection object at 0x7f0416dd3860>, 'Connection to 10.10.10.1 timed out. (connect timeout=2.5)')': /api/v1/namespaces/auth/pods/auth-postgres-db-1
2021-11-24 03:19:36,268 INFO: starting as a secondary
2021-11-24 03:19:36 UTC [159331]: [1-1] 619daf48.26e63 0 FATAL: data directory "/home/postgres/pgdata/pgroot/data" has invalid permissions
2021-11-24 03:19:36 UTC [159331]: [2-1] 619daf48.26e63 0 DETAIL: Permissions should be u=rwx (0700) or u=rwx,g=rx (0750).
2021-11-24 03:19:36,512 INFO: postmaster pid=159331
2021-11-24 03:19:36,526 INFO: Lock owner: auth-postgres-db-0; I am auth-postgres-db-1
2021-11-24 03:19:39,031 WARNING: Retrying (Retry(total=0, connect=None, read=None, redirect=0, status=None)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.VerifiedHTTPSConnection object at 0x7f0416dcb390>, 'Connection to 10.10.10.1 timed out. (connect timeout=2.5)')': /api/v1/namespaces/auth/pods/auth-postgres-db-1
2021-11-24 03:19:41,534 INFO: failed to start postgres
2021-11-24 03:19:41,535 WARNING: Postgresql is not running.
..... repeating until restart ...
@monotek I've got the exact same problem today, on my production cluster ... !
I'm using EKS with Kubernetes version 1.21; the control plane API was unavailable for quite some time this afternoon (between 2:30pm and 3:30pm UTC+1).
Using postgres-operator:v1.7.0 and spilo-13:2.1-p1 with Postgres v13.
Running AWS EKS - eu-west-3 - Paris on 3 availability zones (A, B, C).
I restarted all 3 Postgres pods when the API was back online, and the cluster finally elected a leader.
After some digging I found that some of the cluster nodes went to "Not Ready" status, which potentially led to a Postgres pod rescheduling; this is probably linked to the control plane failure.
Great. We're just in the process of migrating to EKS. I hoped the problem would be gone there :/
Have you also seen the permission problem in the logs, and does a "patronictl list" in one of the pods show the status of at least one cluster member as "start failed"?
As the operator does not support liveness probes, our "workaround" is now a custom sidecar that checks for this state and kills the pod via curl against the Kubernetes API, so the PVC is remounted and the permissions are fixed 😬 (rough sketch below).
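For illustration only, here is a minimal sketch of what such a sidecar loop could look like. This is not the actual patroni-selfheal code; it assumes the Patroni REST API is reachable on localhost:8008, that curl and jq are available in the sidecar image, that NAMESPACE and POD_NAME are injected via the downward API, and that the pod's service account is allowed to delete its own pod.

#!/bin/sh
# Poll the local Patroni REST API; if this member is stuck in "start failed",
# delete our own pod via the Kubernetes API so the PVC gets remounted.
TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)
CACERT=/var/run/secrets/kubernetes.io/serviceaccount/ca.crt
while true; do
  STATE=$(curl -s http://localhost:8008/patroni | jq -r '.state')
  if [ "$STATE" = "start failed" ]; then
    echo "Patroni reports '$STATE', deleting pod $POD_NAME to force a remount"
    curl -s --cacert "$CACERT" -H "Authorization: Bearer $TOKEN" \
      -X DELETE "https://kubernetes.default.svc/api/v1/namespaces/$NAMESPACE/pods/$POD_NAME"
  fi
  sleep 30
done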
@monotek I didn't see any permission problem, except on the Kubernetes API calls made by Patroni. I didn't experience your "disk permission" problem, and I didn't find any "patronictl list" output in the logs.
Each Postgres cluster member went down one after another, and yes, I've seen some "start failed" messages about the Postgres daemon ("Postgres is not started"), probably because it wasn't able to retrieve the cluster status from the Kubernetes API.
For the moment we've made some manual fixes, but we're in the process of migrating to GKE x) so ... I hope the API will be a little more stable with Google, which is apparently a good bet.
Btw, is it "normal" that the cluster is not able to self-heal in such a situation?
No, as the operator does not support liveness probes. See: https://github.com/zalando/postgres-operator/issues/587 and linked issues.
We are currently experiencing the same issue quite regularly (almost daily) on our Linode managed K8s instance. As this is quite a burden, I would greatly appreciate any hints on how to mitigate the problem until it is resolved, and I wanted to ask whether somebody is currently investigating this issue and has some more insight into the problem?
We came up with this hack, running as a sidecar 🙄 https://github.com/monotek/patroni-selfheal
@monotek Thanks for sharing
Have you tried using the Patroni configuration option bypass_api_service? I have enabled this on our test server and am now checking whether it solves the problem (rough sketch of the setting below).
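For context, a rough sketch of where this setting lives in the Patroni configuration; how to pass it through spilo / the operator is an assumption on my side.

# patroni.yaml (sketch) - kubernetes DCS section
kubernetes:
  # resolve the API server endpoints and talk to them directly,
  # instead of going through the kubernetes service ClusterIP
  bypass_api_service: true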
No, our hack works quite well. Since we started using it we have never had that problem again.
But it would be nicer if the operator simply supported configurable health checks, as already requested in: https://github.com/zalando/postgres-operator/issues/587
@CyberDem0n comments?
+1
We carried out a K8s version upgrade from 1.19 to 1.20 and faced similar issues. After the upgrade, we noticed that the file permissions in the data directory had changed:
drwxrws---. 19 postgres postgres 4096 Aug 10 00:00 /home/postgres/pgdata/pgroot/data
If Postgres gets restarted in the pod after this, it fails to come up.
Error logs
2022-08-10 12:38:26,154 INFO: starting as readonly because i had the session lock
2022-08-10 12:38:26 UTC [10413]: [1-1] 62f3a6c2.28ad 0 FATAL: data directory "/home/postgres/pgdata/pgroot/data" has invalid permissions
2022-08-10 12:38:26 UTC [10413]: [2-1] 62f3a6c2.28ad 0 DETAIL: Permissions should be u=rwx (0700) or u=rwx,g=rx (0750).
2022-08-10 12:38:26,489 INFO: postmaster pid=10413
/var/run/postgresql:5432 - no response
Observation:
When fsGroup is defined and the kubelet gets restarted (in our case during the K8s upgrade), the permissions of the data directory change.
One way to handle this could be to define fsGroupChangePolicy: "OnRootMismatch", which I see is already filed as a feature request in #1850 (see the sketch below).
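For illustration, a sketch of the pod-level securityContext that feature request is about. The field names are standard Kubernetes; the fsGroup value and the fact that the operator would have to render this itself are assumptions.

# pod spec securityContext (sketch, not generated by the operator today)
securityContext:
  fsGroup: 103                        # spilo's postgres group id (assumed value)
  fsGroupChangePolicy: OnRootMismatch # only re-chown/chmod the volume when the root dir mismatches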
I've got an issue that looks the same:
2023-02-27 06:02:09.181 35 LOG {ticks: 0, maint: 0, retry: 0}
2023-02-27 06:02:18,013 INFO: Lock owner: postgresql-cluster-0; I am postgresql-cluster-0
2023-02-27 06:02:25,729 INFO: no action. i am the leader with the lock
2023-02-27 06:02:28,361 INFO: Lock owner: postgresql-cluster-0; I am postgresql-cluster-0
2023-02-27 06:02:29,077 INFO: no action. i am the leader with the lock
2023-02-27 06:02:38,585 INFO: Lock owner: postgresql-cluster-0; I am postgresql-cluster-0
2023-02-27 06:02:39.292 35 LOG {ticks: 0, maint: 0, retry: 0}
2023-02-27 06:02:50,601 WARNING: Retry got exception: HTTPSConnectionPool(host='10.96.0.1', port=443): Read timed out. (read timeout=3.198475961999975)
2023-02-27 06:02:50,603 ERROR: failed to update leader lock
2023-02-27 06:02:55,555 INFO: demoted self because failed to update leader lock in DCS
2023-02-27 06:02:55,603 WARNING: Loop time exceeded, rescheduling immediately.
2023-02-27 06:02:55,604 INFO: Lock owner: postgresql-cluster-0; I am postgresql-cluster-0
2023-02-27 06:02:55,605 INFO: closed patroni connection to the postgresql cluster
2023-02-27 06:02:55,782 INFO: updated leader lock during starting after demotion
2023-02-27 06:02:57,642 INFO: postmaster pid=327
/var/run/postgresql:5432 - no response
2023-02-27 06:02:57 UTC [327]: [1-1] 63fc4791.147 0 LOG: Auto detecting pg_stat_kcache.linux_hz parameter...
2023-02-27 06:02:57 UTC [327]: [2-1] 63fc4791.147 0 LOG: pg_stat_kcache.linux_hz is set to 1000000
2023-02-27 06:02:57 UTC [327]: [3-1] 63fc4791.147 0 LOG: starting PostgreSQL 12.4 (Ubuntu 12.4-1.pgdg18.04+1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0, 64-bit
2023-02-27 06:02:57 UTC [327]: [4-1] 63fc4791.147 0 LOG: listening on IPv4 address "0.0.0.0", port 5432
2023-02-27 06:02:57 UTC [327]: [5-1] 63fc4791.147 0 LOG: listening on IPv6 address "::", port 5432
2023-02-27 06:02:57 UTC [327]: [6-1] 63fc4791.147 0 LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2023-02-27 06:02:57 UTC [327]: [7-1] 63fc4791.147 0 LOG: redirecting log output to logging collector process
2023-02-27 06:02:57 UTC [327]: [8-1] 63fc4791.147 0 HINT: Future log output will appear in directory "../pg_log".
/var/run/postgresql:5432 - accepting connections
/var/run/postgresql:5432 - accepting connections
2023-02-27 06:02:58,857 INFO: Lock owner: postgresql-cluster-0; I am postgresql-cluster-0
2023-02-27 06:02:58,858 INFO: establishing a new patroni connection to the postgres cluster
2023-02-27 06:02:58,985 INFO: promoted self to leader because i had the session lock
server promoting
2023-02-27 06:02:59,027 INFO: cleared rewind state after becoming the leader
2023-02-27 06:03:00,043 INFO: Lock owner: postgresql-cluster-0; I am postgresql-cluster-0
Are there any settings to tune this timeout?
HTTPSConnectionPool(host='10.96.0.1', port=443): Read timed out.
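As far as I can tell, that read timeout is derived by Patroni from its retry deadlines rather than set directly; what the operator does expose are the underlying Patroni timing knobs in the cluster manifest, roughly like this (the values shown are the defaults, and the exact effect on the Kubernetes client timeout is an assumption):

# postgresql custom resource (sketch) - spec.patroni timing knobs
spec:
  patroni:
    ttl: 30            # leader lock TTL in seconds
    loop_wait: 10      # seconds between HA loop runs
    retry_timeout: 10  # how long Patroni keeps retrying DCS / Postgres operations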
We have also seen this issue on an RKE2 test cluster; we suspect it happens for us when the VM is migrated during the nightly upgrade of the base OS.
It can be reproduced fairly easily in this setup by pausing the kube-apiserver process for ~40 sec, then waiting for the permissions of the data mount to change, then pausing again for ~40 sec and watching the database demote and fail because of the permissions.
The pod does not report "not ready" to Kubernetes; it stays in the Running state.
Here is a snippet of the code I use to reproduce it:
# $ssh is presumably a wrapper that runs the given command on the control-plane node,
# e.g. ssh="ssh user@control-plane-node".
$ssh pkill --echo -STOP "kube-apiserver"   # pause the API server
sleep 40
$ssh pkill --echo -CONT "kube-apiserver"   # resume it
sleep 30 # wait for API server to get back online
echo "Waiting for Permission change start time: $(date)"
waitForDBMountPermissionChange             # helper: polls the data mount until its permissions change
echo "Done waiting for Permission change time: $(date)"
echo "Pausing again to force primary to step down"
$ssh pkill --echo -STOP "kube-apiserver"
sleep 40
$ssh pkill --echo -CONT "kube-apiserver"
sleep 30 # wait for API server to get back online
waitForDBFailureState                      # helper: waits until a member reports "start failed"
EDIT: The permission change only happens for the Postgres data mount, which indicates that it is something specifically related to Postgres, either the operator or the DB pod itself. Has anyone been able to track down which process is making the permission change?