postgres-operator icon indicating copy to clipboard operation
postgres-operator copied to clipboard

pod-0 unhealthy --> Restart all 3 pods --> 2 healthy pods are not being launched/started at all --> DB cluster remains "stuck"/down

Open Samusername opened this issue 2 years ago • 3 comments

Hi! postgres-operator: v1.8.1 spilo: 2.1-p6 Patroni: 2.1.4 synchronous_mode: true synchronous_mode_strict: true We use "ConfigMap" setup instead of endpoints setup. System: Openshift Are you running Postgres Operator in production? A: Yes, in production pipe. Type of issue? Bug

pods: rdbms-pg-cluster-0 Replica --- We broke this up intentionally. We moved WAL files etc. to break it up. rdbms-pg-cluster-1 Leader rdbms-pg-cluster-2 Sync Standby

After we broke rdbms-pg-cluster-0: If we delete all 3 pods, then DB cluster does not start up: It ends up in following state:

kubectl get pods -A | grep rdbms namespace0 rdbms-pg-cluster-0 1/2 Running 0 43m namespace0 rdbms-pg-operator-6c8b55d586-c28xp 1/1 Running 0 21h

So, a problem:

Only one pod is shown / launching. rdbms-pg-cluster-1 nor rdbms-pg-cluster-2 are not even shown in the list of pods. So, High Availability is not working in this kind of a scenario, it seems.

Q: How should such be tried to be solved (or fixed)?

In postgres-operator or with pod_management_policy?

Reference: pod_management_policy, ordered_ready (default), (or parallel). https://opensource.zalando.com/postgres-operator/docs/reference/operator_parameters.html

We hesitate using "parallel" in pod_management_policy.

Logs from postgres-operator. Click to expand

time="2022-08-17T04:14:57Z" level=info msg="SYNC event has been queued" cluster-name=namespace0/rdbms-pg-cluster pkg=controller worker=0 time="2022-08-17T04:14:57Z" level=info msg="there are 1 clusters running" pkg=controller time="2022-08-17T04:14:57Z" level=info msg="Creating the role binding "postgres-pod" in the "namespace0" namespace" pkg=controller time="2022-08-17T04:14:57Z" level=warning msg="pods and/or Patroni may misfunction due to the lack of permissions: could not create role binding "postgres-pod" : cannot bind the pod service account "postgres-pod" defined in the configuration to the cluster role in the "namespace0" namespace: clusterroles.rbac.authorization.k8s.io "postgres-pod" not found" pkg=controller time="2022-08-17T04:14:57Z" level=info msg="syncing of the cluster started" cluster-name=namespace0/rdbms-pg-cluster pkg=controller worker=0 time="2022-08-17T04:14:57Z" level=debug msg="team API is disabled" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0 time="2022-08-17T04:14:57Z" level=info msg="syncing secrets" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0 time="2022-08-17T04:14:57Z" level=debug msg="syncing master service" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0 time="2022-08-17T04:14:57Z" level=debug msg="syncing replica service" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0 time="2022-08-17T04:14:57Z" level=debug msg="syncing volumes using "pvc" storage resize mode" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0 time="2022-08-17T04:14:57Z" level=info msg="volume claims do not require changes" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0 time="2022-08-17T04:14:57Z" level=debug msg="syncing statefulsets" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0 time="2022-08-17T04:14:57Z" level=debug msg="making GET http request: http://10.129.x3.y3:8008/config" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0 time="2022-08-17T04:15:09Z" level=debug msg="making GET http request: http://10.129.x3.y3:8008/patroni" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0 time="2022-08-17T04:15:09Z" level=debug msg="making GET http request: http://10.129.x4.y4:8008/patroni" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0 time="2022-08-17T04:15:09Z" level=debug msg="making GET http request: http://10.129.x5.y5:8008/patroni" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0 time="2022-08-17T04:15:09Z" level=debug msg="syncing pod disruption budgets" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0 W0817 04:15:09.577175 1 warnings.go:70] policy/v1beta1 PodDisruptionBudget is deprecated in v1.21+, unavailable in v1.25+; use policy/v1 PodDisruptionBudget time="2022-08-17T04:15:09Z" level=debug msg="syncing roles" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0 time="2022-08-17T04:15:09Z" level=debug msg="closing database connection" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0 time="2022-08-17T04:15:09Z" level=debug msg="syncing databases" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0 time="2022-08-17T04:15:09Z" level=debug msg="closing database connection" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0 time="2022-08-17T04:15:09Z" level=debug msg="syncing prepared databases with schemas" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0 time="2022-08-17T04:15:09Z" level=debug msg="syncing connection pooler (master, replica) from (false, nil) to (false, nil)" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0 time="2022-08-17T04:15:09Z" level=info msg="cluster has been synced" cluster-name=namespace0/rdbms-pg-cluster pkg=controller worker=0 time="2022-08-17T04:44:57Z" level=info msg="SYNC event has been queued" cluster-name=namespace0/rdbms-pg-cluster pkg=controller worker=0 time="2022-08-17T04:44:57Z" level=info msg="there are 1 clusters running" pkg=controller time="2022-08-17T04:44:57Z" level=info msg="Creating the role binding "postgres-pod" in the "namespace0" namespace" pkg=controller time="2022-08-17T04:44:57Z" level=warning msg="pods and/or Patroni may misfunction due to the lack of permissions: could not create role binding "postgres-pod" : cannot bind the pod service account "postgres-pod" defined in the configuration to the cluster role in the "namespace0" namespace: clusterroles.rbac.authorization.k8s.io "postgres-pod" not found" pkg=controller time="2022-08-17T04:44:57Z" level=info msg="syncing of the cluster started" cluster-name=namespace0/rdbms-pg-cluster pkg=controller worker=0 time="2022-08-17T04:44:57Z" level=debug msg="team API is disabled" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0 time="2022-08-17T04:44:57Z" level=info msg="syncing secrets" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0 time="2022-08-17T04:44:57Z" level=debug msg="syncing master service" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0 time="2022-08-17T04:44:57Z" level=debug msg="syncing replica service" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0 time="2022-08-17T04:44:57Z" level=debug msg="syncing volumes using "pvc" storage resize mode" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0 time="2022-08-17T04:44:57Z" level=info msg="volume claims do not require changes" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0 time="2022-08-17T04:44:57Z" level=debug msg="syncing statefulsets" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0 time="2022-08-17T04:44:57Z" level=debug msg="making GET http request: http://10.129.x2.y2:8008/config" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0 time="2022-08-17T04:45:09Z" level=debug msg="making GET http request: http://10.129.x2.y2:8008/patroni" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0 time="2022-08-17T04:45:11Z" level=debug msg="syncing pod disruption budgets" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0 W0817 04:45:11.038395 1 warnings.go:70] policy/v1beta1 PodDisruptionBudget is deprecated in v1.21+, unavailable in v1.25+; use policy/v1 PodDisruptionBudget time="2022-08-17T04:45:11Z" level=debug msg="syncing roles" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0 time="2022-08-17T04:45:11Z" level=warning msg="could not connect to Postgres database: dial tcp 172.30.x.y:5432: connect: connection refused" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0 time="2022-08-17T04:45:26Z" level=warning msg="could not connect to Postgres database: dial tcp 172.30.x.y:5432: connect: connection refused" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0 time="2022-08-17T04:45:41Z" level=warning msg="could not connect to Postgres database: dial tcp 172.30.x.y:5432: connect: connection refused" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0 time="2022-08-17T04:45:56Z" level=warning msg="could not connect to Postgres database: dial tcp 172.30.x.y:5432: connect: connection refused" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0 time="2022-08-17T04:46:11Z" level=warning msg="could not connect to Postgres database: dial tcp 172.30.x.y:5432: connect: connection refused" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0 time="2022-08-17T04:46:26Z" level=warning msg="could not connect to Postgres database: dial tcp 172.30.x.y:5432: connect: connection refused" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0 time="2022-08-17T04:46:41Z" level=warning msg="could not connect to Postgres database: dial tcp 172.30.x.y:5432: connect: connection refused" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0 time="2022-08-17T04:46:56Z" level=warning msg="could not connect to Postgres database: dial tcp 172.30.x.y:5432: connect: connection refused" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0 time="2022-08-17T04:46:56Z" level=warning msg="error while syncing cluster state: could not sync roles: could not init db connection: could not init db connection: still failing after 8 retries" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0 time="2022-08-17T04:46:56Z" level=error msg="could not sync cluster: could not sync roles: could not init db connection: could not init db connection: still failing after 8 retries" cluster-name=namespace0/rdbms-pg-cluster pkg=controller worker=0 time="2022-08-17T05:14:57Z" level=info msg="SYNC event has been queued" cluster-name=namespace0/rdbms-pg-cluster pkg=controller worker=0 time="2022-08-17T05:14:57Z" level=info msg="there are 1 clusters running" pkg=controller time="2022-08-17T05:14:57Z" level=info msg="Creating the role binding "postgres-pod" in the "namespace0" namespace" pkg=controller time="2022-08-17T05:14:57Z" level=warning msg="pods and/or Patroni may misfunction due to the lack of permissions: could not create role binding "postgres-pod" : cannot bind the pod service account "postgres-pod" defined in the configuration to the cluster role in the "namespace0" namespace: clusterroles.rbac.authorization.k8s.io "postgres-pod" not found" pkg=controller time="2022-08-17T05:14:57Z" level=info msg="syncing of the cluster started" cluster-name=namespace0/rdbms-pg-cluster pkg=controller worker=0 time="2022-08-17T05:14:57Z" level=debug msg="team API is disabled" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0 time="2022-08-17T05:14:57Z" level=info msg="syncing secrets" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0 time="2022-08-17T05:14:57Z" level=debug msg="syncing master service" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0 time="2022-08-17T05:14:57Z" level=debug msg="syncing replica service" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0 time="2022-08-17T05:14:57Z" level=debug msg="syncing volumes using "pvc" storage resize mode" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0 time="2022-08-17T05:14:57Z" level=info msg="volume claims do not require changes" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0 time="2022-08-17T05:14:57Z" level=debug msg="syncing statefulsets" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0 time="2022-08-17T05:14:57Z" level=debug msg="making GET http request: http://10.129.x2.y2:8008/config" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0 time="2022-08-17T05:15:09Z" level=debug msg="making GET http request: http://10.129.x2.y2:8008/patroni" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0 time="2022-08-17T05:15:11Z" level=debug msg="syncing pod disruption budgets" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0 W0817 05:15:11.047404 1 warnings.go:70] policy/v1beta1 PodDisruptionBudget is deprecated in v1.21+, unavailable in v1.25+; use policy/v1 PodDisruptionBudget time="2022-08-17T05:15:11Z" level=debug msg="syncing roles" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0 time="2022-08-17T05:15:11Z" level=warning msg="could not connect to Postgres database: dial tcp 172.30.x.y:5432: connect: connection refused" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0 time="2022-08-17T05:15:26Z" level=warning msg="could not connect to Postgres database: dial tcp 172.30.x.y:5432: connect: connection refused" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0 time="2022-08-17T05:15:41Z" level=warning msg="could not connect to Postgres database: dial tcp 172.30.x.y:5432: connect: connection refused" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0 time="2022-08-17T05:15:56Z" level=warning msg="could not connect to Postgres database: dial tcp 172.30.x.y:5432: connect: connection refused" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0 time="2022-08-17T05:16:11Z" level=warning msg="could not connect to Postgres database: dial tcp 172.30.x.y:5432: connect: connection refused" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0 time="2022-08-17T05:16:26Z" level=warning msg="could not connect to Postgres database: dial tcp 172.30.x.y:5432: connect: connection refused" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0 time="2022-08-17T05:16:41Z" level=warning msg="could not connect to Postgres database: dial tcp 172.30.x.y:5432: connect: connection refused" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0 time="2022-08-17T05:16:56Z" level=warning msg="could not connect to Postgres database: dial tcp 172.30.x.y:5432: connect: connection refused" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0 time="2022-08-17T05:16:56Z" level=warning msg="error while syncing cluster state: could not sync roles: could not init db connection: could not init db connection: still failing after 8 retries" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0 time="2022-08-17T05:16:56Z" level=error msg="could not sync cluster: could not sync roles: could not init db connection: could not init db connection: still failing after 8 retries" cluster-name=namespace0/rdbms-pg-cluster pkg=controller worker=0

Anything common with following? An older modification: https://github.com/zalando/postgres-operator/issues/1765 Fixed: Rolling upgrade does not proceed anymore, if pod ends up in unhealthy state during the rolling upgrade.

=========================== I am not sure are there WAs to fix the situation manually. PVC of rdbms-pg-cluster-0 pod was tried to be removed, etc. But still it other 2 pods did not launch at all. Also following text was shown in logs of rdbms-pg-cluster-0 pod: 2022-08-22 06:03:21,105 INFO: waiting for leader to bootstrap (Stays there.)

=========================== This seems like quite a serious problem currently. (Urgent to try to fix.)

Samusername avatar Aug 17 '22 05:08 Samusername

Same issue as #1978

christiaangoossens avatar Aug 20 '22 11:08 christiaangoossens

I am testing with: pod_management_policy: parallel link.

Rolling updates:

"Parallel" configuration does not seem to affect to rolling updates (or rolling upgrades). Like it has been documented:

Parallel Pod Management
...
This option only affects the behavior for scaling operations. **Updates are not affected**.

https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/#pod-management-policies I tested it also.

Samusername avatar Aug 22 '22 06:08 Samusername

Same issue:

root@acid-aneta-0:/home/postgres# patronictl list

  • Cluster: acid-aneta (7076099222737690693) ------+----+-----------+ | Member | Host | Role | State | TL | Lag in MB | +--------------+--------------+---------+---------+----+-----------+ | acid-aneta-0 | 10.42.218.33 | Replica | stopped | | unknown | | acid-aneta-1 | 10.42.218.13 | Replica | running | 21 | 55163 | +--------------+--------------+---------+---------+----+-----------+ root@acid-aneta-0:/home/postgres# root@acid-aneta-0:/home/postgres# patronictl reinit acid-aneta acid-aneta-0
  • Cluster: acid-aneta (7076099222737690693) ------+----+-----------+ | Member | Host | Role | State | TL | Lag in MB | +--------------+--------------+---------+---------+----+-----------+ | acid-aneta-0 | 10.42.218.33 | Replica | stopped | | unknown | | acid-aneta-1 | 10.42.218.13 | Replica | running | 21 | 55163 | +--------------+--------------+---------+---------+----+-----------+ Are you sure you want to reinitialize members acid-aneta-0? [y/N]: y 2022-09-03 08:22:17,559 - WARNING - Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7ff0322924a8>: Failed to establish a new connection: [Errno 111] Connection refused',)': /reinitialize 2022-09-03 08:22:17,559 - WARNING - Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7ff032292780>: Failed to establish a new connection: [Errno 111] Connection refused',)': /reinitialize 2022-09-03 08:22:17,560 - WARNING - Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7ff0322925f8>: Failed to establish a new connection: [Errno 111] Connection refused',)': /reinitialize Traceback (most recent call last): File "/usr/lib/python3/dist-packages/urllib3/connection.py", line 144, in _new_conn (self.host, self.port), self.timeout, **extra_kw) File "/usr/lib/python3/dist-packages/urllib3/util/connection.py", line 83, in create_connection raise err File "/usr/lib/python3/dist-packages/urllib3/util/connection.py", line 73, in create_connection sock.connect(sa) ConnectionRefusedError: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last): File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 601, in urlopen chunked=chunked) File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 357, in _make_request conn.request(method, url, **httplib_request_kw) File "/usr/lib/python3.6/http/client.py", line 1281, in request self._send_request(method, url, body, headers, encode_chunked) File "/usr/lib/python3.6/http/client.py", line 1327, in _send_request self.endheaders(body, encode_chunked=encode_chunked) File "/usr/lib/python3.6/http/client.py", line 1276, in endheaders self._send_output(message_body, encode_chunked=encode_chunked) File "/usr/lib/python3.6/http/client.py", line 1042, in _send_output self.send(msg) File "/usr/lib/python3.6/http/client.py", line 980, in send self.connect() File "/usr/lib/python3/dist-packages/urllib3/connection.py", line 169, in connect conn = self._new_conn() File "/usr/lib/python3/dist-packages/urllib3/connection.py", line 153, in _new_conn self, "Failed to establish a new connection: %s" % e) urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7ff032292b00>: Failed to establish a new connection: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last): File "/usr/local/bin/patronictl", line 11, in sys.exit(ctl()) File "/usr/lib/python3/dist-packages/click/core.py", line 722, in call return self.main(*args, **kwargs) File "/usr/lib/python3/dist-packages/click/core.py", line 697, in main rv = self.invoke(ctx) File "/usr/lib/python3/dist-packages/click/core.py", line 1066, in invoke return _process_result(sub_ctx.command.invoke(sub_ctx)) File "/usr/lib/python3/dist-packages/click/core.py", line 895, in invoke return ctx.invoke(self.callback, **ctx.params) File "/usr/lib/python3/dist-packages/click/core.py", line 535, in invoke return callback(*args, **kwargs) File "/usr/lib/python3/dist-packages/click/decorators.py", line 27, in new_func return f(get_current_context().obj, *args, **kwargs) File "/usr/local/lib/python3.6/dist-packages/patroni/ctl.py", line 613, in reinit r = request_patroni(member, 'post', 'reinitialize', body) File "/usr/local/lib/python3.6/dist-packages/patroni/ctl.py", line 171, in request_patroni return request_executor(member, method, endpoint, data) File "/usr/local/lib/python3.6/dist-packages/patroni/request.py", line 53, in call return self.request(method, url, data, **kwargs) File "/usr/local/lib/python3.6/dist-packages/patroni/request.py", line 46, in request return self._pool.request(method.upper(), url, body=body, **kwargs) File "/usr/lib/python3/dist-packages/urllib3/request.py", line 70, in request **urlopen_kw) File "/usr/lib/python3/dist-packages/urllib3/request.py", line 148, in request_encode_body return self.urlopen(method, url, **extra_kw) File "/usr/lib/python3/dist-packages/urllib3/poolmanager.py", line 322, in urlopen response = conn.urlopen(method, u.request_uri, **kw) File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 668, in urlopen **response_kw) File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 668, in urlopen **response_kw) File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 668, in urlopen **response_kw) File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 639, in urlopen _stacktrace=sys.exc_info()[2]) File "/usr/lib/python3/dist-packages/urllib3/util/retry.py", line 398, in increment raise MaxRetryError(_pool, url, error or ResponseError(cause)) urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='10.42.218.33', port=8008): Max retries exceeded with url: /reinitialize (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7ff032292b00>: Failed to establish a new connection: [Errno 111] Connection refused',)) root@acid-aneta-0:/home/postgres#

budiholan-github avatar Sep 03 '22 08:09 budiholan-github