postgres-operator

Cannot set `use_endpoints: false` for PGO

Open waldner opened this issue 1 year ago • 4 comments

Overview

Add a concise description of what the bug is.

Environment

  • Platform: OpenShift
  • Platform Version: unsure, k8s 1.25.12
  • PGO Image Tag: ubi8-15.3-0
  • Postgres Version: 15

Steps to Reproduce

REPRO

Provide steps to get to the error condition:

Deploy a PostgresCluster object.

EXPECTED

The database container should successfully start.

ACTUAL

The database container reports errors.

Logs

The log is full of messages like:

2024-01-25 18:07:10,329 INFO: Lock owner: hydra-db-00-xhgx-0; I am hydra-db-00-xhgx-0
2024-01-25 18:07:10,379 ERROR: Permission denied
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/patroni/dcs/kubernetes.py", line 1023, in _update_leader_with_retry
    return self._patch_or_create(self.leader_path, annotations, resource_version, ips=ips, retry=_retry)
  File "/usr/local/lib/python3.6/site-packages/patroni/dcs/kubernetes.py", line 966, in _patch_or_create
    ret = retry(func, self._namespace, body) if retry else func(self._namespace, body)
  File "/usr/local/lib/python3.6/site-packages/patroni/dcs/kubernetes.py", line 1020, in _retry
    return retry(*args, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/patroni/utils.py", line 334, in __call__
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/patroni/dcs/kubernetes.py", line 514, in wrapper
    return getattr(self._core_v1_api, func)(*args, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/patroni/dcs/kubernetes.py", line 450, in wrapper
    return self._api_client.call_api(method, path, headers, body, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/patroni/dcs/kubernetes.py", line 419, in call_api
    return self._handle_server_response(response, _preload_content)
  File "/usr/local/lib/python3.6/site-packages/patroni/dcs/kubernetes.py", line 249, in _handle_server_response
    raise k8s_client.rest.ApiException(http_resp=response)
patroni.dcs.kubernetes.K8sClient.rest.ApiException: (403)
Reason: Forbidden
HTTP response headers: HTTPHeaderDict({'Audit-Id': '38053985-eab3-4c37-922f-e8dbaecaef0c', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'Strict-Transport-Security': 'max-age=31536000; includeSubDomains; preload', 'X-Kubernetes-Pf-Flowschema-Uid': '6011019b-e5bc-49a7-8924-eb3a8944e1f1', 'X-Kubernetes-Pf-Prioritylevel-Uid': '39165705-4df3-4770-8f74-812a0fa2c009', 'Date': 'Thu, 25 Jan 2024 18:07:10 GMT', 'Content-Length': '269'})
HTTP response body: b'{"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"endpoints \\"hydra-db-ha\\" is forbidden: endpoint address 10.130.5.91 is not allowed","reason":"Forbidden","details":{"name":"hydra-db-ha","kind":"endpoints"},"code":403}\n'

2024-01-25 18:07:10,379 ERROR: failed to update leader lock
2024-01-25 18:07:10,379 INFO: not promoting because failed to update leader lock in DCS

Additional Information

From what I could find, this is due to Patroni using use_endpoints: true on OpenShift, where it should instead use use_endpoints: false so that ConfigMaps are used (correct me if I'm wrong).
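
For reference, the setting in question lives under the kubernetes section of Patroni's own configuration. Below is a minimal sketch of that fragment, assuming a standalone Patroni YAML file. PGO generates this configuration itself, and nothing in this report shows a supported way to override it.

```yaml
# Fragment of a Patroni configuration file (illustrative sketch, not PGO's generated output).
kubernetes:
  # true:  the leader lock and cluster state are kept in Endpoints objects.
  # false: ConfigMaps are used instead.
  use_endpoints: true
```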

waldner avatar Jan 25 '24 18:01 waldner

Hello @waldner! We've always set use_endpoints: true in v5 and have plenty of users running on OpenShift, so you don't have to set use_endpoints to false, but I do think there are a couple of things you must do to avoid this issue. The first is to add create permissions for the endpoints and endpoints/restricted resources in your RBAC, which you can see we've done in our examples here and here. Another is that your pods should not be running in any of OpenShift's default namespaces. From the OpenShift docs:

You cannot assign a SCC to pods created in one of the default namespaces: default, kube-system, kube-public, openshift-node, openshift-infra, openshift. These namespaces should not be used for running pods or services.

There might even be other things, but let's start there. Can you check these things? If you've got the correct RBAC and you're not running pods in a default namespace, please answer a few questions: What version of Openshift are you using? What version of CPK? Can you send PGO logs?
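
As a rough illustration of the RBAC point above, a rule granting create on both resources might look like the following. This is only a sketch: the Role name and namespace are hypothetical, and whether the rule belongs in a Role or a ClusterRole depends on how PGO was installed. The 403 in the log ("endpoint address 10.130.5.91 is not allowed") appears to be the kind of rejection this permission is meant to avoid.

```yaml
# Sketch of an RBAC rule; metadata values are placeholders, not taken from the report.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: example-endpoints-access      # hypothetical name
  namespace: example-namespace        # hypothetical; use the cluster's own namespace
rules:
  - apiGroups: [""]                   # core API group
    resources:
      - endpoints
      - endpoints/restricted          # OpenShift-specific subresource named in the reply
    verbs:
      - create
```

If the rule is already present, the remaining suspects are the namespace and the openshift setting discussed in the replies that follow.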

dsessler7 avatar Jan 26 '24 22:01 dsessler7

I'm not running in the default namespace, however now I've checked the manifests and I've noticed that the PostgresCluster object has openshift: false (to be investigated why). Could it be the source of the problem?

waldner avatar Jan 26 '24 23:01 waldner

Indeed removing openshift: false makes the database container work, however now I'm hitting this one: https://github.com/CrunchyData/postgres-operator/issues/3707 (it worked before, probably due to the openshift: false setting). And indeed I see that the pod is running with the anyuid SCC instead of the restricted one. I don't have permissions to change the policies themselves.

EDIT: this is due to the default service account being bound to the anyuid SCC policy (probably as a workaround to make something else work...I'll find out the details but I'd rather not touch this now). At the same time, I see that it's not possible to use a service account other than default for this pod (see https://github.com/CrunchyData/postgres-operator/issues/2749).

Any easy way out of the mess?

waldner avatar Jan 26 '24 23:01 waldner

Hi @waldner. I wanted to reach out and see if you are still having this issue. One quick suggestion would be to explicitly set openshift: true, just in case for some reason it's not being set correctly by default (as mentioned in this troubleshooting section of the documentation). Beyond that, the SCC configuration you described does sound like it would take more digging to determine the best path forward. Have you tried creating a fresh Postgres cluster from scratch and seeing if that comes up as expected?
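
A minimal sketch of what setting the field explicitly could look like on the PostgresCluster object. The cluster name hydra-db is an assumption inferred from the pod name in the log, and the required instances and backups sections are omitted here.

```yaml
# Partial PostgresCluster manifest (sketch); only the fields relevant to this thread are shown.
apiVersion: postgres-operator.crunchydata.com/v1beta1
kind: PostgresCluster
metadata:
  name: hydra-db        # assumed from the pod name hydra-db-00-xhgx-0
spec:
  openshift: true       # set explicitly instead of relying on detection
  postgresVersion: 15
  # instances, backups, and any other required fields are omitted from this sketch
```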

tjmoore4 avatar Feb 22 '24 20:02 tjmoore4

Since we haven't heard back on this issue for some time, I am closing this issue. If you need further assistance, feel free to re-open this issue or ask a question in our Discord server.

ValClarkson avatar Apr 30 '24 20:04 ValClarkson