
postgres-startup init fail

Open AlexisDesbonnets opened this issue 2 years ago • 23 comments

Overview

When I create a Postgres cluster, the instance crashes on initialization. The bug happens only on Kubernetes 1.22.

Environment

  • Platform: Kubernetes on OVH Managed Kubernetes Service
  • Platform Version: 1.22.2
  • PGO Image Tag: ubi8-5.0.4-0
  • Postgres Version: 13.5
  • Storage: csi-cinder-classic

Steps to Reproduce

REPRO

  1. Install the postgres-operator example with kustomize
  2. Install the postgres hippo cluster defined in the example

EXPECTED

  1. The Postgres cluster starts correctly.

ACTUAL

The postgres-startup container fails to initialize the cluster:

Initializing ...
::postgres-operator: uid::26
::postgres-operator: gid::26
::postgres-operator: postgres path::/usr/pgsql-13/bin/postgres
::postgres-operator: postgres version::postgres (PostgreSQL) 13.5
::postgres-operator: config directory::/pgdata/pg13
::postgres-operator: data directory::/pgdata/pg13
install: cannot change permissions of ‘/pgdata/pg13’: No such file or directory
Events:
  Type     Reason                  Age                From                     Message
  ----     ------                  ----               ----                     -------
  Warning  FailedScheduling        40s                default-scheduler        0/1 nodes are available: 1 pod has unbound immediate PersistentVolumeClaims.
  Normal   Scheduled               38s                default-scheduler        Successfully assigned postgres-operator/hippo-instance1-mzs4-0 to node-b4422cc5-9dc5-4116-a4d5-a194f6171309
  Normal   SuccessfulAttachVolume  35s                attachdetach-controller  AttachVolume.Attach succeeded for volume "ovh-managed-kubernetes-uo1sfr-pvc-729b89ad-c0dd-48f9-847b-36e36d8cc8e6"
  Normal   Pulled                  30s                kubelet                  Successfully pulled image "registry.developers.crunchydata.com/crunchydata/crunchy-postgres:centos8-13.5-0" in 1.260460237s
  Normal   Pulled                  28s                kubelet                  Successfully pulled image "registry.developers.crunchydata.com/crunchydata/crunchy-postgres:centos8-13.5-0" in 1.232337581s
  Normal   Pulling                 15s (x3 over 32s)  kubelet                  Pulling image "registry.developers.crunchydata.com/crunchydata/crunchy-postgres:centos8-13.5-0"
  Normal   Created                 14s (x3 over 30s)  kubelet                  Created container postgres-startup
  Normal   Started                 14s (x3 over 30s)  kubelet                  Started container postgres-startup
  Normal   Pulled                  14s                kubelet                  Successfully pulled image "registry.developers.crunchydata.com/crunchydata/crunchy-postgres:centos8-13.5-0" in 1.226427728s
  Warning  BackOff                 13s (x3 over 27s)  kubelet                  Back-off restarting failed container

AlexisDesbonnets avatar Nov 23 '21 10:11 AlexisDesbonnets

Does your storage class require you to set up any supplemental groups, e.g.:

https://access.crunchydata.com/documentation/postgres-operator/v5/references/crd/#postgresclusterspec

There are some more notes about supplemental groups in the 5.0.3 release notes:

https://access.crunchydata.com/documentation/postgres-operator/v5/releases/5.0.3/
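For reference, a minimal sketch of where that would go in the PostgresCluster manifest (the group ID 65534 is only an example; use whatever your storage class documents):

apiVersion: postgres-operator.crunchydata.com/v1beta1
kind: PostgresCluster
metadata:
  name: hippo
spec:
  # supplemental group applied to the Postgres pods' security context
  supplementalGroups:
    - 65534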

jkatz avatar Nov 23 '21 18:11 jkatz

Just wanted to mention that I'm hitting the same issue, on OVH Managed Kubernetes as well.

Environment

  • Platform: Kubernetes on OVH Managed Kubernetes Service
  • Platform Version: 1.22.2
  • PGO Image Tag: ubi8-5.0.1-0
  • Postgres Version: 13.3
  • Storage: csi-cinder-high-speed

Logs (from postgres-startup container)

[postgres-startup] Initializing ...
[postgres-startup] ::postgres-operator: uid::26
[postgres-startup] ::postgres-operator: gid::26 65534
[postgres-startup] ::postgres-operator: postgres path::/usr/pgsql-13/bin/postgres
[postgres-startup] ::postgres-operator: postgres version::postgres (PostgreSQL) 13.3
[postgres-startup] ::postgres-operator: config directory::/pgdata/pg13
[postgres-startup] ::postgres-operator: data directory::/pgdata/pg13
[postgres-startup] install: cannot change permissions of ‘/pgdata/pg13’: No such file or directory

I have been trying to resolve this issue for about an hour now, without much progress. (I am not very experienced with the details of the postgres-operator, or even Kubernetes in general, but am trying to learn.)

Note that I just (earlier this night) updated my OVH Kubernetes cluster from 1.20 to 1.22, so as the OP said, this is likely the cause of the issue. (though I can't rule out that there are other contributing causes/changes, as I started the reset+update to try to get around a persistent-volume not-mounting issue I was having).

Venryx avatar Dec 05 '21 12:12 Venryx

Okay, I reset my cluster and changed its version to 1.21 (1.21.5-0), and the error above is no longer being hit:

[postgres-startup] Initializing ...
[postgres-startup] ::postgres-operator: uid::26
[postgres-startup] ::postgres-operator: gid::26 65534
[postgres-startup] ::postgres-operator: postgres path::/usr/pgsql-13/bin/postgres
[postgres-startup] ::postgres-operator: postgres version::postgres (PostgreSQL) 13.3
[postgres-startup] ::postgres-operator: config directory::/pgdata/pg13
[postgres-startup] ::postgres-operator: data directory::/pgdata/pg13
[K8s EVENT: Pod debate-map-repo-host-0 (ns: postgres-operator)] Successfully pulled image "registry.developers.crunchydata.com/crunchydata/crunchy-pgbackrest:centos8-2.33-1" in 37.671615103s
[K8s EVENT: Pod debate-map-repo-host-0 (ns: postgres-operator)] Pulling image "registry.developers.crunchydata.com/crunchydata/crunchy-pgbackrest:centos8-2.33-1"
[nss-wrapper-init] nss_wrapper: user exists
[nss-wrapper-init] nss_wrapper: group exists
[nss-wrapper-init] nss_wrapper: environment configured
[K8s EVENT: Pod debate-map-instance1-hfj5-0 (ns: postgres-operator)] Successfully pulled image "registry.developers.crunchydata.com/crunchydata/crunchy-postgres-ha:centos8-13.3-1" in 20.861051117s
[K8s EVENT: Pod debate-map-repo-host-0 (ns: postgres-operator)] Successfully pulled image "registry.developers.crunchydata.com/crunchydata/crunchy-pgbackrest:centos8-2.33-1" in 7.632752328s
[K8s EVENT: Pod debate-map-instance1-hfj5-0 (ns: postgres-operator)] Pulling image "registry.developers.crunchydata.com/crunchydata/crunchy-postgres-ha:centos8-13.3-1"
Server listening on 0.0.0.0 port 2022.
Server listening on :: port 2022.
[...]

~So something about version 1.22 of Kubernetes is indeed causing the issue.~ (EDIT: Apparently it's more complicated than just this version change, as v1.21 also has issues [on a different cloud provider] for some of the commenters below.)

I spent a good 15 minutes searching through the v1.22+ changelog here, but did not notice anything likely to be the cause of the issue. Perhaps there is something that OVH changed internally about their provisioning for v1.22 nodes, beyond the changes to Kubernetes itself?

Anyway, if others are having the same issue (and are not using OVH Managed Kubernetes), please comment here so we can narrow down the source of the problem.

Venryx avatar Dec 05 '21 13:12 Venryx

In the Operator log on startup (the pgo-... Pod), could you see if there are any lines that indicate what spec.openshift is set to?

jkatz avatar Dec 06 '21 18:12 jkatz

I'm fairly confident this is similar to an issue that prompted #2897 and should be fixed in that.

The interim solution is to set the following in your PostgresCluster spec:

spec:
  openshift: false
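
If the cluster already exists, one way to apply this without recreating it (a sketch, assuming the example cluster name hippo in the postgres-operator namespace) is a merge patch:

kubectl patch postgrescluster hippo -n postgres-operator \
  --type merge -p '{"spec":{"openshift":false}}'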

jkatz avatar Dec 07 '21 02:12 jkatz

I just wanted to mention that I'm hitting the same issue on Linode's managed Kubernetes service (LKE) as well, on both Kubernetes versions 1.21.1 and 1.22.2.

Environment

  • Platform: Kubernetes on Linode managed kubernetes service (LKE)
  • Platform Version: 1.21.1 and 1.22.2
  • PGO Image Tag: ubi8-5.0.3-0
  • Postgres Version: 13.4
  • Storage: csi-cinder-high-speed

Logs (from postgres-startup container)

Initializing ...
::postgres-operator: uid::26
::postgres-operator: gid::26
::postgres-operator: postgres path::/usr/pgsql-13/bin/postgres
::postgres-operator: postgres version::postgres (PostgreSQL) 13.4
::postgres-operator: config directory::/pgdata/pg13
::postgres-operator: data directory::/pgdata/pg13
install: cannot change permissions of ‘/pgdata/pg13’: No such file or directory

I tried @jkatz's suggestions, configuring the PostgresCluster like this:

spec:
  openshift: false

and then

spec:
  supplementalGroups:
    - 65534

and then with both

spec:
  openshift: false
  supplementalGroups:
    - 65534

But nothing worked. Any idea?

Leedwing avatar Dec 25 '21 19:12 Leedwing

I'm in the same situation as @Leedwing here: LKE (I tested 1.21 and 1.22), using both specifications (openshift and/or supplementalGroups), but neither worked.

jisaitua avatar Dec 26 '21 19:12 jisaitua

Update

I deployed the postgres-operator-examples on DigitalOcean Kubernetes Service v1.20.11 and v1.21.1

Environment

  • Platform: DigitalOcean Kubernetes service
  • Platform Version: 1.20.11 and 1.21.5
  • PGO Image Tag: ubi8-5.0.3-0
  • Postgres Version: 13.4
  • Storage: csi-cinder-high-speed

AND EVERYTHING WORKED FINE!

I would say that on LKE the issue might be caused by some internal behavior, probably related to permissions on block storage, because the monitoring components shipped with postgres-operator-examples that use block storage also cannot start on LKE, due to the same kind of error (permission denied):

  • Grafana
GF_PATHS_DATA='/data/grafana/data' is not writable.
You may have issues with file permissions, more information here: http://docs.grafana.org/installation/docker/#migrate-to-v51-or-later
...
...
service init failed: failed to connect to database: mkdir /data/grafana: permission denied
  • Prometheus
level=error ts=2021-12-27T19:36:47.147Z caller=query_logger.go:87 component=activeQueryTracker msg="Error opening query log file" file=/prometheus/queries.active err="open /prometheus/queries.active: permission denied"
panic: Unable to create mmap-ed active query log
  • Alertmanager
level=error ts=2021-12-27T19:41:09.010Z caller=nflog.go:365 component=nflog msg="Running maintenance failed" err="open /alertmanager/nflog.73fec05415888652: permission denied"
level=info ts=2021-12-27T19:41:09.011Z caller=silence.go:389 component=silences msg="Running maintenance failed" err="open /alertmanager/silences.195c39cdee14e24: permission denied

Leedwing avatar Dec 27 '21 20:12 Leedwing

I installed it a few weeks ago on LKE (1.21) and it was working (and I was surprised how easy it was), so I guess the problem started due to some changes in Postgres Operator.

jisaitua avatar Dec 27 '21 22:12 jisaitua

@jisaitua I also got it working a few weeks ago on LKE (1.21). It was also working on Kubernetes 1.21.5 shipped with Docker Desktop locally, and it still is. I've checked Linode's changelog (https://developers-linode.netlify.app/changelog/cloud-manager/) and noticed that they made an update related to block storage (I guess this might be the cause of the permission issue). Another reason I suspect the change is on Linode's side is that I now get CNAME records created by Kubernetes External DNS instead of A records, as it was a few weeks ago. And with the same scripts, everything works as expected on DigitalOcean.

PS: I'm using the same postgres-operator-examples as a few weeks ago. I didn't pull the new changes.

Leedwing avatar Dec 27 '21 23:12 Leedwing

PS: I'm using the same postgres-operator-examples as a few weeks ago. I didn't pull the new changes.

@Leedwing Good point. If that's the case, then the problem looks like a Linode problem. Maybe @jkatz can help to find some way to solve it.

jisaitua avatar Dec 27 '21 23:12 jisaitua

Good point. If that's the case, then the problem looks like a Linode problem.

Linode + OVH Cloud, yes; perhaps it's a common error (or complication for postgres-operator) that the managed-kubernetes providers are making, in response to some upcoming Kubernetes change or something.

Venryx avatar Dec 28 '21 04:12 Venryx

Linode support responded to me saying: "I agree that this appears to be related to a change on our end. To be precise, our CSI Driver had an update that can cause permissions issues with mounting Volumes for certain deployments. We've seen this is especially prevalent with PostgreSQL deployments."

They proposed a workaround - "Run the command below and then redeploy your workloads"

-p='[{"op": "add", "path": "/spec/template/spec/containers/0/args/-", "value": "--default-fstype=ext4"}]'

This worked for me, and I hope it will help someone here.
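
For clarity, the snippet above is only the -p argument of a kubectl patch. A rough sketch of the full command (the StatefulSet name csi-linode-controller and the kube-system namespace are assumptions about Linode's CSI controller, not confirmed by Linode support here) would be:

kubectl patch statefulset csi-linode-controller -n kube-system --type=json \
  -p='[{"op": "add", "path": "/spec/template/spec/containers/0/args/-", "value": "--default-fstype=ext4"}]'

After patching, redeploy the affected workloads so newly provisioned volumes are formatted with ext4.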

Leedwing avatar Jan 21 '22 11:01 Leedwing

I also have this problem with an on-prem kubeadm 1.23 deployment. I believe the permissions errors are related to fsGroupChangePolicy: https://kubernetes.io/blog/2020/12/14/kubernetes-release-1.20-fsgroupchangepolicy-fsgrouppolicy/
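
For context, fsGroupChangePolicy lives in a Pod's securityContext. A generic Kubernetes sketch (this is not a field PGO exposes on the PostgresCluster spec; the pod and PVC names are placeholders):

apiVersion: v1
kind: Pod
metadata:
  name: fsgroup-demo                       # placeholder pod, for illustration only
spec:
  securityContext:
    fsGroup: 26                            # group that should own mounted volumes
    fsGroupChangePolicy: "OnRootMismatch"  # only recursively chown when the volume root doesn't already match
  containers:
    - name: app
      image: registry.access.redhat.com/ubi8/ubi-minimal
      command: ["sleep", "infinity"]
      volumeMounts:
        - name: data
          mountPath: /pgdata
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: example-pvc             # placeholder PVC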

yee379 avatar Feb 28 '22 23:02 yee379

I have met the same issue. @jkatz

➜ /root/go/src/github.com/CrunchyData/postgres-operator-examples ☞ git:(main) ✗ kubectl logs -n postgres-operator hippo-instance1-6cpp-0 postgres-startup
Initializing ...
::postgres-operator: uid::26
::postgres-operator: gid::26 65534
::postgres-operator: postgres path::/usr/pgsql-13/bin/postgres
::postgres-operator: postgres version::postgres (PostgreSQL) 13.5
::postgres-operator: config directory::/pgdata/pg13
::postgres-operator: data directory::/pgdata/pg13
install: cannot change permissions of ‘/pgdata/pg13’: No such file or directory

k8s version

➜ /root/go/src/github.com/CrunchyData/postgres-operator-examples ☞ git:(main) ✗ kubectl version
Client Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.0", GitCommit:"c2b5237ccd9c0f1d600d3072634ca66cefdf272f", GitTreeState:"clean", BuildDate:"2021-08-04T18:03:20Z", GoVersion:"go1.16.6", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.0", GitCommit:"c2b5237ccd9c0f1d600d3072634ca66cefdf272f", GitTreeState:"clean", BuildDate:"2021-08-04T17:57:25Z", GoVersion:"go1.16.6", Compiler:"gc", Platform:"linux/amd64"}

pv and pvc

🍺 /root/go/src/github.com/CrunchyData/postgres-operator-examples ☞ git:(main) ✗ kubectl get pv
NAME                     CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS      CLAIM                                           STORAGECLASS   REASON   AGE
pg-backup-volume1        10Gi       RWO            Retain           Bound       postgres-operator/hippo-repo1                                           3h32m
postgres-data-volume-1   10Gi       RWO            Retain           Bound       postgres-operator/hippo-instance1-6cpp-pgdata                           3h32m
🍺 /root/go/src/github.com/CrunchyData/postgres-operator-examples ☞ git:(main) ✗ kubectl get pvc -n postgres-operator
NAME                          STATUS   VOLUME                   CAPACITY   ACCESS MODES   STORAGECLASS   AGE
hippo-instance1-6cpp-pgdata   Bound    postgres-data-volume-1   10Gi       RWO                           5m
hippo-repo1                   Bound    pg-backup-volume1        10Gi       RWO                           5m

postgres cluster spec

apiVersion: postgres-operator.crunchydata.com/v1beta1
kind: PostgresCluster
metadata:
  name: hippo
spec:
  image: registry.developers.crunchydata.com/crunchydata/crunchy-postgres:centos8-13.5-0
  postgresVersion: 13 # help PGO track which major version of Postgres you are using
  openshift: false
  supplementalGroups:
    - 65534
  instances:
    - name: instance1
      # replicas: 1
      dataVolumeClaimSpec: # the storage that your Postgres instance will use
        volumeName: postgres-data-volume-1
        # storageClassName: manual
        # volumeMode: Retain
        accessModes:
        - "ReadWriteOnce"
        resources:
          requests:
            storage: 1Gi
  backups:
    pgbackrest:
      image: registry.developers.crunchydata.com/crunchydata/crunchy-pgbackrest:centos8-2.36-0
      repos:
      - name: repo1
        volume:
          volumeClaimSpec:
            volumeName: pg-backup-volume1
            # volumeMode: Retain
            # storageClassName: manual
            accessModes:
            - "ReadWriteOnce"
            resources:
              requests:
                storage: 1Gi

pv spec

➜ /root/go/src/github.com/CrunchyData/postgres-operator-examples ☞ git:(main) ✗ cat pg-backup-volume.yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pg-backup-volume1
spec:
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteOnce
  hostPath:
    path: "/var/postgres/backups"
➜ /root/go/src/github.com/CrunchyData/postgres-operator-examples ☞ git:(main) ✗ cat pg-data-volume.yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: postgres-data-volume-1
spec:
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteOnce
  hostPath:
    path: "/var/postgres/data"

microyahoo avatar Mar 03 '22 06:03 microyahoo

Hi there, I've got the same issue on OVH Public Cloud (the postgres-startup container failed with install: cannot change permissions of ‘/pgdata/pg14’: No such file or directory). This article (https://docs.ovh.com/sg/en/kubernetes/persistentvolumes-permission-errors/) solved my issue. I've recreated both storage classes with the additional fsType: ext4 parameter.

TLDR: Recreate the StorageClass with parameters.fsType: ext4
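
A sketch of what the recreated class can look like (the provisioner name cinder.csi.openstack.org is an assumption based on OVH's Cinder CSI driver; the class name is a placeholder):

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: csi-cinder-classic-ext4         # placeholder name for the recreated class
provisioner: cinder.csi.openstack.org   # assumed OVH/OpenStack Cinder CSI provisioner
parameters:
  fsType: ext4                          # the key addition from the OVH article
reclaimPolicy: Delete
allowVolumeExpansion: true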

stefan-kollmann-wogra avatar Mar 08 '22 11:03 stefan-kollmann-wogra

I am experiencing the same issue on my side on a MicroK8s installation. I have tried the openshift: false method and added a supplemental group ID just to be sure, but it's still not happy.

G-kodes avatar Apr 05 '22 07:04 G-kodes

@G-kodes What version of Kubernetes are you using (kubectl version), and what storage provider?

cbandy avatar Apr 06 '22 01:04 cbandy

I have this same issue on my minikube (3 nodes). It works, however, when I use a single-node installation of minikube.

ac5tin avatar Apr 13 '22 10:04 ac5tin

I am having the same issue with Google Cloud GKE volumes. Can we reopen the issue, @jkatz? It is still evidently a problem. I tried with openshift: false and supplemental groups.

edit

It seems that with Google Cloud you need to define your volume like this:

  gcePersistentDisk:
    pdName: x
    fsType: ext4

and not


  csi:
    driver: pd.csi.storage.gke.io
    volumeHandle: x

mhaddon avatar Apr 26 '22 10:04 mhaddon

I tried this on Azure AKS with the native storage classes. When I use the azurefile-csi storage class, this issue occurs. When I use the managed-csi storage class, it does not. This suggests the error occurs with file storage and not block storage. For database storage I would not use file storage anyway.
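
If it helps anyone on AKS, a sketch of pointing the instance data volume at the block-storage class (field names match the PostgresCluster examples earlier in this thread; sizes are placeholders):

spec:
  instances:
    - name: instance1
      dataVolumeClaimSpec:
        storageClassName: managed-csi   # AKS block storage, which worked here
        accessModes:
          - "ReadWriteOnce"
        resources:
          requests:
            storage: 5Gi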

digihunch avatar May 20 '22 02:05 digihunch

I stumbled upon this with Red Hat's CodeReady Containers when going through the examples:

$ kubectl apply -f kustomize/postgres                              
postgrescluster.postgres-operator.crunchydata.com/hippo created
error: error validating "kustomize/postgres/kustomization.yaml": error validating data: [apiVersion not set, kind not set]; if you choose to ignore these errors, turn validation off with --validate=false

The logs:

$ kubectl logs hippo-instance1-9f6m-0 postgres-startup 
Initializing ...
::postgres-operator: uid::26
::postgres-operator: gid::26
::postgres-operator: postgres path::/usr/pgsql-14/bin/postgres
::postgres-operator: postgres version::postgres (PostgreSQL) 14.3
::postgres-operator: config directory::/pgdata/pg14
::postgres-operator: data directory::/pgdata/pg14
install: cannot create directory ‘/pgdata’: Permission denied

My environment:

$ oc version
Client Version: 4.10.14
Server Version: 4.10.14
Kubernetes Version: v1.23.5+b463d71

The operator detects that it's running on OpenShift but doesn't seem to attach an appropriate SCC to the hippo-instance Role.
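
One way to check which SCC a pod was actually admitted under (a generic OpenShift check; the pod name is taken from the log snippet above):

oc get pod hippo-instance1-9f6m-0 -o yaml | grep 'openshift.io/scc'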

razvan avatar Jul 05 '22 14:07 razvan

Having a similar issue on OpenShift 4.10 on AWS - init failed with

spec:
  openshift: true

Then I created a database with

spec:
  openshift: false
  supplementalGroups:
    - 65534

The database starts and the liveness and readiness probes become healthy. But when I inspect the database container, I see the following logs there:


2022-08-01 05:08:06,145 ERROR: failed to update leader lock
2022-08-01 05:08:06,208 INFO: not promoting because failed to update leader lock in DCS
2022-08-01 05:08:16,595 INFO: Lock owner: sample-db-instance1-r44d-0; I am sample-db-instance1-r44d-0
2022-08-01 05:08:16,641 ERROR: Permission denied
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/patroni/dcs/kubernetes.py", line 914, in _update_leader_with_retry
    return self._patch_or_create(self.leader_path, annotations, resource_version, ips=ips, retry=_retry)
  File "/usr/local/lib/python3.6/site-packages/patroni/dcs/kubernetes.py", line 868, in _patch_or_create
    ret = retry(func, self._namespace, body) if retry else func(self._namespace, body)
  File "/usr/local/lib/python3.6/site-packages/patroni/dcs/kubernetes.py", line 911, in _retry
    return retry(*args, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/patroni/utils.py", line 334, in __call__
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/patroni/dcs/kubernetes.py", line 468, in wrapper
    return getattr(self._core_v1_api, func)(*args, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/patroni/dcs/kubernetes.py", line 404, in wrapper
    return self._api_client.call_api(method, path, headers, body, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/patroni/dcs/kubernetes.py", line 373, in call_api
    return self._handle_server_response(response, _preload_content)
  File "/usr/local/lib/python3.6/site-packages/patroni/dcs/kubernetes.py", line 203, in _handle_server_response
    raise k8s_client.rest.ApiException(http_resp=response)
patroni.dcs.kubernetes.K8sClient.rest.ApiException: (403)
Reason: Forbidden
HTTP response headers: HTTPHeaderDict({'Audit-Id': '61dd4a2d-7e64-498a-975d-af2125df6312', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'X-Kubernetes-Pf-Flowschema-Uid': '2597447c-7dbc-4f9e-9288-55aed3f00fe0', 'X-Kubernetes-Pf-Prioritylevel-Uid': '88170dfc-6bab-423c-a687-8e1f7c17a72a', 'Date': 'Mon, 01 Aug 2022 05:08:16 GMT', 'Content-Length': '253'})
HTTP response body: b'{"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"endpoints \\"sample-db-ha\\" is forbidden: endpoint address 10.131.31.247 is not allowed","reason":"Forbidden","details":{"name":"sample-db-ha","kind":"endpoints"},"code":403}\n'

Also, I don't see any SCC applied to the database pods (at least not the restricted SCC).

rumeshmadhusanka avatar Aug 01 '22 05:08 rumeshmadhusanka

Any updates in Dec 2022 on this, or any workarounds apart from downgrading the K8s version to 1.21.5-00? I just started learning the PGO operator and hit the same issue.

VIPULKAM avatar Dec 15 '22 03:12 VIPULKAM

Any updates in Dec 2022 on this

The logs of the postgres-startup container have improved and are coming in the next release. The issues here are about file permissions set by storage providers.

@VIPULKAM What version of Kubernetes are you using (kubectl version), and what storage provider?

cbandy avatar Dec 15 '22 14:12 cbandy

Thanks for reaching out. I am a new learner and this is a local setup. I have tried kubectl and kubelet versions 1.21.5, 1.23.12 and 1.25.4 so far.

Best regards


VIPULKAM avatar Dec 15 '22 15:12 VIPULKAM

I am experiencing similar problems in different environments. I cannot get a Postgres cluster operational at all, hence I cannot migrate off the v4.7.8 operator, which has been working reliably for some time. I find this situation pretty distressing, given that the CrunchyData operator is available as a subscription service in the OpenShift marketplace.

In most cases, we see the following error in the postgres-startup init container:

install: cannot change permissions of ‘/pgdata/pg14’: No such file or directory
stat: cannot statx '/pgdata/pg14': No such file or directory

If I use a shared-filesystem class of storage (i.e. an NFS volume provided by ontap-nas or cephfs), set spec.openshift: false, and grant the service account the hostmount-anyuid SCC, the init containers do run successfully, but the database will never enter a ready state because of Permission Denied errors in Patroni, as noted by @rumeshmadhusanka above.

| Component           | Product                                | Versions tested |
| ------------------- | -------------------------------------- | --------------- |
| Postgres Operator   | CrunchyData                            | 5.2, 5.3        |
| Kubernetes Platform | RedHat OpenShift                       | 4.8, 4.9        |
| Storage Provider    | NetApp Trident Astra                   | 22.10.0, 23.1.0 |
| Storage Provider    | OpenShift Container Storage            | 4.8             |
| Storage Driver      | ontap-san, ontap-nas, ceph-rdb, cephfs |                 |

| Setting                        | Value                        |
| ------------------------------ | ---------------------------- |
| PostgresCluster spec.openshift | true, false                  |
| ServiceAccount SCC             | restricted, hostmount-anyuid |

jgregmac avatar Feb 10 '23 19:02 jgregmac

In most cases, we see the following error in the postgres-startup init container:

install: cannot change permissions of ‘/pgdata/pg14’: No such file or directory
stat: cannot statx '/pgdata/pg14': No such file or directory

The line just after these shows what permissions the storage provider sets. Something like:

drwxr-xr-x    0    0 /pgdata

The first lines of the log show what process IDs OpenShift has assigned. Something like:

::postgres-operator: uid::26
::postgres-operator: gid::26

Everything should work well in OpenShift with the restricted (or restricted-v2) SCC and spec.openshift omitted. What are the complete logs in that case?

cbandy avatar Feb 10 '23 19:02 cbandy

In most cases, we see the following error in the postgres-startup init container:

install: cannot change permissions of ‘/pgdata/pg14’: No such file or directory
stat: cannot statx '/pgdata/pg14': No such file or directory

The line just after these shows what permissions the storage provider sets. Something like:

drwxr-xr-x    0    0 /pgdata

The first lines of the log show what process IDs OpenShift has assigned. Something like:

::postgres-operator: uid::26
::postgres-operator: gid::26

Everything should work well in OpenShift with the restricted (or restricted-v2) SCC and spec.openshift omitted. What are the complete logs in that case?

If I omit .spec.openshift entirely and do not set any SCC memberships (thus assuming restricted), this is what I get for logs:

Initializing ...
::postgres-operator: uid::26
::postgres-operator: gid::26
::postgres-operator: postgres path::/usr/pgsql-14/bin/postgres
::postgres-operator: postgres version::postgres (PostgreSQL) 14.6
::postgres-operator: config directory::/pgdata/pg14
::postgres-operator: data directory::/pgdata/pg14
install: cannot change permissions of ‘/pgdata/pg14’: No such file or directory
stat: cannot statx '/pgdata/pg14': No such file or directory
drwxr-xr-x    0    0 /pgdata

Here is the manifest, in case it is of interest:

apiVersion: postgres-operator.crunchydata.com/v1beta1
kind: PostgresCluster
metadata:
  name: test-cluster
  namespace: default
spec:
  backups:
    pgbackrest:
      configuration:
        - secret:
            name: MY-SECRET
      global:
        repo1-retention-full: "14"
        repo1-retention-full-type: time
      repos:
        - name: repo1
          s3:
            bucket: REDACTED
            endpoint: REDACTED
            region: REDACTED
          schedules:
            full: 0 9 * * *
  instances:
    - dataVolumeClaimSpec:
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: 5Gi
        storageClassName: netapp-san-del
      name: instance1
      replicas: 1
      walVolumeClaimSpec:
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: 10Gi
        storageClassName: netapp-san-del
  # openshift: true
  port: 5432
  postgresVersion: 14
  userInterface:
    pgAdmin:
      dataVolumeClaimSpec:
        accessModes:
          - ReadWriteOnce
        storageClassName: netapp-san-del
      replicas: 1
  users:
    - name: user1
      options: state1
      password:
        type: ASCII
    - databases:
        - state2
      name: user2
      password:
        type: ASCII

This is using PGO v5.3.0, OpenShift 4.9.54, and Trident v23.01.

I have had a look at the filesystem using a debug pod:
oc debug -n default -c postgres-startup --keep-init-containers=false --one-container=true [podNameHere]

We can see that the pgdata volume filesystem is owned by root:

$ ls -la
...
drwxr-xr-x.    3 root     root     4096 Feb 10 20:16 pgdata
drwxr-xr-x.    3 root     root     4096 Feb 10 20:16 pgwal

This root ownership issue is what made me think that I might be able to use the hostmount-anyuid SCC with spec.openshift: false, but that really did not help. I expect that using the privileged SCC with .spec.openshift: false would definitely get the job done, but that is clearly not the right way to go.

Previous comments have hinted that the root cause here is an "incorrect implementation" of standards by my CSI providers. That may be the case, but I have tried two different mainstream enterprise CSI providers, and provisioning fails with both. I will also note that I am not having similar problems with any of the other workloads we run in our cluster, and we run a fair number of other operators.
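
For anyone experimenting with SCC grants as described above, binding an SCC to the cluster's service account is typically done with something like this (the service account name is a placeholder; PGO creates one per PostgresCluster):

oc adm policy add-scc-to-user hostmount-anyuid -z <cluster-service-account> -n <namespace>

That said, as noted earlier in the thread, the restricted SCC should normally be sufficient when spec.openshift is omitted.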

jgregmac avatar Feb 10 '23 20:02 jgregmac

It looks like you are using the default namespace to deploy your PostgresCluster. Per the OpenShift docs, this namespace should not be used to run pods or services:

You cannot assign a SCC to pods created in one of the default namespaces: default, kube-system, kube-public, openshift-node, openshift-infra, openshift. These namespaces should not be used for running pods or services.

https://docs.openshift.com/container-platform/4.9/authentication/managing-security-context-constraints.html#role-based-access-to-ssc_configuring-internal-oauth

andrewlecuyer avatar Feb 10 '23 22:02 andrewlecuyer