postgres-operator
postgres-startup init fail
Overview
When I create a Postgres cluster, the instance crashes on initialization. The bug happens only on Kubernetes 1.22.
Environment
- Platform: Kubernetes on OVH Managed Kubernetes Service
- Platform Version: 1.22.2
- PGO Image Tag: ubi8-5.0.4-0
- Postgres Version: 13.5
- Storage: csi-cinder-classic
Steps to Reproduce
REPRO
- Install the postgres-operator example with kustomize
- Install the postgres hippo cluster defined in the example
EXPECTED
- The Postgres cluster starts correctly.
ACTUAL
The postgres-startup container fails to initialize the cluster:
Initializing ...
::postgres-operator: uid::26
::postgres-operator: gid::26
::postgres-operator: postgres path::/usr/pgsql-13/bin/postgres
::postgres-operator: postgres version::postgres (PostgreSQL) 13.5
::postgres-operator: config directory::/pgdata/pg13
::postgres-operator: data directory::/pgdata/pg13
install: cannot change permissions of ‘/pgdata/pg13’: No such file or directory
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 40s default-scheduler 0/1 nodes are available: 1 pod has unbound immediate PersistentVolumeClaims.
Normal Scheduled 38s default-scheduler Successfully assigned postgres-operator/hippo-instance1-mzs4-0 to node-b4422cc5-9dc5-4116-a4d5-a194f6171309
Normal SuccessfulAttachVolume 35s attachdetach-controller AttachVolume.Attach succeeded for volume "ovh-managed-kubernetes-uo1sfr-pvc-729b89ad-c0dd-48f9-847b-36e36d8cc8e6"
Normal Pulled 30s kubelet Successfully pulled image "registry.developers.crunchydata.com/crunchydata/crunchy-postgres:centos8-13.5-0" in 1.260460237s
Normal Pulled 28s kubelet Successfully pulled image "registry.developers.crunchydata.com/crunchydata/crunchy-postgres:centos8-13.5-0" in 1.232337581s
Normal Pulling 15s (x3 over 32s) kubelet Pulling image "registry.developers.crunchydata.com/crunchydata/crunchy-postgres:centos8-13.5-0"
Normal Created 14s (x3 over 30s) kubelet Created container postgres-startup
Normal Started 14s (x3 over 30s) kubelet Started container postgres-startup
Normal Pulled 14s kubelet Successfully pulled image "registry.developers.crunchydata.com/crunchydata/crunchy-postgres:centos8-13.5-0" in 1.226427728s
Warning BackOff 13s (x3 over 27s) kubelet Back-off restarting failed container
Does your storage class require you to set up any supplemental groups, e.g.:
https://access.crunchydata.com/documentation/postgres-operator/v5/references/crd/#postgresclusterspec
There are some more notes about supplemental groups in the 5.0.3 release notes:
https://access.crunchydata.com/documentation/postgres-operator/v5/releases/5.0.3/
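For reference, a minimal sketch of that setting in a PostgresCluster manifest (65534, i.e. the "nobody" group, is the value later comments in this thread also try):

spec:
  supplementalGroups:
    - 65534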
Just wanted to mention that I'm hitting the same issue, on OVH Managed Kubernetes as well.
Environment
- Platform: Kubernetes on OVH Managed Kubernetes Service
- Platform Version: 1.22.2
- PGO Image Tag: ubi8-5.0.1-0
- Postgres Version: 13.3
- Storage: csi-cinder-high-speed
Logs (from postgres-startup container)
[postgres-startup] Initializing ...
[postgres-startup] ::postgres-operator: uid::26
[postgres-startup] ::postgres-operator: gid::26 65534
[postgres-startup] ::postgres-operator: postgres path::/usr/pgsql-13/bin/postgres
[postgres-startup] ::postgres-operator: postgres version::postgres (PostgreSQL) 13.3
[postgres-startup] ::postgres-operator: config directory::/pgdata/pg13
[postgres-startup] ::postgres-operator: data directory::/pgdata/pg13
[postgres-startup] install: cannot change permissions of ‘/pgdata/pg13’: No such file or directory
I have been trying to resolve this issue for about an hour now, without much progress. (I am not very experienced with the details of the postgres-operator, or even Kubernetes in general, but am trying to learn.)
Note that I just (earlier this night) updated my OVH Kubernetes cluster from 1.20 to 1.22, so as the OP said, this is likely the cause of the issue. (though I can't rule out that there are other contributing causes/changes, as I started the reset+update to try to get around a persistent-volume not-mounting issue I was having).
Okay, I reset my cluster and changed its version to 1.21 (1.21.5-0), and the error above is no longer being hit:
[postgres-startup] Initializing ...
[postgres-startup] ::postgres-operator: uid::26
[postgres-startup] ::postgres-operator: gid::26 65534
[postgres-startup] ::postgres-operator: postgres path::/usr/pgsql-13/bin/postgres
[postgres-startup] ::postgres-operator: postgres version::postgres (PostgreSQL) 13.3
[postgres-startup] ::postgres-operator: config directory::/pgdata/pg13
[postgres-startup] ::postgres-operator: data directory::/pgdata/pg13
[K8s EVENT: Pod debate-map-repo-host-0 (ns: postgres-operator)] Successfully pulled image "registry.developers.crunchydata.com/crunchydata/crunchy-pgbackrest:centos8-2.33-1" in 37.671615103s
[K8s EVENT: Pod debate-map-repo-host-0 (ns: postgres-operator)] Pulling image "registry.developers.crunchydata.com/crunchydata/crunchy-pgbackrest:centos8-2.33-1"
[nss-wrapper-init] nss_wrapper: user exists
[nss-wrapper-init] nss_wrapper: group exists
[nss-wrapper-init] nss_wrapper: environment configured
[K8s EVENT: Pod debate-map-instance1-hfj5-0 (ns: postgres-operator)] Successfully pulled image "registry.developers.crunchydata.com/crunchydata/crunchy-postgres-ha:centos8-13.3-1" in 20.861051117s
[K8s EVENT: Pod debate-map-repo-host-0 (ns: postgres-operator)] Successfully pulled image "registry.developers.crunchydata.com/crunchydata/crunchy-pgbackrest:centos8-2.33-1" in 7.632752328s
[K8s EVENT: Pod debate-map-instance1-hfj5-0 (ns: postgres-operator)] Pulling image "registry.developers.crunchydata.com/crunchydata/crunchy-postgres-ha:centos8-13.3-1"
Server listening on 0.0.0.0 port 2022.
Server listening on :: port 2022.
[...]
~So something about version 1.22 of Kubernetes is indeed causing the issue.~ (EDIT: Apparently it's more complicated than just this version change, as v1.21 also has issues [on a different cloud provider] for some of the commenters below.)
I spent a good 15 minutes searching through the v1.22+ changelog, but did not notice anything likely to be the cause of the issue. Perhaps there is something that OVH changed internally about their provisioning for v1.22 nodes, beyond the changes to Kubernetes itself?
Anyway, if others are having the same issue (and are not using OVH Managed Kubernetes), please comment here so we can narrow down the source of the problem.
In the Operator log on startup (the pgo-... Pod), could you see if there are any lines that indicate what spec.openshift is set to?
I'm fairly confident this is similar to an issue that prompted #2897 and should be fixed in that.
The interim solution is to set the following in your PostgresCluster spec:
spec:
openshift: false
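For an existing cluster, the same field can be patched in place; a sketch, assuming the example cluster name hippo in the postgres-operator namespace:

kubectl patch postgrescluster hippo -n postgres-operator --type merge \
  -p '{"spec":{"openshift":false}}'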
I just wanted to mention that I'm hitting the same issue on Linode Managed Kubernetes service (LKE) as well, on both Kubernetes versions 1.21.1 and 1.22.2.
Environment
- Platform: Kubernetes on Linode Managed Kubernetes service (LKE)
- Platform Version: 1.21.1 and 1.22.2
- PGO Image Tag: ubi8-5.0.3-0
- Postgres Version: 13.4
- Storage: csi-cinder-high-speed
Logs (from postgres-startup container)
Initializing ...
::postgres-operator: uid::26
::postgres-operator: gid::26
::postgres-operator: postgres path::/usr/pgsql-13/bin/postgres
::postgres-operator: postgres version::postgres (PostgreSQL) 13.4
::postgres-operator: config directory::/pgdata/pg13
::postgres-operator: data directory::/pgdata/pg13
install: cannot change permissions of ‘/pgdata/pg13’: No such file or directory
I tried @jkatz's suggestions, configuring the PostgresCluster like this:
spec:
openshift: false
and then
spec:
supplementalGroups:
- 65534
and then with both
spec:
openshift: false
supplementalGroups:
- 65534
But nothing worked. Any idea?
I'm in the same situation as @Leedwing here: LKE (I tested 1.21 and 1.22) using both specifications (openshift and/or supplementalGroups), but neither worked.
Update
I deployed the postgres-operator-examples on DigitalOcean Kubernetes Service v1.20.11 and v1.21.1
Environment
- Platform: DigitalOcean Kubernetes service
- Platform Version: 1.20.11 and 1.21.5
- PGO Image Tag: ubi8-5.0.3-0
- Postgres Version: 13.4
- Storage: csi-cinder-high-speed
AND EVERYTHING WORKED FINE!
I would say that on LKE the issue might be caused by some internal behavior, probably related to permissions on block storage, because the monitoring components shipped with postgres-operator-examples that use block storage also cannot start on LKE, due to the same kind of error: permission denied
- Grafana
GF_PATHS_DATA='/data/grafana/data' is not writable.
You may have issues with file permissions, more information here: http://docs.grafana.org/installation/docker/#migrate-to-v51-or-later
...
...
service init failed: failed to connect to database: mkdir /data/grafana: permission denied
- Prometheus
level=error ts=2021-12-27T19:36:47.147Z caller=query_logger.go:87 component=activeQueryTracker msg="Error opening query log file" file=/prometheus/queries.active err="open /prometheus/queries.active: permission denied"
panic: Unable to create mmap-ed active query log
- Alertmanager
level=error ts=2021-12-27T19:41:09.010Z caller=nflog.go:365 component=nflog msg="Running maintenance failed" err="open /alertmanager/nflog.73fec05415888652: permission denied"
level=info ts=2021-12-27T19:41:09.011Z caller=silence.go:389 component=silences msg="Running maintenance failed" err="open /alertmanager/silences.195c39cdee14e24: permission denied"
I installed it a few weeks ago on LKE (1.21) and it was working (and I was surprised how easy it was), so I guess the problem started due to some changes in Postgres Operator.
@jisaitua I also got it working a few weeks ago on LKE (1.21). It was also working on Kubernetes 1.21.5 shipped with Docker Desktop locally, and it still is. I've checked Linode's changelog (https://developers-linode.netlify.app/changelog/cloud-manager/) and noticed that they made an update related to block storage (I guess this might be the cause of the permission issue). Another reason I suspect the change is on Linode's side is that I now get CNAME records created by the Kubernetes External DNS instead of A records, as it was a few weeks ago. And with the same scripts, everything works as expected on DigitalOcean.
PS: I'm using the same postgres-operator-examples as a few weeks ago; I didn't pull the new changes.
@Leedwing Good point. If that's the case, then the problem looks like a Linode problem. Maybe @jkatz can help to find some way to solve it.
Good point. If that's the case, then the problem looks like a Linode problem.
Linode + OVH Cloud, yes; perhaps it's a common error (or complication for postgres-operator) that the managed-kubernetes providers are making, in response to some upcoming Kubernetes change or something.
Linode support responded to me, saying: "I agree that this appears to be related to a change on our end. To be precise, our CSI Driver had an update that can cause permissions issues with mounting Volumes for certain deployments. We've seen this is especially prevalent with PostgreSQL deployments."
They proposed a workaround - "Run the command below and then redeploy your workloads"
-p='[{"op": "add", "path": "/spec/template/spec/containers/0/args/-", "value": "--default-fstype=ext4"}]'
This worked for me and I hope it will help someone here.
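For context, the prefix of that command is cut off above; it is presumably a kubectl patch against the Linode CSI controller workload. A hypothetical reconstruction only (the workload kind and name here are assumptions; verify them with kubectl get all -n kube-system first):

# Hypothetical reconstruction; the JSON patch body is verbatim from the comment above.
kubectl patch statefulset csi-linode-controller -n kube-system --type='json' \
  -p='[{"op": "add", "path": "/spec/template/spec/containers/0/args/-", "value": "--default-fstype=ext4"}]'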
I also have this problem with an on-prem kubeadm 1.23 deployment. I believe the permissions errors are related to fsGroupChangePolicy: https://kubernetes.io/blog/2020/12/14/kubernetes-release-1.20-fsgroupchangepolicy-fsgrouppolicy/
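For illustration, here is a generic sketch of the pod-level field that post describes (this is plain Kubernetes, not a PGO setting; every name below is a placeholder):

apiVersion: v1
kind: Pod
metadata:
  name: example                          # placeholder pod
spec:
  securityContext:
    fsGroup: 26
    fsGroupChangePolicy: OnRootMismatch  # only chown/chmod the volume when root ownership mismatches
  containers:
    - name: app
      image: example.registry/app:latest # placeholder image
      volumeMounts:
        - name: data
          mountPath: /pgdata
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: example-pvc           # placeholder claim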
I have hit the same issue. @jkatz
➜ /root/go/src/github.com/CrunchyData/postgres-operator-examples ☞ git:(main) ✗ kubectl logs -n postgres-operator hippo-instance1-6cpp-0 postgres-startup
Initializing ...
::postgres-operator: uid::26
::postgres-operator: gid::26 65534
::postgres-operator: postgres path::/usr/pgsql-13/bin/postgres
::postgres-operator: postgres version::postgres (PostgreSQL) 13.5
::postgres-operator: config directory::/pgdata/pg13
::postgres-operator: data directory::/pgdata/pg13
install: cannot change permissions of ‘/pgdata/pg13’: No such file or directory
k8s version
➜ /root/go/src/github.com/CrunchyData/postgres-operator-examples ☞ git:(main) ✗ kubectl version
Client Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.0", GitCommit:"c2b5237ccd9c0f1d600d3072634ca66cefdf272f", GitTreeState:"clean", BuildDate:"2021-08-04T18:03:20Z", GoVersion:"go1.16.6", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.0", GitCommit:"c2b5237ccd9c0f1d600d3072634ca66cefdf272f", GitTreeState:"clean", BuildDate:"2021-08-04T17:57:25Z", GoVersion:"go1.16.6", Compiler:"gc", Platform:"linux/amd64"}
pv and pvc
🍺 /root/go/src/github.com/CrunchyData/postgres-operator-examples ☞ git:(main) ✗ kubectl get pv
NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE
pg-backup-volume1 10Gi RWO Retain Bound postgres-operator/hippo-repo1 3h32m
postgres-data-volume-1 10Gi RWO Retain Bound postgres-operator/hippo-instance1-6cpp-pgdata 3h32m
🍺 /root/go/src/github.com/CrunchyData/postgres-operator-examples ☞ git:(main) ✗ kubectl get pvc -n postgres-operator
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
hippo-instance1-6cpp-pgdata Bound postgres-data-volume-1 10Gi RWO 5m
hippo-repo1 Bound pg-backup-volume1 10Gi RWO 5m
postgres cluster spec
apiVersion: postgres-operator.crunchydata.com/v1beta1
kind: PostgresCluster
metadata:
name: hippo
spec:
image: registry.developers.crunchydata.com/crunchydata/crunchy-postgres:centos8-13.5-0
postgresVersion: 13 # help PGO track which major version of Postgres you are using
openshift: false
supplementalGroups:
- 65534
instances:
- name: instance1
# replicas: 1
dataVolumeClaimSpec: # the storage that your Postgres instance will use
volumeName: postgres-data-volume-1
# storageClassName: manual
# volumeMode: Retain
accessModes:
- "ReadWriteOnce"
resources:
requests:
storage: 1Gi
backups:
pgbackrest:
image: registry.developers.crunchydata.com/crunchydata/crunchy-pgbackrest:centos8-2.36-0
repos:
- name: repo1
volume:
volumeClaimSpec:
volumeName: pg-backup-volume1
# volumeMode: Retain
# storageClassName: manual
accessModes:
- "ReadWriteOnce"
resources:
requests:
storage: 1Gi
pv spec
➜ /root/go/src/github.com/CrunchyData/postgres-operator-examples ☞ git:(main) ✗ cat pg-backup-volume.yaml
apiVersion: v1
kind: PersistentVolume
metadata:
name: pg-backup-volume1
spec:
capacity:
storage: 10Gi
accessModes:
- ReadWriteOnce
hostPath:
path: "/var/postgres/backups"
➜ /root/go/src/github.com/CrunchyData/postgres-operator-examples ☞ git:(main) ✗ cat pg-data-volume.yaml
apiVersion: v1
kind: PersistentVolume
metadata:
name: postgres-data-volume-1
spec:
capacity:
storage: 10Gi
accessModes:
- ReadWriteOnce
hostPath:
path: "/var/postgres/data"
Hi there,
I've got the same issue on OVH Public Cloud (the postgres-startup container failed with install: cannot change permissions of ‘/pgdata/pg14’: No such file or directory). This article (https://docs.ovh.com/sg/en/kubernetes/persistentvolumes-permission-errors/) solved my issue: I've recreated both storage classes with the additional fsType: ext4 parameter.
TLDR: Recreate the StorageClass with parameters.fsType: ext4
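Following that article, the recreated class would look roughly like this (a sketch: the provisioner is the standard OpenStack Cinder CSI driver name, but copy it, along with any other fields, from your existing class via kubectl get storageclass csi-cinder-classic -o yaml before deleting it):

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: csi-cinder-classic
provisioner: cinder.csi.openstack.org   # assumption: take this from the original class
parameters:
  fsType: ext4                          # the parameter the OVH article adds
reclaimPolicy: Delete
allowVolumeExpansion: true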
I am experiencing the same issue on my side, on a MicroK8s installation. I have tried the openshift: false method and added in a supplemental group ID just to be sure, but it's still not happy.
@G-kodes What version of Kubernetes are you using (kubectl version) and what storage provider?
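For anyone collecting those details, the usual commands are:

kubectl version           # client and server versions
kubectl get storageclass  # lists storage classes and their provisioners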
I have this same issue on my minikube (3 nodes). It works, however, when I use a single-node installation of minikube.
I am having the same issue with Google Cloud GKE volumes. Can we reopen the issue, @jkatz? It is still evidently a problem. I tried with openshift: false and supplemental groups.
Edit: it seems that with GKE you need to define your volume like this:
gcePersistentDisk:
pdName: x
fsType: ext4
and not
csi:
driver: pd.csi.storage.gke.io
volumeHandle: x
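For context, a fuller sketch of the working variant (capacity and names are placeholders):

apiVersion: v1
kind: PersistentVolume
metadata:
  name: postgres-data       # placeholder
spec:
  capacity:
    storage: 10Gi           # placeholder
  accessModes:
    - ReadWriteOnce
  gcePersistentDisk:
    pdName: x               # name of a pre-created GCE persistent disk
    fsType: ext4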
Tried this on Azure AKS with native storage classes. When I use the azurefile-csi storage class, this issue occurs. When I use the managed-csi storage class, it does not. This suggests that the error occurs with file storage and not block storage. For database storage I would not use file storage anyway.
I stumbled upon this with RedHat's Code Ready Containers when going through the examples:
$ kubectl apply -f kustomize/postgres
postgrescluster.postgres-operator.crunchydata.com/hippo created
error: error validating "kustomize/postgres/kustomization.yaml": error validating data: [apiVersion not set, kind not set]; if you choose to ignore these errors, turn validation off with --validate=false
The logs:
$ kubectl logs hippo-instance1-9f6m-0 postgres-startup
Initializing ...
::postgres-operator: uid::26
::postgres-operator: gid::26
::postgres-operator: postgres path::/usr/pgsql-14/bin/postgres
::postgres-operator: postgres version::postgres (PostgreSQL) 14.3
::postgres-operator: config directory::/pgdata/pg14
::postgres-operator: data directory::/pgdata/pg14
install: cannot create directory ‘/pgdata’: Permission denied
My environment:
$ oc version
Client Version: 4.10.14
Server Version: 4.10.14
Kubernetes Version: v1.23.5+b463d71
The operator detects that it's running on OpenShift but doesn't seem to attach an appropriate SCC to the hippo-instance Role.
Having a similar issue on OpenShift 4.10 on AWS. Init failed with:
spec:
  openshift: true
Then I created a database with:
spec:
  openshift: false
  supplementalGroups:
    - 65534
The database starts and the liveness and readiness probes become healthy. But when I inspect the database container, I see the following logs:
2022-08-01 05:08:06,145 ERROR: failed to update leader lock
2022-08-01 05:08:06,208 INFO: not promoting because failed to update leader lock in DCS
2022-08-01 05:08:16,595 INFO: Lock owner: sample-db-instance1-r44d-0; I am sample-db-instance1-r44d-0
2022-08-01 05:08:16,641 ERROR: Permission denied
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/patroni/dcs/kubernetes.py", line 914, in _update_leader_with_retry
    return self._patch_or_create(self.leader_path, annotations, resource_version, ips=ips, retry=_retry)
  File "/usr/local/lib/python3.6/site-packages/patroni/dcs/kubernetes.py", line 868, in _patch_or_create
    ret = retry(func, self._namespace, body) if retry else func(self._namespace, body)
  File "/usr/local/lib/python3.6/site-packages/patroni/dcs/kubernetes.py", line 911, in _retry
    return retry(*args, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/patroni/utils.py", line 334, in __call__
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/patroni/dcs/kubernetes.py", line 468, in wrapper
    return getattr(self._core_v1_api, func)(*args, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/patroni/dcs/kubernetes.py", line 404, in wrapper
    return self._api_client.call_api(method, path, headers, body, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/patroni/dcs/kubernetes.py", line 373, in call_api
    return self._handle_server_response(response, _preload_content)
  File "/usr/local/lib/python3.6/site-packages/patroni/dcs/kubernetes.py", line 203, in _handle_server_response
    raise k8s_client.rest.ApiException(http_resp=response)
patroni.dcs.kubernetes.K8sClient.rest.ApiException: (403)
Reason: Forbidden
HTTP response headers: HTTPHeaderDict({'Audit-Id': '61dd4a2d-7e64-498a-975d-af2125df6312', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'X-Kubernetes-Pf-Flowschema-Uid': '2597447c-7dbc-4f9e-9288-55aed3f00fe0', 'X-Kubernetes-Pf-Prioritylevel-Uid': '88170dfc-6bab-423c-a687-8e1f7c17a72a', 'Date': 'Mon, 01 Aug 2022 05:08:16 GMT', 'Content-Length': '253'})
HTTP response body: b'{"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"endpoints \\"sample-db-ha\\" is forbidden: endpoint address 10.131.31.247 is not allowed","reason":"Forbidden","details":{"name":"sample-db-ha","kind":"endpoints"},"code":403}\n'
Also, I don't see any SCC applied to the database pods (at least not the restricted SCC).
Any updates in Dec 2022 on this, or any workarounds apart from downgrading the K8s version to 1.21.5-00? I just started learning the PGO operator and hit the same issue.
Any updates in Dec 2022 on this
The logs of the postgres-startup container have improved and are coming in the next release. The issues here are about file permissions set by storage providers.
@VIPULKAM What version of Kubernetes are you using (kubectl version) and what storage provider?
Thanks for reaching out. I am a new learner and this is a local setup. I have tried kubectl and kubelet versions 1.21.5, 1.23.12, and 1.25.4 so far.
I am experiencing similar problems, different environments. I cannot get a Postgres cluster operational at all, hence I cannot migrate off of the v4.7.8 operator which has been working reliably for some time. I actually find this situation pretty distressing given that the CrunchyData operator is available as a subscription service in the OpenShift marketplace.
In most cases, we see the following error in the postgres-startup init container:
install: cannot change permissions of ‘/pgdata/pg14’: No such file or directory
stat: cannot statx '/pgdata/pg14': No such file or directory
If I use a shared-filesystem class of storage (i.e., an NFS volume provided by ontap-nas or cephfs), set spec.openshift: false, and grant the service account the hostmount-anyuid SCC, the init containers do run successfully, but the database will never enter a ready state because of Permission Denied errors in Patroni, as noted by @rumeshmadhusanka above.
| Component | Product | Versions tested |
|---|---|---|
| Postgres Operator | CrunchyData | 5.2, 5.3 |
| Kubernetes Platform | RedHat OpenShift | 4.8, 4.9 |
| Storage Provider | NetApp Trident Astra | 22.10.0, 23.1.0 |
| Storage Provider | OpenShift Container Storage | 4.8 |
| Storage Driver | ontap-san, ontap-nas, ceph-rdb, cephfs | |

| Setting | Value |
|---|---|
| PostgresCluster spec.openshift | true, false |
| ServiceAccount SCC | restricted, hostmount-anyuid |
In most cases, we see the following error in the postgres-startup init container:
install: cannot change permissions of ‘/pgdata/pg14’: No such file or directory
stat: cannot statx '/pgdata/pg14': No such file or directory
The line just after these shows what permissions the storage provider sets. Something like:
drwxr-xr-x 0 0 /pgdata
The first lines of the log show what process IDs OpenShift has assigned. Something like:
::postgres-operator: uid::26
::postgres-operator: gid::26
Everything should work well in OpenShift with the restricted (or restricted-v2) SCC and spec.openshift omitted. What are the complete logs in that case?
Everything should work well in OpenShift with the restricted (or restricted-v2) SCC and spec.openshift omitted. What are the complete logs in that case?
If I omit .spec.openshift entirely and do not set any SCC memberships (thus assuming restricted), this is what I get for logs:
Initializing ...
::postgres-operator: uid::26
::postgres-operator: gid::26
::postgres-operator: postgres path::/usr/pgsql-14/bin/postgres
::postgres-operator: postgres version::postgres (PostgreSQL) 14.6
::postgres-operator: config directory::/pgdata/pg14
::postgres-operator: data directory::/pgdata/pg14
install: cannot change permissions of ‘/pgdata/pg14’: No such file or directory
stat: cannot statx '/pgdata/pg14': No such file or directory
drwxr-xr-x 0 0 /pgdata
Here is the manifest, in case it is of interest:
apiVersion: postgres-operator.crunchydata.com/v1beta1
kind: PostgresCluster
metadata:
name: test-cluster
namespace: default
spec:
backups:
pgbackrest:
configuration:
- secret:
name: MY-SECRET
global:
repo1-retention-full: "14"
repo1-retention-full-type: time
repos:
- name: repo1
s3:
bucket: REDACTED
endpoint: REDACTED
region: REDACTED
schedules:
full: 0 9 * * *
instances:
- dataVolumeClaimSpec:
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 5Gi
storageClassName: netapp-san-del
name: instance1
replicas: 1
walVolumeClaimSpec:
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 10Gi
storageClassName: netapp-san-del
# openshift: true
port: 5432
postgresVersion: 14
userInterface:
pgAdmin:
dataVolumeClaimSpec:
accessModes:
- ReadWriteOnce
storageClassName: netapp-san-del
replicas: 1
users:
- name: user1
options: state1
password:
type: ASCII
- databases:
- state2
name: user2
password:
type: ASCII
This is using PGO v5.3.0, OpenShift 4.9.54, and Trident v23.01.
I have had a look at the filesystem using a debug pod:
oc debug -n default -c postgres-startup --keep-init-containers=false --one-container=true [podNameHere]
We can see that the pgdata volume filesystem is owned by root:
$ ls -la
...
drwxr-xr-x. 3 root root 4096 Feb 10 20:16 pgdata
drwxr-xr-x. 3 root root 4096 Feb 10 20:16 pgwal
This root ownership issue is what made me think that I might be able to use the hostmount-anyuid SCC with spec.openshift: false, but that really was no help. I expect that using the privileged SCC with .spec.openshift: false definitely would get the job done, but that clearly is not the right way to go.
Previous comments have hinted that the root cause here is an "incorrect implementation" of standards by my CSI providers. That may be the case, but I have tried using two different mainstream enterprise CSI providers, and provisioning fails with both. I also will note that I am not having similar problems with any of the other workloads that we run in our cluster, and we run a fair number of other operators.
It looks like you are using the default namespace to deploy your PostgresCluster. Per the OpenShift docs, this namespace should not be used to run pods or services:
You cannot assign a SCC to pods created in one of the default namespaces: default, kube-system, kube-public, openshift-node, openshift-infra, openshift. These namespaces should not be used for running pods or services.
https://docs.openshift.com/container-platform/4.9/authentication/managing-security-context-constraints.html#role-based-access-to-ssc_configuring-internal-oauth
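A minimal sketch of the fix this suggests (the namespace and file names are placeholders):

oc new-project postgres-test        # or: kubectl create namespace postgres-test
# then set metadata.namespace: postgres-test in the PostgresCluster manifest and re-apply:
kubectl apply -f postgres-cluster.yaml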