postgres-operator
Permission issue with tls cert after upgrading to 1.5
After upgrading 1.4 -> 1.5 my cluster couldn't initialize.
I've checked that the certs are mounted into the container, and I'm able to read them as the root user. Not sure where to look next.
Logs from cluster pods:
```
...
selecting default max_connections ... 100
selecting default shared_buffers ... 128MB
selecting default time zone ... Etc/UTC
creating configuration files ... ok
running bootstrap script ... ok
performing post-bootstrap initialization ... ok
syncing data to disk ... ok

Success. You can now start the database server using:

    /usr/lib/postgresql/12/bin/pg_ctl -D /home/postgres/pgdata/pgroot/data -l logfile start

2020-05-20 14:40:25 UTC [301]: [1-1] 5ec54159.12d 0 FATAL: could not load server certificate file "/tls/tls.crt": Permission denied
2020-05-20 14:40:25 UTC [301]: [2-1] 5ec54159.12d 0 LOG: database system is shut down
2020-05-20 14:40:25,108 INFO: postmaster pid=301
/var/run/postgresql:5432 - no response
2020-05-20 14:40:25,121 INFO: removing initialize key after failed attempt to bootstrap the cluster
2020-05-20 14:40:25,137 INFO: renaming data directory to /home/postgres/pgdata/pgroot/data_2020-05-20-14-40-25
2020-05-20 14:40:25,587 INFO: Lock owner: None; I am grafana-cluster-0
Traceback (most recent call last):
  File "/usr/local/bin/patroni", line 11, in <module>
    load_entry_point('patroni==1.6.5', 'console_scripts', 'patroni')()
  File "/usr/local/lib/python3.6/dist-packages/patroni/__init__.py", line 235, in main
    return patroni_main()
  File "/usr/local/lib/python3.6/dist-packages/patroni/__init__.py", line 199, in patroni_main
    patroni.run()
  File "/usr/local/lib/python3.6/dist-packages/patroni/__init__.py", line 135, in run
    logger.info(self.ha.run_cycle())
  File "/usr/local/lib/python3.6/dist-packages/patroni/ha.py", line 1370, in run_cycle
    info = self._run_cycle()
  File "/usr/local/lib/python3.6/dist-packages/patroni/ha.py", line 1277, in _run_cycle
    return self.post_bootstrap()
  File "/usr/local/lib/python3.6/dist-packages/patroni/ha.py", line 1173, in post_bootstrap
    self.cancel_initialization()
  File "/usr/local/lib/python3.6/dist-packages/patroni/ha.py", line 1168, in cancel_initialization
    raise PatroniException('Failed to bootstrap cluster')
patroni.exceptions.PatroniException: 'Failed to bootstrap cluster'
/run/service/patroni: finished with code=1 signal=0
/run/service/patroni: exceeded maximum number of restarts 5
stopping /run/service/patroni
timeout: finish: .: (pid 303) 10s, want down
```
Permissions in container:
```
root@grafana-cluster-0:/home/postgres# ls -la /tls
total 4
drwxrwxrwt 3 root root  120 May 20 15:16 .
drwxr-xr-x 1 root root 4096 May 20 15:17 ..
drwxr-xr-x 2 root root   80 May 20 15:16 ..2020_05_20_15_16_36.928412376
lrwxrwxrwx 1 root root   31 May 20 15:16 ..data -> ..2020_05_20_15_16_36.928412376
lrwxrwxrwx 1 root root   14 May 20 15:16 tls.crt -> ..data/tls.crt
lrwxrwxrwx 1 root root   14 May 20 15:16 tls.key -> ..data/tls.key
```
@nmlc did you upgrade by editing the deployment? Are you also using the latest Spilo image? Have you updated the cluster roles, too?
I've upgraded via the chart by simply changing the version. So yes, I use the latest Spilo image and I've updated the cluster roles, too.
Identical issue. Installed from scratch (1.5.0) and getting exactly the same permission issue. Permissions seem to be set correctly and the certificate files can be read by both `root` and `postgres`. Any ideas?

```
operator: registry.opensource.zalan.do/acid/postgres-operator:v1.5.0
spilo: registry.opensource.zalan.do/acid/spilo-12:1.6-p3
```

EDIT: apparently by setting `spilo_fsgroup: "103"` the keys can now be read.
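The `spilo_fsgroup` key above is an operator-level setting; expressed as a ConfigMap fragment it might look like this (the ConfigMap name is assumed to match the default operator deployment, and note this applies the fsGroup to every cluster the operator manages):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: postgres-operator
data:
  # 103 is the gid of the "postgres" group inside the Spilo image
  spilo_fsgroup: "103"
```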
Hey! Just came here to file this exact issue! We upgraded our operators to 1.5, worked through a bug in the Helm chart regarding quotes around the connection-pooler integer settings, then started applying our Let's Encrypt certs and ran into this exact problem: by default, the TLS certs are mounted with incorrect permissions.

- **solution:** Adding `spiloFSGroup: 103` to our manifests mounts the cert-manager secret with the appropriate permissions, and postgres can then start instead of being stuck in a boot loop erroring out on the TLS permissions.
- **here be dragons:** Separate bug: adding `spiloFSGroup` to the manifest alone does not trigger a recreation of the statefulset; we have to change an unrelated property to get the operator to recreate the statefulset with the proper `securityContext`.
- **cascading failure:** In addition, adding the certs and the security context to the manifest at the same time doesn't work either. I *think* it's an order-of-operations issue: the operator doesn't watch the `spiloFSGroup` property, so the statefulset isn't updated, but it DOES add the new secret to the pods, so the rolling upgrade starts and then errors out because the first pod won't complete the upgrade. At this point it's easiest to roll back the TLS changes, poke at the operator until it upgrades the sts, and THEN add TLS once you've verified that the statefulset and the pods have the proper security context.
https://github.com/zalando/postgres-operator/pull/704 This PR sorta fixes the issue, but was closed because there are lots of considerations around file permissions. I think this is worth looking into, as TLS is important. Getting some sort of guide around integrating letsencrypt and certmanager, and having the operator be aware when a cert was updated and automatically do cluster failovers, would be great too.
This is the expected output of a properly configured statefulset; if you don't see the bottom `securityContext` with the fsGroup, the sts isn't configured correctly and adding the TLS cert will fail.
```
$ kubectl -n pgtest get sts -o yaml | grep -C 2 security
        cpu: 500m
        memory: 512Mi
      securityContext:
        privileged: false
        procMount: Default
--
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext:
        fsGroup: 103
      serviceAccount: pgtest-postgres-pod
```
Can confirm that adding `spiloFSGroup` AND TLS to the postgres manifest at the same time doesn't work, but adding the `spiloFSGroup` and then, a short while later, adding the TLS portion DOES work. So when both changes are present in the manifest, something gets missed.
Side note: we are also applying cipher configuration based on the Mozilla TLS generator. https://ssl-config.mozilla.org/#server=postgresql&version=11&config=intermediate&openssl=1.1.1d&guideline=5.4
it looks like this in the manifest:
```yaml
spec:
  teamId: "pgtest"
  enableLogicalBackup: true
  enableReplicaLoadBalancer: true
  postgresql:
    version: "11"
    parameters:
      max_connections: "101"
      ssl_ciphers: "ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305:DHE-RSA-AES128-GCM-SHA256:DHE-RSA-AES256-GCM-SHA384"
```
@blyry thanks for the detailed feedback (also thanks @nielsmeima for giving the hint). Hm, that sounds like you had a working default in place before v1.5.0, but I cannot think of where it changed on the operator side. Maybe something in Spilo...
Hello everyone, we will have a look into what went wrong here and whether we can be more helpful.
Thank you for the detailed reports; any additional input is great.
Also, if someone knows how to build an end-to-end test for this using kind, that would be very helpful to make sure future releases have this feature covered.
In regards to toggling `securityContext` (containing the FS group) not triggering a rolling update: that sounds like a bug to be resolved.

- [ ] honor changes in securityContext and propagate on changes
Is there a workaround besides downgrading to 1.4?
@dajudge First set e.g. `resources` on the postgres resource while you have `spiloFSGroup: 103`, wait for it to stabilise, then set the TLS setting, then wait for it to stabilise again.
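Sketched as manifest fragments, the two-phase sequence looks roughly like this (the resource values are taken from the examples earlier in the thread; the secret name is a placeholder):

```yaml
# Phase 1: apply spiloFSGroup together with some unrelated change
# (e.g. resources) so the operator recreates the statefulset,
# then wait for the cluster to stabilise.
spec:
  spiloFSGroup: 103
  resources:
    requests:
      cpu: 500m
      memory: 512Mi
---
# Phase 2, a separate apply once the pods are healthy: add TLS.
spec:
  spiloFSGroup: 103
  tls:
    secretName: pg-tls-cert   # placeholder secret name
```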
Ping, what's the status of this?
```
app-analytics-db-1 postgres 2021-02-08 19:20:00 UTC [86]: [1-1] 60218ee0.56 0 FATAL: could not load server certificate file "/tls/tls.crt": Permission denied
app-analytics-db-1 postgres 2021-02-08 19:20:00 UTC [86]: [2-1] 60218ee0.56 0 LOG: database system is shut down
```
What command to run to manually make the secondary replica come back online? Rolling back the change causes both to be restarted, which is fine, but not what I want.
Testing this on docker-desktop with a fresh cluster correctly rotates the certs.
The STS in prod has the correct `securityContext: { fsGroup: 103 }` when checked, but inside the pod in prod:
```
# id
uid=0(root) gid=0(root) groups=0(root),1337
# cd /tls
# ls -lah
total 4.0K
drwxrwsrwt 3 root 1337  140 Feb  8 19:19 .
drwxr-xr-x 1 root root 4.0K Feb  8 19:19 ..
drwxr-sr-x 2 root 1337  100 Feb  8 19:19 ..2021_02_08_19_19_29.118338671
lrwxrwxrwx 1 root root   13 Feb  8 19:19 ca.crt -> ..data/ca.crt
lrwxrwxrwx 1 root root   31 Feb  8 19:19 ..data -> ..2021_02_08_19_19_29.118338671
lrwxrwxrwx 1 root root   14 Feb  8 19:19 tls.crt -> ..data/tls.crt
lrwxrwxrwx 1 root root   14 Feb  8 19:19 tls.key -> ..data/tls.key
```
It's almost like the `postgres` user isn't running with the supplemental group `103`, BUT:

```
# getent group postgres
postgres:x:103:
```

in prod, so it should work. And the permissions are 777 on the TLS crt and key... Why isn't this working?
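A minimal sketch of the mechanism behind these symptoms may help. Kubernetes mounts secret files root-owned, and whether a process can read them comes down to the ordinary POSIX owner/group/other check; `fsGroup` works because the kubelet re-groups the volume files and adds that GID as a supplemental group to the container processes. The uid/gid values below (101/103 for the `postgres` user in the Spilo image) are illustrative assumptions:

```python
import stat

def can_read(mode: int, file_uid: int, file_gid: int,
             proc_uid: int, proc_groups: set) -> bool:
    """Simplified POSIX read check (ignores capabilities and ACLs)."""
    if proc_uid == 0:                      # root bypasses mode bits for reads
        return True
    if proc_uid == file_uid:               # owner class
        return bool(mode & stat.S_IRUSR)
    if file_gid in proc_groups:            # group class (incl. supplemental groups)
        return bool(mode & stat.S_IRGRP)
    return bool(mode & stat.S_IROTH)       # other class

# Secret mounted root:root with mode 0640: the postgres user loses.
print(can_read(0o640, 0, 0, 101, {103}))     # False
# With fsGroup: 103 the kubelet makes the files group 103, group-readable.
print(can_read(0o640, 0, 103, 101, {103}))   # True
```

This is also why checks run as `root` inside the pod (as in the listings above) always succeed and can hide the problem.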
This is happening for me even with a fresh installation on 1.5.0
...
![image](https://user-images.githubusercontent.com/193115/127004955-feb99005-73ba-4d09-acac-d3f890156f3f.png)
Seems like it's not enough to configure according to the docs and to the above; you also have to ensure you're not running the sts as root.
The problem with this is that the latest operator, 1.6.3, simply overwrites these despite setting `spilo_runasgroup` and `spilo_runasuser` in the global operator configuration... @FxKu Time to release a new operator?
It took me about 2 days to get SSL running. It took a lot of trial and error, and it would have been unimaginable to try to migrate an existing production cluster to SSL. I deleted all data, all CRDs, all manifests about 50 times... before succeeding.
What worked in the end:
- k8s: v1.22.2-3+9ad9ee77396805 with all container images set to their defaults
- postgres-operator: 1.7.1 with all container images set to their defaults
- cert-manager: 1.6.0
- nginx-ingress
- letsencrypt
- first make sure cert-manager works and gets its (HTTP/S) certificate
- make sure you do not have any postgres-operator CRDs installed, no postgres manifest, nothing like that
- install CRDs and the works as shown in postgres-operator's Quickstart documentation
- create a "kind: Postgres" manifest with:
```yaml
apiVersion: acid.zalan.do/v1
kind: postgresql
metadata:
  name: db-cluster
  namespace: db-namespace
spec:
  spiloFSGroup: 103
  # DO NOT CONFIGURE TLS!!!
  [...etc...]
```
As @haf wrote (thanks a lot for that workaround/hint @haf!): wait for the cluster to stabilize. Which means: get the logs of each of the pods that contain the Postgres processes and wait until the logs say:

```
... INFO: no action. I am (...) the leader with the lock
```

respectively

```
... INFO: no action. I am a secondary (...) and following a leader (..)
```
Once they are all fine as shown above, add the TLS settings:

```yaml
apiVersion: acid.zalan.do/v1
kind: postgresql
metadata:
  name: db-cluster
  namespace: db-namespace
spec:
  spiloFSGroup: 103
  tls:
    # whatever the secret's name is.
    secretName: letsencrypt-production
    # I am getting the certificate via cert-manager.
    # I only succeeded getting cert-manager to work with a `ClusterIssuer`
    # config. It would not work with an `Issuer` config.
  [...etc...]
```
and reapply the manifest.
Possibly, in order for the postgres pods to be able to use the certificate/secret, you need to enable:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: postgres-operator
data:
  enable_cross_namespace_secret: "true"
```
I am not sure if that's strictly required, but I'm done with trial and error now that I have a running setup.
If you are reading this and you have a deeper understanding of what exactly `spiloFSGroup` does, then future users would be helped if you could give the docs some love. Currently they only say:

> "OpenShift allocates the users and groups dynamically (based on scc), and their range is different in every namespace. Due to this dynamic behaviour, it's not trivial to know at deploy time the uid/gid of the user in the cluster. Therefore, instead of using a global spilo_fsgroup setting, use the spiloFSGroup field per Postgres cluster."

... where the user needs to somehow figure out on their own which manifest is meant by "Postgres cluster", where exactly `spiloFSGroup` should be set (I found out by searching GitHub and finding this issue), what the value of `spiloFSGroup` is supposed to be, and that this in fact doesn't only apply to OpenShift but also to vanilla k8s clusters, where SCC is meaningless (?).
Closing words: thanks a lot postgres/patroni/postgres-operator/k8s/zalando team(s), truly awesome work: <3 !!!! And to everybody that's contributing!!!
Just a question, where did you guys pull the number 103 from? Is it the group uid set in the Spilo image?
@MatteoGioioso

> Just a question, where did you guys pull the number 103 from? Is it the group uid set in the Spilo image?

From the quoted docs.
- create database with `postgresql` resource
- applications can't connect because SSL is enabled, and they don't trust the cert
- create a self-signed issuer per https://www.crunchydata.com/blog/using-cert-manager-to-deploy-tls-for-postgres-on-kubernetes (i wish zalando postgres-operator had these instructions :heart:)
- modify `postgresql` resource to use cert secret
- add `spec.spiloFSGroup: 103` to `postgresql` resource

and magically things work!
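For anyone following the same route, the self-signed setup from the Crunchy Data post can be sketched with cert-manager resources like these (all names, the namespace, and the DNS name are placeholders, not from this thread):

```yaml
apiVersion: cert-manager.io/v1
kind: Issuer
metadata:
  name: selfsigned-issuer
  namespace: db-namespace
spec:
  selfSigned: {}
---
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: db-cluster-tls
  namespace: db-namespace
spec:
  secretName: db-cluster-tls   # reference this from the postgresql "tls" section
  dnsNames:
    - db-cluster.db-namespace.svc.cluster.local
  issuerRef:
    name: selfsigned-issuer
    kind: Issuer
```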
This took a lot of running around.

I think the documentation could be improved in this area. https://postgres-operator.readthedocs.io/en/latest/user/#custom-tls-certificates is a start, but if applications are going to be fussy because of forced TLS using an invalid certificate, there should be more setup instructions to facilitate good communication between clients and the new database.
@travnewmatic thanks for the feedback. Maybe you'd like to improve our docs in a PR? :smiley:
Shouldn't this be fixed instead? I hit this roadblock, and was frustrated before I managed to find this thread. Luckily turning SSL off is a temporary, but viable, solution for us. Please fix, and many thanks for your amazing work!
Has this been fixed?
We did not receive any new reports on this topic. It seems to be running now for most of you, so I'm closing it.