postgres-operator

Permission issue with tls cert after upgrading to 1.5

Open nmlc opened this issue 4 years ago • 22 comments

After upgrading the operator from 1.4 to 1.5, my cluster can no longer initialize.

I've checked that the certs are mounted into the container, and I'm able to read them as the root user. I'm not sure where to look next.

Logs from cluster pods:

...
selecting default max_connections ... 100
selecting default shared_buffers ... 128MB
selecting default time zone ... Etc/UTC
creating configuration files ... ok
running bootstrap script ... ok
performing post-bootstrap initialization ... ok
syncing data to disk ... ok

Success. You can now start the database server using:

    /usr/lib/postgresql/12/bin/pg_ctl -D /home/postgres/pgdata/pgroot/data -l logfile start

2020-05-20 14:40:25 UTC [301]: [1-1] 5ec54159.12d 0     FATAL:  could not load server certificate file "/tls/tls.crt": Permission denied
2020-05-20 14:40:25 UTC [301]: [2-1] 5ec54159.12d 0     LOG:  database system is shut down
2020-05-20 14:40:25,108 INFO: postmaster pid=301
/var/run/postgresql:5432 - no response
2020-05-20 14:40:25,121 INFO: removing initialize key after failed attempt to bootstrap the cluster
2020-05-20 14:40:25,137 INFO: renaming data directory to /home/postgres/pgdata/pgroot/data_2020-05-20-14-40-25
2020-05-20 14:40:25,587 INFO: Lock owner: None; I am grafana-cluster-0
Traceback (most recent call last):
  File "/usr/local/bin/patroni", line 11, in <module>
    load_entry_point('patroni==1.6.5', 'console_scripts', 'patroni')()
  File "/usr/local/lib/python3.6/dist-packages/patroni/__init__.py", line 235, in main
    return patroni_main()
  File "/usr/local/lib/python3.6/dist-packages/patroni/__init__.py", line 199, in patroni_main
    patroni.run()
  File "/usr/local/lib/python3.6/dist-packages/patroni/__init__.py", line 135, in run
    logger.info(self.ha.run_cycle())
  File "/usr/local/lib/python3.6/dist-packages/patroni/ha.py", line 1370, in run_cycle
    info = self._run_cycle()
  File "/usr/local/lib/python3.6/dist-packages/patroni/ha.py", line 1277, in _run_cycle
    return self.post_bootstrap()
  File "/usr/local/lib/python3.6/dist-packages/patroni/ha.py", line 1173, in post_bootstrap
    self.cancel_initialization()
  File "/usr/local/lib/python3.6/dist-packages/patroni/ha.py", line 1168, in cancel_initialization
    raise PatroniException('Failed to bootstrap cluster')
patroni.exceptions.PatroniException: 'Failed to bootstrap cluster'
/run/service/patroni: finished with code=1 signal=0
/run/service/patroni: exceeded maximum number of restarts 5
stopping /run/service/patroni
timeout: finish: .: (pid 303) 10s, want down

Permissions in container:

root@grafana-cluster-0:/home/postgres# ls -la /tls
total 4
drwxrwxrwt 3 root root  120 May 20 15:16 .
drwxr-xr-x 1 root root 4096 May 20 15:17 ..
drwxr-xr-x 2 root root   80 May 20 15:16 ..2020_05_20_15_16_36.928412376
lrwxrwxrwx 1 root root   31 May 20 15:16 ..data -> ..2020_05_20_15_16_36.928412376
lrwxrwxrwx 1 root root   14 May 20 15:16 tls.crt -> ..data/tls.crt
lrwxrwxrwx 1 root root   14 May 20 15:16 tls.key -> ..data/tls.key
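
The listing above only shows the symlinks, which always appear as `lrwxrwxrwx`; the owner, group and mode that matter are those of the files they point to. A minimal sketch, assuming Python 3 is available in the pod (the traceback above suggests it is), that follows the links and reports the real values. The function name `describe_target` is made up for this illustration.

```python
import os
import stat


def describe_target(path):
    """Follow symlinks and report the real file's owner, group and mode."""
    real = os.path.realpath(path)  # resolves tls.crt -> ..data/tls.crt -> real file
    st = os.stat(real)
    return {
        "path": real,
        "uid": st.st_uid,
        "gid": st.st_gid,
        "mode": stat.filemode(st.st_mode),  # rendered like the first column of ls -l
    }


if __name__ == "__main__":
    # Paths taken from the listing above; run this inside the pod.
    for name in ("/tls/tls.crt", "/tls/tls.key"):
        try:
            print(describe_target(name))
        except OSError as exc:
            print(f"{name}: {exc}")
```

Comparing the reported uid, gid and mode with the user that runs postgres shows whether that user, rather than root, should be able to read the file.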

nmlc avatar May 20 '20 15:05 nmlc

@nmlc did you upgrade by editing the deployment? Are you also using the latest Spilo image? Have you updated the cluster roles, too?

FxKu avatar May 22 '20 11:05 FxKu

I upgraded via the chart by simply changing the version. So yes, I use the latest Spilo image, and I've updated the cluster roles too.

nmlc avatar May 22 '20 11:05 nmlc

Identical issue. I installed from scratch (1.5.0) and get exactly the same permission error. Permissions seem to be set correctly, and the certificate files can be read by both root and postgres. Any ideas?

operator: registry.opensource.zalan.do/acid/postgres-operator:v1.5.0
spilo: registry.opensource.zalan.do/acid/spilo-12:1.6-p3

EDIT: apparently, with spilo_fsgroup: "103" the keys can now be read.

nielsmeima avatar Jun 01 '20 13:06 nielsmeima

Hey! Just came here to file this exact issue! We upgraded our operators to 1.5, worked through a bug in the helm chart re: quotes around the connection pooler integer settings, and then started applying our letsencrypt certs. That is where we ran into this exact problem: by default, the TLS certs are mounted with incorrect permissions.

  1. solution: Adding spiloFSGroup: 103 to our manifests mounts the certmanager secret with the appropriate permissions, and postgres can then start instead of being stuck in a boot loop erroring out on the TLS permissions.

  2. here be dragons: Separate bug: adding spiloFSGroup to the manifest alone does not trigger a recreation of the statefulset; we have to change an unrelated property to get the operator to recreate the statefulset with the proper securityContext.

  3. cascading failure: In addition, adding the certs and the security context to the manifest at the same time doesn't work either. I /think/ it's an order-of-operations problem: the operator doesn't watch the spiloFSGroup property, so the statefulset isn't updated, but it DOES add the new secret to the pods. The rolling upgrade therefore starts, but then errors out because the first pod never completes the upgrade. At this point it's easiest to roll back the TLS changes, poke at the operator until it upgrades the sts, and THEN add TLS once you've verified that the statefulset and the pods have the proper security context.

https://github.com/zalando/postgres-operator/pull/704 This PR sorta fixes the issue, but was closed because there are lots of considerations around file permissions. I think this is worth looking into, as TLS is important. A guide to integrating letsencrypt and certmanager would be great too, as would having the operator notice when a cert was updated and automatically do cluster failovers.

This is the expected output of a properly configured statefulset. If you don't see the bottom securityContext with the fsGroup, then the sts isn't configured correctly and adding the TLS cert will fail.

kubectl -n pgtest get sts -o yaml | grep -C 2 security

              cpu: 500m
              memory: 512Mi
          securityContext:
            privileged: false
            procMount: Default
--
        restartPolicy: Always
        schedulerName: default-scheduler
        securityContext:
          fsGroup: 103
        serviceAccount: pgtest-postgres-pod
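
The same check can be done without eyeballing grep output. A minimal sketch, assuming the input is the text printed by `kubectl -n pgtest get sts -o json`; the function names and the default of 103 are chosen for this illustration and are not part of the operator.

```python
import json


def pod_fs_group(sts):
    """Return the pod-level securityContext.fsGroup of one StatefulSet, or None."""
    pod_spec = sts.get("spec", {}).get("template", {}).get("spec", {})
    return (pod_spec.get("securityContext") or {}).get("fsGroup")


def sts_without_fs_group(kubectl_json, expected=103):
    """List names of StatefulSets whose pod-level fsGroup is not `expected`."""
    doc = json.loads(kubectl_json)
    # `kubectl get sts -o json` returns a List object; a single object has no "items".
    items = doc["items"] if "items" in doc else [doc]
    return [
        item["metadata"]["name"]
        for item in items
        if pod_fs_group(item) != expected
    ]
```

An empty result means every statefulset in the namespace already carries the expected pod-level fsGroup, which is the precondition described above for adding the TLS secret.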

blyry avatar Jun 02 '20 16:06 blyry

Can confirm that adding spiloFSGroup AND a tls section to the postgres manifest at the same time doesn't work, but adding spiloFSGroup first and then, a short while later, adding the TLS portion DOES work. So when both changes are present in the manifest, something gets missed.

Side note: we are also applying cipher configuration based off of the mozilla tls generator. https://ssl-config.mozilla.org/#server=postgresql&version=11&config=intermediate&openssl=1.1.1d&guideline=5.4

it looks like this in the manifest:

spec:
  teamId: "pgtest"
  enableLogicalBackup: true
  enableReplicaLoadBalancer: true
  postgresql:
    version: "11"
    parameters:
      max_connections: "101"
      ssl_ciphers: "ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305:DHE-RSA-AES128-GCM-SHA256:DHE-RSA-AES256-GCM-SHA384"

blyry avatar Jun 02 '20 17:06 blyry

@blyry thanks for the detailed feedback (also thanks @nielsmeima for giving the hint). Hm, that sounds like we had a working default in place before v1.5.0, but I cannot think of where it has changed on the operator side. Maybe something in Spilo...

FxKu avatar Jun 03 '20 17:06 FxKu

Hello everyone, we will have a look into what went wrong here and if we can be more helpful.

Thank you for the detailed reports and any additional input is great.

Also, if someone knows how to build an end-to-end test for this using kind, that would be very helpful to make sure future releases have this feature covered.

Jan-M avatar Jun 04 '20 09:06 Jan-M

In regards to toggling securityContext (containing the FS Group) not triggering a rolling update - that sounds like a bug to be resolved.

  • [ ] honor changes in securityContext and propagate on changes

Jan-M avatar Jun 04 '20 10:06 Jan-M

Is there a workaround besides downgrading to 1.4?

dajudge avatar Oct 06 '20 13:10 dajudge

@dajudge First set e.g. resources on the postgres resource while you have spiloFSGroup: 103, wait for it to stabilise, then set the TLS setting, then wait for it to stabilise again.

haf avatar Dec 25 '20 13:12 haf

Ping, what's the status of this?

haf avatar Feb 08 '21 19:02 haf

app-analytics-db-1 postgres 2021-02-08 19:20:00 UTC [86]: [1-1] 60218ee0.56 0     FATAL:  could not load server certificate file "/tls/tls.crt": Permission denied
app-analytics-db-1 postgres 2021-02-08 19:20:00 UTC [86]: [2-1] 60218ee0.56 0     LOG:  database system is shut down

What command can I run to manually bring the secondary replica back online? Rolling back the change causes both to be restarted, which is fine, but not what I want.

Testing this on docker-desktop with a fresh cluster correctly rotates the certs.

The STS in prod has the correct securityContext: { fsGroup: 103 } when checking, but inside the pod in prod:

# id
uid=0(root) gid=0(root) groups=0(root),1337
# cd /tls
# ls -lah
total 4.0K
drwxrwsrwt 3 root 1337  140 Feb  8 19:19 .
drwxr-xr-x 1 root root 4.0K Feb  8 19:19 ..
drwxr-sr-x 2 root 1337  100 Feb  8 19:19 ..2021_02_08_19_19_29.118338671
lrwxrwxrwx 1 root root   13 Feb  8 19:19 ca.crt -> ..data/ca.crt
lrwxrwxrwx 1 root root   31 Feb  8 19:19 ..data -> ..2021_02_08_19_19_29.118338671
lrwxrwxrwx 1 root root   14 Feb  8 19:19 tls.crt -> ..data/tls.crt
lrwxrwxrwx 1 root root   14 Feb  8 19:19 tls.key -> ..data/tls.key

It's almost like the postgres user isn't running with the supplemental group 103, BUT:

#  getent group postgres
postgres:x:103:

in prod, so it should work. And the permissions are 777 on the TLS crt and key... Why isn't this working?
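
One way to reason about this is the classic owner/group/other check that decides read access. The sketch below is a simplification that ignores ACLs and capabilities. The file mode and the postgres UID are hypothetical, since the thread shows neither the mode of the real file behind the symlink nor the UID of the postgres user. The group IDs come from the listing above, where the volume directory is owned by group 1337 rather than 103.

```python
def can_read(mode, owner_uid, owner_gid, uid, gids):
    """Classic POSIX read check; ignores ACLs, capabilities and LSMs."""
    if uid == 0:
        return True  # root bypasses the mode bits for reads
    if uid == owner_uid:
        return bool(mode & 0o400)  # owner bits apply, no fall-through
    if owner_gid in gids:
        return bool(mode & 0o040)  # group bits apply
    return bool(mode & 0o004)  # everyone else


# Hypothetical values: the mode of the real file and the postgres UID
# are assumptions made for this example only.
FILE_MODE, FILE_UID, FILE_GID = 0o640, 0, 1337
POSTGRES_UID = 101

print(can_read(FILE_MODE, FILE_UID, FILE_GID, POSTGRES_UID, {103}))        # False
print(can_read(FILE_MODE, FILE_UID, FILE_GID, POSTGRES_UID, {103, 1337}))  # True
print(can_read(FILE_MODE, FILE_UID, FILE_GID, 0, {0}))                     # True
```

Under these assumptions, a successful read as root proves nothing about the postgres user, and membership in group 103 does not help if the files belong to a different group.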

haf avatar Feb 08 '21 19:02 haf

This is happening for me even on a fresh installation of 1.5.0.

tusharbhasme avatar May 16 '21 12:05 tusharbhasme

...

haf avatar Jul 26 '21 11:07 haf

image

Seems like it's not enough to configure according to the docs and to the above; you also have to ensure you're not running the sts as root.

haf avatar Jul 26 '21 14:07 haf

The problem with this is that the latest Spilo operator 1.6.3 simply overwrites these despite setting them as spilo_runasgroup and spilo_runasuser in the global operator configuration... @FxKu Time to release a new operator?

haf avatar Jul 26 '21 16:07 haf

It took me about 2 days to get SSL running. It took a lot of trial and error, and it would have been unimaginable to try to migrate an existing production cluster to SSL. I deleted all data, all CRDs and all manifests about 50 times... before succeeding.

What worked in the end:

  • k8s: v1.22.2-3+9ad9ee77396805 with all container images set to their defaults
  • postgres-operator: 1.7.1 with all container images set to their defaults
  • cert-manager: 1.6.0
  • nginx-ingress
  • letsencrypt

With that setup, the steps were:

  • first make sure cert-manager works and gets its (HTTP/S) certificate
  • make sure you do not have any postgres-operator CRDs installed, no postgres manifest, nothing like that
  • install the CRDs and the works as shown in postgres-operator's Quickstart documentation
  • create a "kind: postgresql" manifest with:
apiVersion: acid.zalan.do/v1
kind: postgresql
metadata:
  name: db-cluster
  namespace: db-namespace
spec:
  spiloFSGroup: 103
  # DO NOT CONFIGURE TLS!!!
  [...etc...]

As @haf wrote (thanks a lot for that workaround/hint @haf !): wait for the cluster to stabilize. Which means: get the logs of each of the pods that contain the Postgres processes and wait until the logs say:

... INFO: no action. I am (...) the leader with the lock

or, on the replicas:

... INFO: no action. I am a secondary (...) and following a leader (..)
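
The "wait until stable" condition can be expressed as a small check on the most recent log line of each pod. A minimal sketch, assuming the message fragments quoted above; the function names and the input dictionary are invented for this illustration, and how the log lines are collected (for example with `kubectl logs`) is left out.

```python
# Fragments taken from the two log messages quoted above.
STABLE_MARKERS = ("the leader with the lock", "following a leader")


def is_stable(line):
    """True if a Patroni log line reports a settled leader or replica."""
    return "INFO: no action." in line and any(m in line for m in STABLE_MARKERS)


def cluster_is_stable(last_line_by_pod):
    """True if there is at least one pod and every pod's last line is stable."""
    return bool(last_line_by_pod) and all(
        is_stable(line) for line in last_line_by_pod.values()
    )
```

Only when `cluster_is_stable` holds for all pods would the next manifest change be applied.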

Once every pod reports one of these two messages, add the TLS settings:

apiVersion: acid.zalan.do/v1
kind: postgresql
metadata:
  name: db-cluster
  namespace: db-namespace
spec:
  spiloFSGroup: 103
  tls:
    # whatever the secret's name is.
    secretName: letsencrypt-production
    # I am getting the certificate via cert-manager
    # I only succeeded getting cert-manager to work with a `ClusterIssuer` config. It would not work with an `Issuer` config.
  [...etc...]

and reapply the manifest.
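
The two-step procedure can be summarised as a rule: never introduce spiloFSGroup and tls in the same apply. The helper below is purely illustrative and not an operator feature; `rollout_steps` and its arguments are hypothetical names, and the specs are plain dictionaries mirroring the `spec:` section of the manifests above.

```python
import copy


def rollout_steps(current_spec, desired_spec):
    """Split a change that both sets spiloFSGroup and adds tls into two applies."""
    changes_fs_group = current_spec.get("spiloFSGroup") != desired_spec.get("spiloFSGroup")
    adds_tls = "tls" in desired_spec and "tls" not in current_spec
    if not (changes_fs_group and adds_tls):
        return [desired_spec]  # safe to apply in one go
    first = copy.deepcopy(desired_spec)
    del first["tls"]  # step 1: everything except TLS
    return [first, desired_spec]  # step 2: full spec, once the cluster is stable


steps = rollout_steps(
    {"teamId": "pgtest"},
    {"teamId": "pgtest", "spiloFSGroup": 103, "tls": {"secretName": "letsencrypt-production"}},
)
print(len(steps))  # 2
```

As reported earlier in the thread, changing spiloFSGroup alone may not recreate the statefulset, so the first step may also need an unrelated change before the second one is applied.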

Possibly, for the postgres pods to be able to use the certificate/secret, you also need to enable:

apiVersion: v1
kind: ConfigMap
metadata:
  name: postgres-operator
data:
  enable_cross_namespace_secret: "true"

I am not sure if that's strictly required, but I'm done with trial and error now that I have a running setup.

If you are reading this and have a deeper understanding of what exactly spiloFSGroup does, future users would be helped if you could give the docs some love. Currently they only say:

"OpenShift allocates the users and groups dynamically (based on scc), and their range is different in every namespace. Due to this dynamic behaviour, it's not trivial to know at deploy time the uid/gid of the user in the cluster. Therefore, instead of using a global spilo_fsgroup setting, use the spiloFSGroup field per Postgres cluster."

... where users have to figure out for themselves which manifest "Postgres cluster" refers to, where exactly spiloFSGroup should be set (I found out by searching GitHub and finding this issue), and what its value is supposed to be. They also have to discover that this in fact applies not only to OpenShift but also to vanilla k8s clusters, where SCC is meaningless (?).

Closing words: thanks a lot postgres/patroni/postgres-operator/k8s/zalando team(s), truly awesome work: <3 !!!! And to everybody that's contributing!!!

tpo avatar Nov 08 '21 20:11 tpo

Just a question, where did you guys pull the number 103 from? Is it the group uid set in the Spilo image?

MatteoGioioso avatar Apr 05 '22 11:04 MatteoGioioso

@MatteoGioioso

> Just a question, where did you guys pull the number 103 from? Is it the group uid set in the Spilo image?

From the docs quoted above.

tpo avatar Apr 06 '22 12:04 tpo

  • create a database with the postgresql resource
  • applications can't connect because SSL is enabled and they don't trust the cert
  • create a self-signed issuer per https://www.crunchydata.com/blog/using-cert-manager-to-deploy-tls-for-postgres-on-kubernetes (I wish the Zalando postgres-operator had these instructions :heart:)
  • modify the postgresql resource to use the cert secret
  • add spec.spiloFSGroup: 103 to the postgresql resource

and magically things work!

This took a lot of running around.

I think the documentation could be improved in this area. https://postgres-operator.readthedocs.io/en/latest/user/#custom-tls-certificates is a start, but if applications are going to be fussy because of forced TLS with an invalid certificate, there should be more setup instructions to facilitate good communication between clients and the new database.

travnewmatic avatar May 17 '22 08:05 travnewmatic

@travnewmatic thanks for the feedback. Maybe you'd like to improve our docs in a PR? :smiley:

FxKu avatar May 18 '22 07:05 FxKu

> @travnewmatic thanks for the feedback. Maybe you'd like to improve our docs in a PR? :smiley:

Shouldn't this be fixed instead? I hit this roadblock, and was frustrated before I managed to find this thread. Luckily turning SSL off is a temporary, but viable, solution for us. Please fix, and many thanks for your amazing work!

caniko avatar Aug 08 '22 15:08 caniko

Has this been fixed?

h0jeZvgoxFepBQ2C avatar Mar 28 '23 15:03 h0jeZvgoxFepBQ2C

We have not received any new reports on this topic. It seems to be working now for most of you, so I'm closing it.

FxKu avatar Apr 24 '24 07:04 FxKu