
Postgres-operator cannot manage Postgres cluster after upgrade.

hedgss opened this issue 4 years ago · 14 comments

Hello,

I upgraded postgres-operator from version 1.2.0 to 1.5.0. Kubernetes: 1.18.3. What I have done:

  • We upgraded the k8s cluster from 1.14.6 to 1.18.3.
  • I found out that the old postgres-operator doesn't work properly with a manifest: it cannot create a new Postgres cluster.
  • Installed the latest postgres-operator in another namespace. Both postgres-operator versions were watching only their own namespace. Added a new manifest and checked that the new cluster works.
  • Upgraded the old instance of postgres-operator to the latest version. I was using Helm 2.
  • I got the errors below.
  • Deleted the Helm 2 release.
  • Installed it using Helm 3. No change.

I'd really appreciate it if somebody could clarify what I did wrong and how to fix it.

time="2020-07-01T14:58:08Z" level=warning msg="error while syncing cluster state: could not sync statefulsets: could not update statefulset: could not patch statefulset spec \"data/myteam-postgres-cluster-0\": StatefulSet.apps \"myteam-postgres-cluster-0\" is invalid: spec: Forbidden: updates to statefulset spec for fields other than 'replicas', 'template', and 'updateStrategy' are forbidden" cluster-name=data/myteam-postgres-cluster-0 pkg=cluster time="2020-07-01T14:58:08Z" level=error msg="could not sync cluster: could not sync statefulsets: could not update statefulset: could not patch statefulset spec \"data/myteam-postgres-cluster-0\": StatefulSet.apps \"myteam-postgres-cluster-0\" is invalid: spec: Forbidden: updates to statefulset spec for fields other than 'replicas', 'template', and 'updateStrategy' are forbidden" cluster-name=data/myteam-postgres-cluster-0 pkg=controller worker=1 time="2020-07-01T14:58:09Z" level=debug msg="Generating Spilo container, environment variables: [{SCOPE myteam-postgres-cluster-1 nil} {PGROOT /home/postgres/pgdata/pgroot nil} {POD_IP &EnvVarSource{FieldRef:&ObjectFieldSelector{APIVersion:v1,FieldPath:status.podIP,},ResourceFieldRef:nil,ConfigMapKeyRef:nil,SecretKeyRef:nil,}} {POD_NAMESPACE &EnvVarSource{FieldRef:&ObjectFieldSelector{APIVersion:v1,FieldPath:metadata.namespace,},ResourceFieldRef:nil,ConfigMapKeyRef:nil,SecretKeyRef:nil,}} {PGUSER_SUPERUSER postgres nil} {KUBERNETES_SCOPE_LABEL cluster-name nil} {KUBERNETES_ROLE_LABEL spilo-role nil} {PGPASSWORD_SUPERUSER &EnvVarSource{FieldRef:nil,ResourceFieldRef:nil,ConfigMapKeyRef:nil,SecretKeyRef:&SecretKeySelector{LocalObjectReference:LocalObjectReference{Name:postgres.myteam-postgres-cluster-1.credentials,},Key:password,Optional:nil,},}} {PGUSER_STANDBY standby nil} {PGPASSWORD_STANDBY &EnvVarSource{FieldRef:nil,ResourceFieldRef:nil,ConfigMapKeyRef:nil,SecretKeyRef:&SecretKeySelector{LocalObjectReference:LocalObjectReference{Name:standby.myteam-postgres-cluster-1.credentials,},Key:password,Optional:nil,},}} {PAM_OAUTH2 https://info.example.com/oauth2/tokeninfo?access_token= uid realm=/employees nil} {HUMAN_ROLE zalandos nil} {KUBERNETES_LABELS {\"application\":\"spilo\"} nil} {SPILO_CONFIGURATION {\"postgresql\":{\"bin_dir\":\"/usr/lib/postgresql/11/bin\",\"parameters\":{\"bg_mon.naptime\":\"10\",\"bg_mon.port\":\"8000\",\"default_statistics_target\":\"100\",\"effective_cache_size\":\"2304MB\",\"log_min_error_statement\":\"FATAL\",\"log_min_messages\":\"FATAL\",\"log_statement\":\"ddl\",\"maintenance_work_mem\":\"192MB\",\"max_wal_size\":\"2GB\",\"min_wal_size\":\"1GB\",\"shared_buffers\":\"768MB\",\"wal_buffers\":\"16MB\",\"work_mem\":\"3932kB\"},\"pg_hba\":[\"host all all 127.0.0.1/32 trust\",\"hostssl all all 0.0.0.0/0 md5\",\"host all all 0.0.0.0/0 md5\"]},\"bootstrap\":{\"initdb\":[{\"auth-host\":\"md5\"},{\"auth-local\":\"trust\"},\"data-checksums\",{\"encoding\":\"UTF8\"},{\"locale\":\"en_US.UTF-8\"}],\"users\":{\"zalandos\":{\"password\":\"\",\"options\":[\"CREATEDB\",\"NOLOGIN\"]}},\"dcs\":{\"ttl\":30,\"loop_wait\":10,\"retry_timeout\":10,\"maximum_lag_on_failover\":33554432,\"postgresql\":{\"parameters\":{\"max_connections\":\"100\",\"wal_level\":\"logical\"}}}}} nil} {DCS_ENABLE_KUBERNETES_API true nil} {BG_MON_PORT 8000 nil} {FLUENTBIT_LOG_LEVEL info nil} {KAFKA_LOGS_BROCKERS kafka.auxiliary:9092 nil} {KAFKA_LOGS_TOPICS fluentbit-logs nil} {MONITORING_INTERVAL 10s nil} {TELEGRAF_METRIC_BATCH_SIZE 1001 nil} {TELEGRAF_METRIC_BUFFER_LIMIT 60000 nil} {TELEGRAF_PROMETHEUS_METRICS_PORT 8080 
nil}]" cluster-name=data/myteam-postgres-cluster-1 pkg=cluster time="2020-07-01T14:58:09Z" level=warning msg="error while syncing cluster state: could not sync statefulsets: could not update statefulset: could not patch statefulset spec \"data/myteam-postgres-cluster-1\": StatefulSet.apps \"myteam-postgres-cluster-1\" is invalid: spec: Forbidden: updates to statefulset spec for fields other than 'replicas', 'template', and 'updateStrategy' are forbidden" cluster-name=data/myteam-postgres-cluster-1 pkg=cluster time="2020-07-01T14:58:09Z" level=error msg="could not update status: postgresql.acid.zalan.do \"myteam-postgres-cluster-1\" is invalid: spec.patroni.slots: Invalid value: \"null\": spec.patroni.slots in body must be of type object: \"null\"" cluster-name=data/myteam-postgres-cluster-1 pkg=cluster time="2020-07-01T14:58:09Z" level=error msg="could not sync cluster: could not sync statefulsets: could not update statefulset: could not patch statefulset spec \"data/myteam-postgres-cluster-1\": StatefulSet.apps \"myteam-postgres-cluster-1\" is invalid: spec: Forbidden: updates to statefulset spec for fields other than 'replicas', 'template', and 'updateStrategy' are forbidden" cluster-name=data/myteam-postgres-cluster-1 pkg=controller worker=2 time="2020-07-01T14:58:24Z" level=error msg="could not connect to PostgreSQL database: dial tcp 10.43.57.207:5432: i/o timeout" cluster-name=data/myteam-postgres-cluster-2 pkg=cluster time="2020-07-01T14:58:39Z" level=error msg="could not connect to PostgreSQL database: dial tcp 10.43.57.207:5432: i/o timeout" cluster-name=data/myteam-postgres-cluster-2 pkg=cluster time="2020-07-01T14:58:54Z" level=error msg="could not connect to PostgreSQL database: dial tcp 10.43.57.207:5432: i/o timeout" cluster-name=data/myteam-postgres-cluster-2 pkg=cluster time="2020-07-01T14:59:09Z" level=error msg="could not connect to PostgreSQL database: dial tcp 10.43.57.207:5432: i/o timeout" cluster-name=data/myteam-postgres-cluster-2 pkg=cluster time="2020-07-01T14:59:24Z" level=error msg="could not connect to PostgreSQL database: dial tcp 10.43.57.207:5432: i/o timeout" cluster-name=data/myteam-postgres-cluster-2 pkg=cluster time="2020-07-01T14:59:39Z" level=error msg="could not connect to PostgreSQL database: dial tcp 10.43.57.207:5432: i/o timeout" cluster-name=data/myteam-postgres-cluster-2 pkg=cluster time="2020-07-01T14:59:54Z" level=error msg="could not connect to PostgreSQL database: dial tcp 10.43.57.207:5432: i/o timeout" cluster-name=data/myteam-postgres-cluster-2 pkg=cluster time="2020-07-01T15:00:09Z" level=error msg="could not connect to PostgreSQL database: dial tcp 10.43.57.207:5432: i/o timeout" cluster-name=data/myteam-postgres-cluster-2 pkg=cluster time="2020-07-01T15:00:09Z" level=warning msg="error while syncing cluster state: could not sync roles: could not init db connection: could not init db connection: still failing after 8 retries" cluster-name=data/myteam-postgres-cluster-2 pkg=cluster time="2020-07-01T15:00:09Z" level=error msg="could not sync cluster: could not sync roles: could not init db connection: could not init db connection: still failing after 8 retries" cluster-name=data/myteam-postgres-cluster-2 pkg=controller worker=4

hedgss · Jul 01 '20 15:07

How did you upgrade the operator? Have you updated the cluster roles, too? There were some changes between v1.2.0 and v1.5.0; Spilo pods now use their own cluster role, postgres-pod.
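For example, a quick check that the new RBAC objects exist (the cluster role name comes from the note above; the service account name assumes the default pod_service_account_name setting, and "data" is the namespace from your logs, so adjust both if your setup differs):

# cluster role the Spilo pods bind to, and the pod service account in the database namespace
kubectl get clusterrole postgres-pod
kubectl -n data get serviceaccount postgres-pod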

I can also see errors coming from the CRD validation that was introduced with v1.3.0. It can hit you when the status of the Postgres resource is updated: the manifest is then validated again, and the way you specified slots under the patroni section seems to be wrong. I remember we also had it wrong once in our example manifest.

postgresql.acid.zalan.do \"myteam-postgres-cluster-1\" is invalid: spec.patroni.slots: Invalid value: \"null\": spec.patroni.slots in body must be of type object: \"null\""
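For reference, a slots definition that passes this validation is an object keyed by slot name, roughly like the fragment below (modelled on the project's example manifest; slot name, database and plugin are placeholders):

  patroni:
    slots:
      permanent_logical_1:
        type: logical
        database: foo
        plugin: pgoutput

If you don't need any replication slots, omitting the slots key entirely should also avoid the "null" error.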

FxKu · Jul 09 '20 15:07

I upgraded the Helm installation of postgres-operator. The Postgres clusters were still available, but postgres-operator could not manage them anymore.

I had some errors related to the manifests; I fixed them, but that didn't solve the issue.

Eventually, I removed postgres-operator and all related configs, installed the new one using Helm 3, created new Postgres clusters and migrated all the data. Now it works, but the thing is that I will need to upgrade more postgres-operator instances and Postgres clusters.

How can it be done with less effort? What should I pay attention to?

Also, I am going to upgrade the Spilo version from spilo-11:1.6-p1 to the latest one, which uses a newer PostgreSQL. Will Spilo handle that automatically? I was using a custom Spilo image with pg_partman; the new Spilo image already contains it. How will that be migrated?

In spilo-11:1.6-p1 there is a permission configuration. What should I do if I want to reduce a user's permissions from the database level to the schema level?

hedgss · Jul 14 '20 14:07

To update Spilo you simply change the docker image in the operator configuration, and a rolling update will be performed. To update your other clusters, better update the RBAC part and the configuration first, before updating the image in the operator deployment. And of course, exercise it in a test environment first, and provide feedback here if it doesn't work, before doing it in production.
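Purely as an illustration (assuming the ConfigMap-based operator configuration with the default names from the Helm chart; namespace, ConfigMap name, deployment name and image tag are placeholders for your setup), the image change could look like this:

# point the operator at the new Spilo image
kubectl -n data patch configmap postgres-operator --type merge \
  -p '{"data":{"docker_image":"registry.opensource.zalan.do/acid/spilo-12:1.6-p3"}}'
# restart the operator so it reloads the configuration and rolls the Spilo pods
kubectl -n data rollout restart deployment postgres-operator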

FxKu · Jul 14 '20 15:07

We had the same issue. I was using your helm chart (which does the rbac stuff) and upgraded from 1.3 to 1.5.

  • The operator upgraded itself and wanted to change the statefulsets (which it cannot, because those attributes are immutable).
  • I deleted the sts to let the operator recreate them.
  • The new clusters could not spawn: it seems they tried to bootstrap again, but the persistent volumes still had the old data (of course, because we want to migrate it), and it failed with:
psycopg2.OperationalError: FATAL:  password authentication failed for user "standby"

no pg_hba.conf entry for replication connection from host "127.0.0.1", user "standby", SSL off

What I saw later was that the suffix of the credential secret names had changed, so completely new passwords were created. Maybe this can even be prevented by setting the same secret suffix. In the end, for my use case, I pg_dumped all databases of all clusters manually, deleted all postgresql cluster objects, all pods, all PVCs, all secrets and all PDBs (because 1.5 did not recognize those objects anymore), recreated all cluster objects, then restored all databases via cat dump.sql | psql $database for each cluster and set the new secret credential name for each application using those clusters. This took several days, but at least it's working again for me.
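For anyone facing the same migration, a rough sketch of that dump-and-restore loop might look like the following (host name, user and password handling are placeholders, not the exact commands used above; the databases themselves are recreated by the operator from the cluster manifest):

# assumes PGPASSWORD or ~/.pgpass is set for the postgres user
# dump every non-template database of the old cluster
for db in $(psql -h myteam-postgres-cluster -U postgres -AtX \
    -c "SELECT datname FROM pg_database WHERE NOT datistemplate"); do
  pg_dump -h myteam-postgres-cluster -U postgres "$db" > "${db}.sql"
done

# ... delete the postgresql objects, pods, PVCs, secrets and PDBs, recreate the clusters ...

# restore each dump into the recreated cluster
for f in *.sql; do
  cat "$f" | psql -h myteam-postgres-cluster -U postgres "${f%.sql}"
done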

stefanandres · Jul 16 '20 07:07

Thanks for the detailed feedback @stefanandres. It makes me wonder whether we should abandon the Helm charts, as we don't use them ourselves and don't exercise upgrades with them. They are only a service to our community to lower the entry barrier. But if more and more people start relying on them and new releases break their infrastructure, we should have a plan for how to treat this. I actually prefer kustomize over Helm charts nowadays and could see us going more in that direction. It would also greatly reduce code redundancy and hence make PRs easier.

FxKu · Jul 16 '20 09:07

Ah, this is interesting to hear and explains why I did not find a migration guide for this use case. :-) For us, it really is helpful to have those Helm charts. We base most of our deployments on Helm charts and this fits perfectly into our structure. Having a remote repo of Helm charts and chart versions is a great help when doing operations.

Do you "support" (in a sense that you described that you use it yourselfs and keep it up2date with migrations in mind) using kustomize for your operator? If yes, do you have any documentation on this? Since we are using helmfile to manage our helm deployments, we might also give kustomize a try for this (even though migrating to it might be painful again).

stefanandres · Jul 16 '20 09:07

@FxKu It is really common to use Helm to manage deployments in Kubernetes. Helm 3 is now replacing Helm 2; it is stable, has a lot of pros, and got rid of some cons. Perhaps you could support at least Helm 3 charts if it is complicated to support everything.

In any case, a detailed changelog and some upgrade information would be extremely useful.

hedgss · Jul 16 '20 14:07

I am working on the upgrade process. What was done:

  • Dumped all k8s configuration.
  • Uninstalled postgres-operator 1.2.0 (Helm 2 chart).
  • Installed postgres-operator 1.5.0 using Helm 3.
  • Then I got an error that postgres-operator cannot patch the statefulset. I deleted the statefulset and restarted the postgres-operator. It successfully created a new statefulset and Postgres pods. The Postgres pods use the old PVCs and started successfully, but they cannot form the Postgres cluster properly.
  • I got this conflict. Postgres-operator logs:
 time="2020-07-27T15:49:58Z" level=debug msg="Waiting for 2 pods to become ready" cluster-name=test/myteam-postgresql-cluster pkg=cluster
time="2020-07-27T15:50:13Z" level=info msg="created missing statefulset \"test/myteam-postgresql-cluster\"" cluster-name=test/myteam-postgresql-cluster pkg=cluster
time="2020-07-27T15:50:14Z" level=debug msg="calling Patroni API on a pod test/myteam-postgresql-cluster-0 to set the following Postgres options: map[max_connections:500 max_worker_processes:3 wal_level:logical]" cluster-name=test/myteam-postgresql-cluster pkg=cluster
time="2020-07-27T15:50:14Z" level=debug msg="making PATCH http request: http://10.42.7.20:8008/config" cluster-name=test/myteam-postgresql-cluster pkg=cluster
time="2020-07-27T15:50:14Z" level=warning msg="could not patch postgres parameters with a pod test/myteam-postgresql-cluster-0: patroni returned '<!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 4.01//EN\"\n
     \"http://www.w3.org/TR/html4/strict.dtd\">\n<html>\n    <head>\n        <meta http-equiv=\"Content-Type\" content=\"text/html;charset=utf-8\">\n        <title>Error response</title>\n    </head>\n
   <body>\n        <h1>Error response</h1>\n        <p>Error code: 409</p>\n        <p>Message: Conflict.</p>\n        <p>Error code explanation: 409 - Request conflict.</p>\n    </body>\n</html>\n'" cluster-name=test/myteam-postgresql-cluster pkg=cluster
time="2020-07-27T15:50:14Z" level=debug msg="calling Patroni API on a pod test/myteam-postgresql-cluster-1 to set the following Postgres options: map[max_connections:500 max_worker_processes:3 wal_level:logical]" cluster-name=test/myteam-postgresql-cluster pkg=cluster
time="2020-07-27T15:50:14Z" level=debug msg="making PATCH http request: http://10.42.2.241:8008/config" cluster-name=test/myteam-postgresql-cluster pkg=cluster
time="2020-07-27T15:50:14Z" level=warning msg="could not patch postgres parameters with a pod test/myteam-postgresql-cluster-1: patroni returned '<!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 4.01//EN\"\n
     \"http://www.w3.org/TR/html4/strict.dtd\">\n<html>\n    <head>\n        <meta http-equiv=\"Content-Type\" content=\"text/html;charset=utf-8\">\n        <title>Error response</title>\n    </head>\n
   <body>\n        <h1>Error response</h1>\n        <p>Error code: 409</p>\n        <p>Message: Conflict.</p>\n        <p>Error code explanation: 409 - Request conflict.</p>\n    </body>\n</html>\n'" cluster-name=test/myteam-postgresql-cluster pkg=cluster
time="2020-07-27T15:50:14Z" level=warning msg="error while syncing cluster state: could not sync statefulsets: could not set cluster-wide PostgreSQL configuration options: could not reach Patroni API to set Postgres options: failed on every pod (2 total)" cluster-name=test/myteam-postgresql-cluster pkg=cluster
time="2020-07-27T15:50:14Z" level=error msg="could not sync cluster: could not sync statefulsets: could not set cluster-wide PostgreSQL configuration options: could not reach Patroni API to set Postgres options: failed on every pod (2 total)" cluster-name=test/myteam-postgresql-cluster pkg=controller worker=0
time="2020-07-27T15:50:14Z" level=info msg="Event(v1.ObjectReference{Kind:\"postgresql\", Namespace:\"test\", Name:\"myteam-postgresql-cluster\", UID:\"79de6a7f-cff4-11ea-af1a-0050568c57b5\", APIVersion:\"acid.zalan.do/v1\", ResourceVersion:\"99263463\", FieldPath:\"\"}): type: 'Warning' reason: 'Sync' could not sync cluster: could not sync statefulsets: could not set cluster-wide PostgreSQL configuration options: could not reach Patroni API to set Postgres options: failed on every pod (2 total)"
time="2020-07-27T15:50:14Z" level=debug msg="cluster already exists" cluster-name=test/myteam-postgresql-cluster pkg=controller worker=0

Postgres node logs:

2020-07-28 12:21:32,230 INFO: Got response from myteam-postgresql-cluster-0 http://10.42.7.20:8008/patroni: b'{"state": "running", "postmaster_start_time": "2020-07-27 15:50:01.180 UTC", "role": "replica", "server_version": 110005, "cluster_unlocked": true, "xlog": {"received_location": 67109016, "replayed_location": 67109016, "replayed_timestamp": null, "paused": false}, "timeline": 1, "database_system_identifier": "6854106283562688589", "patroni": {"version": "1.6.0", "scope": "myteam-postgresql-cluster"}}'
2020-07-28 12:21:32,324 INFO: Could not take out TTL lock
2020-07-28 12:21:32,351 INFO: following new leader after trying and failing to obtain lock

Patroni status inside postgres node:

root@myteam-postgresql-cluster-0:/home/postgres# patronictl list
+---------------------------+-----------------------------+-------------+------+---------+----+-----------+
|          Cluster          |            Member           |     Host    | Role |  State  | TL | Lag in MB |
+---------------------------+-----------------------------+-------------+------+---------+----+-----------+
| myteam-postgresql-cluster | myteam-postgresql-cluster-0 |  10.42.7.20 |      | running |  1 |           |
| myteam-postgresql-cluster | myteam-postgresql-cluster-1 | 10.42.2.241 |      | running |  1 |           |
+---------------------------+-----------------------------+-------------+------+---------+----+-----------+
  • I also tried to re-use the old postgres-operator secrets with tokens. Postgres-operator cannot start:
2020/07/28 12:15:50 Fully qualified configmap name: test/postgres-operator-new
2020/07/28 12:15:50 Spilo operator v1.5.0-dirty
panic: Unauthorized

goroutine 1 [running]:
github.com/zalando/postgres-operator/pkg/controller.(*Controller).initOperatorConfig(0xc000464dc0)
        /home/fekunde/go/src/github.com/zalando/postgres-operator/pkg/controller/controller.go:124 +0x650
github.com/zalando/postgres-operator/pkg/controller.(*Controller).initController(0xc000464dc0)
        /home/fekunde/go/src/github.com/zalando/postgres-operator/pkg/controller/controller.go:282 +0x8a2
github.com/zalando/postgres-operator/pkg/controller.(*Controller).Run(0xc000464dc0, 0xc0000b4780, 0xc000302020)
        /home/fekunde/go/src/github.com/zalando/postgres-operator/pkg/controller/controller.go:377 +0x2f
main.main()
        /home/fekunde/go/src/github.com/zalando/postgres-operator/cmd/main.go:82 +0x25e

What else can I try?

Last time I found a command that deletes the stored state of the Postgres cluster. That solution helped me to start a Postgres cluster with the same name. Unfortunately, I cannot find this command anymore; I only remember that it deletes some key from etcd.

hedgss · Jul 28 '20 12:07

If somebody has similar issues, you can do the following.

  1. Delete all Kubernetes endpoints that are connected with your Postgres cluster. In my case they are myteam-postgresql-cluster* (a deletion example follows the listing below).
# kubectl get ep
NAME                               ENDPOINTS          AGE
myteam-postgresql-cluster          10.12.7.24:5432    17h
myteam-postgresql-cluster-config   <none>             15h
myteam-postgresql-cluster-repl     10.12.2.248:5432   17h
postgres-operator-new              10.12.8.14:8080    2d16h
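A hedged example of the deletion step for the names listed above (delete only the endpoints belonging to the Postgres cluster, not the operator's own; the operator recreates what it needs on the next sync):

# add -n NAMESPACE if the cluster does not live in your current namespace
kubectl delete endpoints myteam-postgresql-cluster myteam-postgresql-cluster-config myteam-postgresql-cluster-repl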

Also, if this doesn't help in your case, you can try to use the latest Spilo.

P.S. Thanks a lot Alexander Kukushkin

hedgss · Jul 30 '20 07:07

JFYI, deleting resources is not how this is supposed to be fixed. The issue is that kubernetes.labels in the Patroni configuration didn't match the labels set on these objects. You are supposed to find the difference between them and figure out why it happened. I guess something changed in the operator configuration.
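An illustrative way to compare the two label sets (namespace, pod name and label key follow the examples earlier in this thread; adjust them to your cluster):

# labels the operator has set on the endpoints and pods
kubectl -n test get endpoints,pods -l cluster-name=myteam-postgresql-cluster --show-labels
# labels Patroni is configured to use (kubernetes section of its generated config)
kubectl -n test exec myteam-postgresql-cluster-0 -- grep -A 3 labels /home/postgres/postgres.yml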

CyberDem0n · Jul 30 '20 07:07

@CyberDem0n

But it was the easiest and fastest way to fix it in my case :), because postgres-operator re-creates services and endpoints if they don't exist, and it created them with the correct configuration.

Yes, the root cause was the labels, because they differ between postgres-operator 1.2.0 and 1.5.0.

Thank you for your help and clarification!

hedgss · Jul 30 '20 08:07

@FxKu, @CyberDem0n

I am testing the Spilo upgrade from spilo-11:1.6-p1 to the latest one, spilo-12:1.6-p3.

It starts successfully. I expected that the cluster would be upgraded to Postgres version 12, but that didn't happen.

I set the version in the manifest and restarted the pods, but it didn't take effect:

  postgresql:
    version: "12"

I see no Postgres version in /home/postgres/postgres.yml, and the PGVERSION environment variable is not set either.

How do I upgrade Postgres 11 to Postgres 12 in the Spilo image? Should I do it manually, or will it be done automatically? What is the correct upgrade flow?

hedgss · Aug 05 '20 16:08

Changing the version in the manifest does nothing. In-place version upgrade is not supported yet. See docs.

The version in the docker image name only refers to the latest Postgres version this image contains. It usually supports multiple versions (back to 9.5) in order to do upgrades.
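As an illustrative check (pod name and namespace taken from earlier in this thread), you can list the PostgreSQL versions bundled in a running Spilo container:

# each subdirectory corresponds to one bundled PostgreSQL major version
kubectl -n test exec myteam-postgresql-cluster-0 -- ls /usr/lib/postgresql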

FxKu · Aug 07 '20 12:08

Also joining the party late, but from an enterprise standpoint a lot of companies depend on Helm 3, so supporting Helm 3 might make a lot of sense.

JayGhiya · Dec 22 '23 12:12