Postgres-operator cannot manage postgres cluster after upgrade.
Hello,
I upgraded the postgres-operator from version 1.2.0 to 1.5.0 (Kubernetes: 1.18.3). What I have done:
- We upgraded the k8s cluster from 1.14.6 to 1.18.3.
- I found out that the old postgres-operator doesn't work properly with a manifest; it cannot create a new Postgres cluster.
- Installed the latest postgres-operator in another namespace. Both postgres-operator versions were watching only their own namespace. Added a new manifest and checked that the new cluster works.
- Upgraded the old instance of postgres-operator to the latest version. I was using Helm 2.
- I got the errors below.
- Deleted the Helm 2 release.
- Installed it using Helm 3. No changes.
I'd really appreciate it if somebody could clarify what I did wrong and how to fix it.
time="2020-07-01T14:58:08Z" level=warning msg="error while syncing cluster state: could not sync statefulsets: could not update statefulset: could not patch statefulset spec \"data/myteam-postgres-cluster-0\": StatefulSet.apps \"myteam-postgres-cluster-0\" is invalid: spec: Forbidden: updates to statefulset spec for fields other than 'replicas', 'template', and 'updateStrategy' are forbidden" cluster-name=data/myteam-postgres-cluster-0 pkg=cluster time="2020-07-01T14:58:08Z" level=error msg="could not sync cluster: could not sync statefulsets: could not update statefulset: could not patch statefulset spec \"data/myteam-postgres-cluster-0\": StatefulSet.apps \"myteam-postgres-cluster-0\" is invalid: spec: Forbidden: updates to statefulset spec for fields other than 'replicas', 'template', and 'updateStrategy' are forbidden" cluster-name=data/myteam-postgres-cluster-0 pkg=controller worker=1 time="2020-07-01T14:58:09Z" level=debug msg="Generating Spilo container, environment variables: [{SCOPE myteam-postgres-cluster-1 nil} {PGROOT /home/postgres/pgdata/pgroot nil} {POD_IP &EnvVarSource{FieldRef:&ObjectFieldSelector{APIVersion:v1,FieldPath:status.podIP,},ResourceFieldRef:nil,ConfigMapKeyRef:nil,SecretKeyRef:nil,}} {POD_NAMESPACE &EnvVarSource{FieldRef:&ObjectFieldSelector{APIVersion:v1,FieldPath:metadata.namespace,},ResourceFieldRef:nil,ConfigMapKeyRef:nil,SecretKeyRef:nil,}} {PGUSER_SUPERUSER postgres nil} {KUBERNETES_SCOPE_LABEL cluster-name nil} {KUBERNETES_ROLE_LABEL spilo-role nil} {PGPASSWORD_SUPERUSER &EnvVarSource{FieldRef:nil,ResourceFieldRef:nil,ConfigMapKeyRef:nil,SecretKeyRef:&SecretKeySelector{LocalObjectReference:LocalObjectReference{Name:postgres.myteam-postgres-cluster-1.credentials,},Key:password,Optional:nil,},}} {PGUSER_STANDBY standby nil} {PGPASSWORD_STANDBY &EnvVarSource{FieldRef:nil,ResourceFieldRef:nil,ConfigMapKeyRef:nil,SecretKeyRef:&SecretKeySelector{LocalObjectReference:LocalObjectReference{Name:standby.myteam-postgres-cluster-1.credentials,},Key:password,Optional:nil,},}} {PAM_OAUTH2 https://info.example.com/oauth2/tokeninfo?access_token= uid realm=/employees nil} {HUMAN_ROLE zalandos nil} {KUBERNETES_LABELS {\"application\":\"spilo\"} nil} {SPILO_CONFIGURATION {\"postgresql\":{\"bin_dir\":\"/usr/lib/postgresql/11/bin\",\"parameters\":{\"bg_mon.naptime\":\"10\",\"bg_mon.port\":\"8000\",\"default_statistics_target\":\"100\",\"effective_cache_size\":\"2304MB\",\"log_min_error_statement\":\"FATAL\",\"log_min_messages\":\"FATAL\",\"log_statement\":\"ddl\",\"maintenance_work_mem\":\"192MB\",\"max_wal_size\":\"2GB\",\"min_wal_size\":\"1GB\",\"shared_buffers\":\"768MB\",\"wal_buffers\":\"16MB\",\"work_mem\":\"3932kB\"},\"pg_hba\":[\"host all all 127.0.0.1/32 trust\",\"hostssl all all 0.0.0.0/0 md5\",\"host all all 0.0.0.0/0 md5\"]},\"bootstrap\":{\"initdb\":[{\"auth-host\":\"md5\"},{\"auth-local\":\"trust\"},\"data-checksums\",{\"encoding\":\"UTF8\"},{\"locale\":\"en_US.UTF-8\"}],\"users\":{\"zalandos\":{\"password\":\"\",\"options\":[\"CREATEDB\",\"NOLOGIN\"]}},\"dcs\":{\"ttl\":30,\"loop_wait\":10,\"retry_timeout\":10,\"maximum_lag_on_failover\":33554432,\"postgresql\":{\"parameters\":{\"max_connections\":\"100\",\"wal_level\":\"logical\"}}}}} nil} {DCS_ENABLE_KUBERNETES_API true nil} {BG_MON_PORT 8000 nil} {FLUENTBIT_LOG_LEVEL info nil} {KAFKA_LOGS_BROCKERS kafka.auxiliary:9092 nil} {KAFKA_LOGS_TOPICS fluentbit-logs nil} {MONITORING_INTERVAL 10s nil} {TELEGRAF_METRIC_BATCH_SIZE 1001 nil} {TELEGRAF_METRIC_BUFFER_LIMIT 60000 nil} {TELEGRAF_PROMETHEUS_METRICS_PORT 8080 
nil}]" cluster-name=data/myteam-postgres-cluster-1 pkg=cluster time="2020-07-01T14:58:09Z" level=warning msg="error while syncing cluster state: could not sync statefulsets: could not update statefulset: could not patch statefulset spec \"data/myteam-postgres-cluster-1\": StatefulSet.apps \"myteam-postgres-cluster-1\" is invalid: spec: Forbidden: updates to statefulset spec for fields other than 'replicas', 'template', and 'updateStrategy' are forbidden" cluster-name=data/myteam-postgres-cluster-1 pkg=cluster time="2020-07-01T14:58:09Z" level=error msg="could not update status: postgresql.acid.zalan.do \"myteam-postgres-cluster-1\" is invalid: spec.patroni.slots: Invalid value: \"null\": spec.patroni.slots in body must be of type object: \"null\"" cluster-name=data/myteam-postgres-cluster-1 pkg=cluster time="2020-07-01T14:58:09Z" level=error msg="could not sync cluster: could not sync statefulsets: could not update statefulset: could not patch statefulset spec \"data/myteam-postgres-cluster-1\": StatefulSet.apps \"myteam-postgres-cluster-1\" is invalid: spec: Forbidden: updates to statefulset spec for fields other than 'replicas', 'template', and 'updateStrategy' are forbidden" cluster-name=data/myteam-postgres-cluster-1 pkg=controller worker=2 time="2020-07-01T14:58:24Z" level=error msg="could not connect to PostgreSQL database: dial tcp 10.43.57.207:5432: i/o timeout" cluster-name=data/myteam-postgres-cluster-2 pkg=cluster time="2020-07-01T14:58:39Z" level=error msg="could not connect to PostgreSQL database: dial tcp 10.43.57.207:5432: i/o timeout" cluster-name=data/myteam-postgres-cluster-2 pkg=cluster time="2020-07-01T14:58:54Z" level=error msg="could not connect to PostgreSQL database: dial tcp 10.43.57.207:5432: i/o timeout" cluster-name=data/myteam-postgres-cluster-2 pkg=cluster time="2020-07-01T14:59:09Z" level=error msg="could not connect to PostgreSQL database: dial tcp 10.43.57.207:5432: i/o timeout" cluster-name=data/myteam-postgres-cluster-2 pkg=cluster time="2020-07-01T14:59:24Z" level=error msg="could not connect to PostgreSQL database: dial tcp 10.43.57.207:5432: i/o timeout" cluster-name=data/myteam-postgres-cluster-2 pkg=cluster time="2020-07-01T14:59:39Z" level=error msg="could not connect to PostgreSQL database: dial tcp 10.43.57.207:5432: i/o timeout" cluster-name=data/myteam-postgres-cluster-2 pkg=cluster time="2020-07-01T14:59:54Z" level=error msg="could not connect to PostgreSQL database: dial tcp 10.43.57.207:5432: i/o timeout" cluster-name=data/myteam-postgres-cluster-2 pkg=cluster time="2020-07-01T15:00:09Z" level=error msg="could not connect to PostgreSQL database: dial tcp 10.43.57.207:5432: i/o timeout" cluster-name=data/myteam-postgres-cluster-2 pkg=cluster time="2020-07-01T15:00:09Z" level=warning msg="error while syncing cluster state: could not sync roles: could not init db connection: could not init db connection: still failing after 8 retries" cluster-name=data/myteam-postgres-cluster-2 pkg=cluster time="2020-07-01T15:00:09Z" level=error msg="could not sync cluster: could not sync roles: could not init db connection: could not init db connection: still failing after 8 retries" cluster-name=data/myteam-postgres-cluster-2 pkg=controller worker=4
How did you upgrade the operator? Have you updated the cluster roles, too? There were some changes between v1.2.0 and v1.5.0; Spilo pods now use their own cluster role postgres-pod.
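For reference, a quick way to check whether the newer RBAC objects are in place; names and the manifest path follow what ships in this repo, so adjust them if you installed via the Helm chart:
kubectl get clusterrole postgres-operator postgres-pod
kubectl get serviceaccount postgres-operator -n <operator-namespace>
# if something is missing, (re)apply the RBAC manifest from a repo checkout, e.g.:
kubectl apply -f manifests/operator-service-account-rbac.yaml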
I can also see errors coming from the CRD validation that was introduced with v1.3.0. It can hit you when the status of the Postgres CRD is updated: the manifest is then validated again, and the way you specified slots under the Patroni section seems to be wrong. I remember we once had it wrong in our example manifest, too.
postgresql.acid.zalan.do \"myteam-postgres-cluster-1\" is invalid: spec.patroni.slots: Invalid value: \"null\": spec.patroni.slots in body must be of type object: \"null\""
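For anyone hitting the same validation error: slots must be a map of slot names to their settings, roughly as in the complete example manifest (the slot, database and plugin values below are only placeholders):
patroni:
  slots:
    permanent_logical_1:
      type: logical
      database: foo
      plugin: pgoutput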
I upgraded the Helm installation for postgres-operator. The Postgres clusters were still available, but the postgres-operator could not manage them anymore.
I had some errors related to the manifests; I fixed them, but it didn't solve the issue.
Eventually, I removed the postgres-operator and all related configs, installed the new one using Helm 3, created new Postgres clusters, and migrated all the data. Now it works, but the thing is that I will need to upgrade more postgres-operator instances and Postgres clusters.
How can it be done with less effort? What should I pay attention to?
Also, I am going to upgrade the Spilo version from spilo-11:1.6-p1 to the latest one. The latest one uses a newer PostgreSQL. Will Spilo handle it automatically? I was using a custom Spilo image with pg_partman; the new Spilo image already contains it. How will it be migrated?
In spilo-11:1.6-p1 it has a permission configuration.
To update Spilo you simply change the Docker image in the config, and a rolling update will be performed. To update your other clusters, better update the RBAC part and the configuration first, before updating the image in the operator deployment. And of course, exercise it in a test environment first, and provide feedback here if it doesn't work, before doing it in production.
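For a ConfigMap-based operator configuration this comes down to a single key; the image tag below is only an illustration:
# excerpt from the postgres-operator ConfigMap
data:
  docker_image: registry.opensource.zalan.do/acid/spilo-12:1.6-p3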
We had the same issue. I was using your Helm chart (which does the RBAC stuff) and upgraded from 1.3 to 1.5.
- The operator upgraded itself and wants to change the StatefulSets (which it cannot, because those fields are immutable).
- I deleted the StatefulSets to let the operator recreate them (see the sketch after the error output below).
- The new clusters could not spawn because it seems like they tried to bootstrap again, but the persistent volumes still had the data (of course, because we want to migrate it), and it fails with:
psycopg2.OperationalError: FATAL: password authentication failed for user "standby"
no pg_hba.conf entry for replication connection from host "127.0.0.1", user "standby", SSL off
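For later readers: the StatefulSet deletion mentioned above does not have to take the pods down. A non-cascading delete orphans the running pods, and the operator only has to recreate the StatefulSet object around them; the flag below is the pre-1.20 kubectl spelling and the name is a placeholder:
# orphan the pods, delete only the StatefulSet object
kubectl delete statefulset <cluster-name> --cascade=false
# then let the operator recreate it on its next sync (or restart the operator pod)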
What I saw later was that the suffix for the credential secret name changed, so it created totally new passwords. Maybe this can even be prevented by setting the same secret suffix.
In the end, for my use case I pg_dumped all databases of all clusters manually; deleted all postgresql cluster objects, all pods, all PVCs, all secrets, and all PDBs (because 1.5 did not recognize those objects anymore); recreated all cluster objects; and then restored all databases via cat dump.sql | psql $database for each cluster, setting the new secret credential name for each application using those clusters. This took several days, but at least it's working again for me.
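A rough sketch of that per-cluster dump/restore loop, assuming the operator's default secret naming (postgres.<cluster>.credentials); the host, cluster and file names are placeholders:
# dump from the old cluster
export PGPASSWORD=$(kubectl get secret postgres.myteam-postgresql-cluster.credentials -o 'jsonpath={.data.password}' | base64 -d)
pg_dumpall -h myteam-postgresql-cluster -U postgres > dump.sql
# delete and recreate the postgresql object, wait for the new pods, fetch the new password, then:
cat dump.sql | psql -h myteam-postgresql-cluster -U postgres postgres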
Thanks for the detailed feedback @stefanandres. It makes me wonder if we should abandon the Helm charts, as we don't use them ourselves and don't exercise upgrades with them. They are only a service to our community to lower the entrance barrier. But if more and more people start relying on them and new releases break their infrastructure, we should have a plan for how to treat this. I actually prefer kustomize over Helm charts nowadays and could think of going more in this direction. It would also greatly reduce code redundancy and hence make PRs easier.
Ah, this is interesting to hear and explains why I did not find a migration guide for this use case. :-) For us, it really is helpful to have those Helm charts. We base most of our deployments on Helm charts and this fits perfectly into our structure. Having a remote repo of Helm charts and chart versions is a great help when doing operations.
Do you "support" (in a sense that you described that you use it yourselfs and keep it up2date with migrations in mind) using kustomize for your operator? If yes, do you have any documentation on this? Since we are using helmfile to manage our helm deployments, we might also give kustomize a try for this (even though migrating to it might be painful again).
@FxKu It is really common to use Helm to manage deployments in Kubernetes. Helm 3 is now replacing Helm 2; it is stable, has a lot of pros, and got rid of some cons. Perhaps you could support at least Helm 3 charts if it is complicated to support everything.
In any case, a detailed changelog and some upgrade information would be extremely useful.
I am working on the upgrade process. What was done:
- Dumped all k8s configuration.
- Uninstalled Postgres-operator 1.2.0 (Helm 2 chart).
- Installed Postgres-operator 1.5.0 using Helm 3.
- Then I got an error that the postgres-operator cannot patch the StatefulSet. I deleted the StatefulSet and restarted the postgres-operator. It successfully created a new StatefulSet and Postgres nodes. The Postgres nodes use the old PVCs and started successfully, but they cannot form the Postgres cluster properly.
- I ran into this conflict. Postgres-operator logs:
time="2020-07-27T15:49:58Z" level=debug msg="Waiting for 2 pods to become ready" cluster-name=test/myteam-postgresql-cluster pkg=cluster
time="2020-07-27T15:50:13Z" level=info msg="created missing statefulset \"test/myteam-postgresql-cluster\"" cluster-name=test/myteam-postgresql-cluster pkg=cluster
time="2020-07-27T15:50:14Z" level=debug msg="calling Patroni API on a pod test/myteam-postgresql-cluster-0 to set the following Postgres options: map[max_connections:500 max_worker_processes:3 wal_level:logical]" cluster-name=test/myteam-postgresql-cluster pkg=cluster
time="2020-07-27T15:50:14Z" level=debug msg="making PATCH http request: http://10.42.7.20:8008/config" cluster-name=test/myteam-postgresql-cluster pkg=cluster
time="2020-07-27T15:50:14Z" level=warning msg="could not patch postgres parameters with a pod test/myteam-postgresql-cluster-0: patroni returned '<!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 4.01//EN\"\n
\"http://www.w3.org/TR/html4/strict.dtd\">\n<html>\n <head>\n <meta http-equiv=\"Content-Type\" content=\"text/html;charset=utf-8\">\n <title>Error response</title>\n </head>\n
<body>\n <h1>Error response</h1>\n <p>Error code: 409</p>\n <p>Message: Conflict.</p>\n <p>Error code explanation: 409 - Request conflict.</p>\n </body>\n</html>\n'" cluster-name=test/myteam-postgresql-cluster pkg=cluster
time="2020-07-27T15:50:14Z" level=debug msg="calling Patroni API on a pod test/myteam-postgresql-cluster-1 to set the following Postgres options: map[max_connections:500 max_worker_processes:3 wal_level:logical]" cluster-name=test/myteam-postgresql-cluster pkg=cluster
time="2020-07-27T15:50:14Z" level=debug msg="making PATCH http request: http://10.42.2.241:8008/config" cluster-name=test/myteam-postgresql-cluster pkg=cluster
time="2020-07-27T15:50:14Z" level=warning msg="could not patch postgres parameters with a pod test/myteam-postgresql-cluster-1: patroni returned '<!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 4.01//EN\"\n
\"http://www.w3.org/TR/html4/strict.dtd\">\n<html>\n <head>\n <meta http-equiv=\"Content-Type\" content=\"text/html;charset=utf-8\">\n <title>Error response</title>\n </head>\n
<body>\n <h1>Error response</h1>\n <p>Error code: 409</p>\n <p>Message: Conflict.</p>\n <p>Error code explanation: 409 - Request conflict.</p>\n </body>\n</html>\n'" cluster-name=test/myteam-postgresql-cluster pkg=cluster
time="2020-07-27T15:50:14Z" level=warning msg="error while syncing cluster state: could not sync statefulsets: could not set cluster-wide PostgreSQL configuration options: could not reach Patroni API to set Postgres options: failed on every pod (2 total)" cluster-name=test/myteam-postgresql-cluster pkg=cluster
time="2020-07-27T15:50:14Z" level=error msg="could not sync cluster: could not sync statefulsets: could not set cluster-wide PostgreSQL configuration options: could not reach Patroni API to set Postgres options: failed on every pod (2 total)" cluster-name=test/myteam-postgresql-cluster pkg=controller worker=0
time="2020-07-27T15:50:14Z" level=info msg="Event(v1.ObjectReference{Kind:\"postgresql\", Namespace:\"test\", Name:\"myteam-postgresql-cluster\", UID:\"79de6a7f-cff4-11ea-af1a-0050568c57b5\", APIVersion:\"acid.zalan.do/v1\", ResourceVersion:\"99263463\", FieldPath:\"\"}): type: 'Warning' reason: 'Sync' could not sync cluster: could not sync statefulsets: could not set cluster-wide PostgreSQL configuration options: could not reach Patroni API to set Postgres options: failed on every pod (2 total)"
time="2020-07-27T15:50:14Z" level=debug msg="cluster already exists" cluster-name=test/myteam-postgresql-cluster pkg=controller worker=0
Postgres node logs:
2020-07-28 12:21:32,230 INFO: Got response from myteam-postgresql-cluster-0 http://10.42.7.20:8008/patroni: b'{"state": "running", "postmaster_start_time": "2020-07-27 15:50:01.180 UTC", "role": "replica", "server_version": 110005, "cluster_unlocked": true, "xlog": {"received_location": 67109016, "replayed_location": 67109016, "replayed_timestamp": null, "paused": false}, "timeline": 1, "database_system_identifier": "6854106283562688589", "patroni": {"version": "1.6.0", "scope": "myteam-postgresql-cluster"}}'
2020-07-28 12:21:32,324 INFO: Could not take out TTL lock
2020-07-28 12:21:32,351 INFO: following new leader after trying and failing to obtain lock
Patroni status inside postgres node:
root@myteam-postgresql-cluster-0:/home/postgres# patronictl list
+---------------------------+-----------------------------+-------------+------+---------+----+-----------+
| Cluster | Member | Host | Role | State | TL | Lag in MB |
+---------------------------+-----------------------------+-------------+------+---------+----+-----------+
| myteam-postgresql-cluster | myteam-postgresql-cluster-0 | 10.42.7.20 | | running | 1 | |
| myteam-postgresql-cluster | myteam-postgresql-cluster-1 | 10.42.2.241 | | running | 1 | |
+---------------------------+-----------------------------+-------------+------+---------+----+-----------+
- I also tried to re-use the old postgres-operator secrets with tokens. The postgres-operator cannot start:
2020/07/28 12:15:50 Fully qualified configmap name: test/postgres-operator-new
2020/07/28 12:15:50 Spilo operator v1.5.0-dirty
panic: Unauthorized
goroutine 1 [running]:
github.com/zalando/postgres-operator/pkg/controller.(*Controller).initOperatorConfig(0xc000464dc0)
/home/fekunde/go/src/github.com/zalando/postgres-operator/pkg/controller/controller.go:124 +0x650
github.com/zalando/postgres-operator/pkg/controller.(*Controller).initController(0xc000464dc0)
/home/fekunde/go/src/github.com/zalando/postgres-operator/pkg/controller/controller.go:282 +0x8a2
github.com/zalando/postgres-operator/pkg/controller.(*Controller).Run(0xc000464dc0, 0xc0000b4780, 0xc000302020)
/home/fekunde/go/src/github.com/zalando/postgres-operator/pkg/controller/controller.go:377 +0x2f
main.main()
/home/fekunde/go/src/github.com/zalando/postgres-operator/cmd/main.go:82 +0x25e
What else can I try?
The previous time, I found a command that deletes data about the Postgres cluster; that solution helped me start a Postgres cluster with the same name. Unfortunately, I cannot find this command anymore. I remember only that it deletes some key from etcd.
If somebody has similar issues, you can do the following.
- Delete all Kubernetes endpoints that are connected with your Postgres cluster. In my case these are myteam-postgresql-cluster*:
# kubectl get ep
NAME ENDPOINTS AGE
myteam-postgresql-cluster 10.12.7.24:5432 17h
myteam-postgresql-cluster-config <none> 15h
myteam-postgresql-cluster-repl 10.12.2.248:5432 17h
postgres-operator-new 10.12.8.14:8080 2d16h
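The deletion itself is a one-liner over the cluster endpoints listed above (leave the operator's own endpoint alone):
kubectl delete ep myteam-postgresql-cluster myteam-postgresql-cluster-config myteam-postgresql-cluster-repl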
Also, if that doesn't help in your case, you can try to use the latest Spilo.
P.S. Thanks a lot, Alexander Kukushkin.
JFYI, deleting resources is not how it is supposed to be fixed.
The issue is that kubernetes.labels in the Patroni configuration didn't match the labels set on these objects.
You are supposed to find the difference between them and figure out why it happened.
I guess something changed in the operator configuration.
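A hedged way to compare the two sides, reusing the pod name, the application=spilo label, and the postgres.yml path that appear earlier in this thread:
# labels currently set on the operator-managed objects
kubectl get ep,svc,pods -l application=spilo --show-labels
# kubernetes.labels that Patroni was configured with
kubectl exec myteam-postgresql-cluster-0 -- grep -A3 -i labels /home/postgres/postgres.yml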
@CyberDem0n
But it was the easiest and fastest way to fix it in my case :), because the postgres-operator will re-create services and endpoints if they don't exist. It created them with the correct configuration.
Yes, the main cause was the labels, because they are different between postgres-operator 1.2.0 and 1.5.0.
Thank you for your help and clarification!
@FxKu, @CyberDem0n
I am testing the Spilo upgrade from spilo-11:1.6-p1 to the latest one, spilo-12:1.6-p3.
It starts successfully. I expected that the cluster would be upgraded to Postgres version 12, but it didn't happen.
I set the version in the manifest and restarted the pods, but it didn't take effect:
postgresql:
version: "12"
I see no Postgres version in /home/postgres/postgres.yml, and the environment variable PGVERSION is also not set.
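For reference, the checks behind that statement look roughly like this (the pod name is illustrative):
kubectl exec myteam-postgresql-cluster-0 -- env | grep -i pgversion
kubectl exec myteam-postgresql-cluster-0 -- grep -i version /home/postgres/postgres.yml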
How do I upgrade Postgres 11 to Postgres 12 in the Spilo image? Should I do it manually, or will it be done automatically? What is the correct upgrade flow?
Changing the version in the manifest does nothing; in-place version upgrades are not supported yet. See the docs.
The version in the Docker image name only refers to the latest Postgres version this image contains. It usually supports multiple versions (back to 9.5) in order to do upgrades.
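Until in-place upgrades are supported, one manual path, in the spirit of the dump-and-restore described earlier in this thread, is to create a second cluster on the higher version and restore into it; the manifest below is only a placeholder sketch:
apiVersion: "acid.zalan.do/v1"
kind: postgresql
metadata:
  name: myteam-postgresql-cluster-v12   # placeholder name for the new cluster
spec:
  postgresql:
    version: "12"
  # ... remaining spec copied from the existing cluster manifest ...
# then restore a dump from the Postgres 11 cluster into this one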
Also joining the party late, but coming from an enterprise standpoint, a lot of them are dependent on Helm 3, so supporting Helm 3 might make a lot of sense.