
pod labels error: still failing after 200 retries

Open neelasha-09 opened this issue 5 years ago • 3 comments

Hi Team,

While creating a postgres cluster, the PVC takes ~5 minutes to bind, due to which the cluster fails with the error below:

time="2020-07-14T12:32:14Z" level=info msg="Event(v1.ObjectReference{Kind:\"postgresql\", Namespace:\"postgres-test\", Name:\"postgres-operator-cluster\", UID:\"cee078a3-0bea-4175-a6c8-7530590058ac\", APIVersion:\"acid.zalan.do/v1\", ResourceVersion:\"157085264\", FieldPath:\"\"}): type: 'Warning' reason: 'Sync' could not sync cluster: could not sync statefulsets: cluster is not ready: pod labels error: still failing after 200 retries"

Could you please suggest how to increase the number of retries, or the wait time until the PVC is bound, so that we do not hit this error?

Thanks

neelasha-09 avatar Jul 15 '20 06:07 neelasha-09

5 minutes sounds too long. Which provisioner are you using? That said, the timeouts can be configured. Try increasing these values if needed.
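
For example, with the ConfigMap-based configuration, something along these lines should work (the key names are the operator's timeout parameters; the ConfigMap name, namespace and values below are only illustrative):

apiVersion: v1
kind: ConfigMap
metadata:
  name: postgres-operator   # name and namespace depend on your installation
data:
  # how often the operator re-checks a resource while waiting (default: 3s)
  resource_check_interval: "5s"
  # how long to wait for a Kubernetes resource such as the StatefulSet (default: 10m)
  resource_check_timeout: "20m"
  # how long to wait for pods to carry the expected labels (default: 10m);
  # presumably the wait behind "pod labels error: still failing after N retries"
  pod_label_wait_timeout: "20m"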

FxKu avatar Jul 21 '20 16:07 FxKu

Hi

The error says it fails after 200 retries. The resource_check_interval is 3s, which means 3s * 200 retries = 600 s = 10 mins, and the resource_check_timeout is 10 mins by default.

The question is: if the PVC takes ~5 mins to bind, why the error? Please let me know if I missed anything :)

Thanks in advance.

neelasha-09 avatar Aug 24 '20 12:08 neelasha-09

I know this is an ancient issue, but I ran into a similar problem today and thought I would provide some context in case others run into the same situation. This was while provisioning a ZPO-managed Postgresql database on OpenShift CRC. I have provisioned many of them successfully, so this seemed more like a bizarre one-off incident, but I gathered some logs and events in case it helps anyone here at some point.

Operator logs are not saying much:

time="2024-05-22T09:35:19Z" level=info msg="ADD event has been queued" cluster-name=appns/app-db pkg=controller worker=1
time="2024-05-22T09:35:19Z" level=info msg="creating a new Postgres cluster" cluster-name=appns/app-db pkg=controller worker=1
time="2024-05-22T09:35:19Z" level=info msg="master service \"appns/app-db\" has been successfully created" cluster-name=appns/app-db pkg=cluster worker=1
time="2024-05-22T09:35:19Z" level=info msg="replica service \"appns/app-db-repl\" has been successfully created" cluster-name=appns/app-db pkg=cluster worker=1
time="2024-05-22T09:35:19Z" level=debug msg="team API is disabled" cluster-name=appns/app-db pkg=cluster worker=1
time="2024-05-22T09:35:19Z" level=debug msg="team API is disabled" cluster-name=appns/app-db pkg=cluster worker=1
time="2024-05-22T09:35:19Z" level=info msg="users have been initialized" cluster-name=appns/app-db pkg=cluster worker=1
time="2024-05-22T09:35:19Z" level=info msg="syncing secrets" cluster-name=appns/app-db pkg=cluster worker=1
time="2024-05-22T09:35:19Z" level=debug msg="created new secret appns/standby.app-db.credentials.postgresql.acid.zalan.do, namespace: appns, uid: 7a125a12-afca-4322-99da-6d648a0b204e" cluster-name=appns/app-db pkg=cluster worker=1
time="2024-05-22T09:35:19Z" level=debug msg="created new secret appns/appuser.app-db.credentials.postgresql.acid.zalan.do, namespace: appns, uid: d8c479c3-5ee6-48c8-843a-25a2591918df" cluster-name=appns/app-db pkg=cluster worker=1
time="2024-05-22T09:35:19Z" level=debug msg="created new secret appns/postgres.app-db.credentials.postgresql.acid.zalan.do, namespace: appns, uid: ea890b6d-28b9-40b4-be7d-d5ed9cd2b19c" cluster-name=appns/app-db pkg=cluster worker=1
time="2024-05-22T09:35:19Z" level=info msg="secrets have been successfully created" cluster-name=appns/app-db pkg=cluster worker=1
time="2024-05-22T09:35:19Z" level=info msg="pod disruption budget \"appns/postgres-app-db-pdb\" has been successfully created" cluster-name=appns/app-db pkg=cluster worker=1
time="2024-05-22T09:35:19Z" level=debug msg="created new statefulset \"appns/app-db\", uid: \"072aa2ee-a1c2-4f5d-8648-30bea93adcdf\"" cluster-name=appns/app-db pkg=cluster worker=1
time="2024-05-22T09:35:19Z" level=info msg="statefulset \"appns/app-db\" has been successfully created" cluster-name=appns/app-db pkg=cluster worker=1
time="2024-05-22T09:35:19Z" level=info msg="waiting for the cluster being ready" cluster-name=appns/app-db pkg=cluster worker=1
time="2024-05-22T09:35:25Z" level=debug msg="Waiting for 2 pods to become ready" cluster-name=appns/app-db pkg=cluster worker=1
time="2024-05-22T09:45:22Z" level=error msg="failed to create cluster: pod labels error: still failing after 200 retries" cluster-name=appns/app-db pkg=cluster worker=1
time="2024-05-22T09:45:22Z" level=error msg="could not create cluster: pod labels error: still failing after 200 retries" cluster-name=appns/app-db pkg=controller worker=1

The events are more interesting, since they contain "FailedMount" and "NetworkNotReady" warnings, and I cannot explain why those would happen.

100m        Warning   FailedMount             pod/app-db-0                                       MountVolume.SetUp failed for volume "kube-api-access-z79rx" : [object "appns"/"kube-root-ca.crt" not registered, object "appns"/"openshift-service-ca.crt" not registered]
100m        Warning   FailedMount             pod/app-db-1                                       MountVolume.SetUp failed for volume "kube-api-access-vl99g" : [object "appns"/"kube-root-ca.crt" not registered, object "appns"/"openshift-service-ca.crt" not registered]
100m        Warning   NetworkNotReady         pod/app-db-1                                       network is not ready: container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: No CNI configuration file in /etc/kubernetes/cni/net.d/. Has your network provider started?
100m        Warning   NetworkNotReady         pod/app-db-0                                       network is not ready: container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: No CNI configuration file in /etc/kubernetes/cni/net.d/. Has your network provider started?
99m         Warning   FailedMount             pod/app-db-0                                       MountVolume.MountDevice failed for volume "pvc-e51cbbaf-6d9f-4baf-83b0-b7f46a53f1f1" : kubernetes.io/csi: attacher.MountDevice failed to create newCsiDriverClient: driver name kubevirt.io.hostpath-provisioner not found in the list of registered CSI drivers
99m         Warning   FailedMount             pod/app-db-1                                       MountVolume.MountDevice failed for volume "pvc-1be09b97-42e7-46ed-9151-7a5ca99fc5ac" : kubernetes.io/csi: attacher.MountDevice failed to create newCsiDriverClient: driver name kubevirt.io.hostpath-provisioner not found in the list of registered CSI drivers
99m         Normal    AddedInterface          pod/app-db-0                                       Add eth0 [10.217.0.65/23] from openshift-sdn
99m         Normal    Pulled                  pod/app-db-0                                       Container image "ghcr.io/zalando/spilo-15:3.0-p1" already present on machine
99m         Normal    Started                 pod/app-db-0                                       Started container postgres
99m         Normal    Created                 pod/app-db-0                                       Created container postgres
99m         Normal    Pulled                  pod/app-db-1                                       Container image "ghcr.io/zalando/spilo-15:3.0-p1" already present on machine
99m         Normal    AddedInterface          pod/app-db-1                                       Add eth0 [10.217.0.70/23] from openshift-sdn
99m         Normal    Started                 pod/app-db-1                                       Started container postgres
99m         Normal    Created                 pod/app-db-1                                       Created container postgres
53m         Normal    Delete                  postgresql/app-db                                  Started deletion of new cluster resources
53m         Normal    Killing                 pod/app-db-0                                       Stopping container postgres
50m         Normal    Killing                 pod/app-db-1                                       Stopping container postgres
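
For anyone who wants to collect the same kind of information, the usual kubectl commands should be enough, roughly (namespace and pod names match my example above; the operator's namespace and deployment name depend on your installation):

# pods and PVC binding status in the affected namespace
kubectl -n appns get pods,pvc

# recent events, oldest first
kubectl -n appns get events --sort-by=.lastTimestamp

# why a pod is stuck (volumes, scheduling, conditions)
kubectl -n appns describe pod app-db-0

# operator logs
kubectl -n postgres-operator logs deploy/postgres-operator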

I had successfully created another Postgresql database about a minute before the above problem happened, which makes it even stranger.

Operator logs of successful db creation:

time="2024-05-22T09:34:48Z" level=info msg="ADD event has been queued" cluster-name=appns/main-db pkg=controller worker=0
time="2024-05-22T09:34:48Z" level=info msg="creating a new Postgres cluster" cluster-name=appns/main-db pkg=controller worker=0
time="2024-05-22T09:34:49Z" level=info msg="master service \"appns/main-db\" has been successfully created" cluster-name=appns/main-db pkg=cluster worker=0
time="2024-05-22T09:34:49Z" level=info msg="replica service \"appns/main-db-repl\" has been successfully created" cluster-name=appns/main-db pkg=cluster worker=0
time="2024-05-22T09:34:49Z" level=debug msg="team API is disabled" cluster-name=appns/main-db pkg=cluster worker=0
time="2024-05-22T09:34:49Z" level=debug msg="team API is disabled" cluster-name=appns/main-db pkg=cluster worker=0
time="2024-05-22T09:34:49Z" level=info msg="users have been initialized" cluster-name=appns/main-db pkg=cluster worker=0
time="2024-05-22T09:34:49Z" level=info msg="syncing secrets" cluster-name=appns/main-db pkg=cluster worker=0
time="2024-05-22T09:34:49Z" level=debug msg="created new secret appns/ardome.main-db.credentials.postgresql.acid.zalan.do, namespace: appns, uid: 8649dc23-8344-430c-b0b8-9ad3c1bdfbbb" cluster-name=appns/main-db pkg=cluster worker=0
time="2024-05-22T09:34:49Z" level=debug msg="created new secret appns/postgres.main-db.credentials.postgresql.acid.zalan.do, namespace: appns, uid: 8c598bb5-006d-48e6-95d0-2a67741bc4e3" cluster-name=appns/main-db pkg=cluster worker=0
time="2024-05-22T09:34:49Z" level=debug msg="created new secret appns/standby.main-db.credentials.postgresql.acid.zalan.do, namespace: appns, uid: 53e89029-b44d-48c0-97be-9a4e95ba44da" cluster-name=appns/main-db pkg=cluster worker=0
time="2024-05-22T09:34:49Z" level=info msg="secrets have been successfully created" cluster-name=appns/main-db pkg=cluster worker=0
time="2024-05-22T09:34:49Z" level=info msg="pod disruption budget \"appns/postgres-main-db-pdb\" has been successfully created" cluster-name=appns/main-db pkg=cluster worker=0
time="2024-05-22T09:34:49Z" level=debug msg="created new statefulset \"appns/main-db\", uid: \"400bf089-6931-4411-ba16-84b300c15d9e\"" cluster-name=appns/main-db pkg=cluster worker=0
time="2024-05-22T09:34:49Z" level=info msg="statefulset \"appns/main-db\" has been successfully created" cluster-name=appns/main-db pkg=cluster worker=0
time="2024-05-22T09:34:49Z" level=info msg="waiting for the cluster being ready" cluster-name=appns/main-db pkg=cluster worker=0
time="2024-05-22T09:34:55Z" level=debug msg="Waiting for 2 pods to become ready" cluster-name=appns/main-db pkg=cluster worker=0
time="2024-05-22T09:35:43Z" level=info msg="pods are ready" cluster-name=appns/main-db pkg=cluster worker=0
time="2024-05-22T09:35:43Z" level=info msg="Create roles" cluster-name=appns/main-db pkg=cluster worker=0
time="2024-05-22T09:35:43Z" level=debug msg="closing database connection" cluster-name=appns/main-db pkg=cluster worker=0
time="2024-05-22T09:35:43Z" level=info msg="users have been successfully created" cluster-name=appns/main-db pkg=cluster worker=0
time="2024-05-22T09:35:44Z" level=info msg="creating database \"mdb\" owner \"mdbuser\"" cluster-name=appns/main-db pkg=cluster worker=0
time="2024-05-22T09:35:44Z" level=debug msg="closing database connection" cluster-name=appns/main-db pkg=cluster worker=0
time="2024-05-22T09:35:44Z" level=info msg="databases have been successfully created" cluster-name=appns/main-db pkg=cluster worker=0
time="2024-05-22T09:35:44Z" level=info msg="found pod disruption budget: \"appns/postgres-main-db-pdb\" (uid: \"6b353b91-26b1-4bda-b096-6fe07596e875\")" cluster-name=appns/main-db pkg=cluster worker=0
time="2024-05-22T09:35:44Z" level=info msg="found statefulset: \"appns/main-db\" (uid: \"400bf089-6931-4411-ba16-84b300c15d9e\")" cluster-name=appns/main-db pkg=cluster worker=0
time="2024-05-22T09:35:44Z" level=info msg="found secret: \"appns/standby.main-db.credentials.postgresql.acid.zalan.do\" (uid: \"53e89029-b44d-48c0-97be-9a4e95ba44da\") namesapce: appns" cluster-name=appns/main-db pkg=cluster worker=0
time="2024-05-22T09:35:44Z" level=info msg="found secret: \"appns/ardome.main-db.credentials.postgresql.acid.zalan.do\" (uid: \"8649dc23-8344-430c-b0b8-9ad3c1bdfbbb\") namesapce: appns" cluster-name=appns/main-db pkg=cluster worker=0
time="2024-05-22T09:35:44Z" level=info msg="found secret: \"appns/postgres.main-db.credentials.postgresql.acid.zalan.do\" (uid: \"8c598bb5-006d-48e6-95d0-2a67741bc4e3\") namesapce: appns" cluster-name=appns/main-db pkg=cluster worker=0
time="2024-05-22T09:35:44Z" level=info msg="found master service: \"appns/main-db\" (uid: \"3419103e-b7ca-4359-8550-0872a7936623\")" cluster-name=appns/main-db pkg=cluster worker=0
time="2024-05-22T09:35:44Z" level=info msg="found replica service: \"appns/main-db-repl\" (uid: \"3d84525a-afa5-4e28-88da-a3b1e2a78877\")" cluster-name=appns/main-db pkg=cluster worker=0
time="2024-05-22T09:35:44Z" level=info msg="found pod: \"appns/main-db-0\" (uid: \"0edb066f-86ee-4cad-ae32-6d3d6b2b65c9\")" cluster-name=appns/main-db pkg=cluster worker=0
time="2024-05-22T09:35:44Z" level=info msg="found pod: \"appns/main-db-1\" (uid: \"7c3ec328-aea3-4d8c-bb40-a67bd3d0e181\")" cluster-name=appns/main-db pkg=cluster worker=0
time="2024-05-22T09:35:44Z" level=info msg="found PVC: \"appns/pgdata-main-db-0\" (uid: \"226a6f4e-8b1e-48c1-acd1-84065545f64d\")" cluster-name=appns/main-db pkg=cluster worker=0
time="2024-05-22T09:35:44Z" level=info msg="found PVC: \"appns/pgdata-main-db-1\" (uid: \"afa44811-13b4-4a99-8161-7b7a85d19e87\")" cluster-name=appns/main-db pkg=cluster worker=0
time="2024-05-22T09:35:44Z" level=debug msg="syncing connection pooler (master, replica) from (false, nil) to (false, nil)" cluster-name=appns/main-db pkg=cluster worker=0
time="2024-05-22T09:35:44Z" level=info msg="cluster has been created" cluster-name=appns/main-db pkg=controller worker=0

The problematic database is just a test/dev database so I ended up deleting it and recreating it, and that worked fine.
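
For completeness, deleting and recreating such a cluster basically amounts to removing the postgresql custom resource and re-applying its manifest, roughly (the manifest filename is just a placeholder):

# delete the postgresql custom resource; the operator then tears down the cluster resources
kubectl -n appns delete postgresql app-db

# re-apply the original manifest so the operator creates the cluster again
kubectl -n appns apply -f app-db.yaml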

bergner avatar May 22 '24 10:05 bergner

@FxKu Hi, still facing the same issue in 2025. I got the same problem as bergner, and the reason is unknown. I am using Zalando postgres operator version 1.10.1. Are there any updates or fixes?

Razher avatar Jan 17 '25 11:01 Razher