
POD container restarting due to definition changed in 1.14.0 operator after enabling Wiz

Open Ajith61 opened this issue 7 months ago • 4 comments

  • Which image of the operator are you using? e.g. ghcr.io/zalando/postgres-operator:v1.14.0
  • Type of issue? - question
  • Spilo? - https://github.com/zalando/spilo/releases/tag/3.0-p1

Hi All,

We are facing a container restart issue with the 1.14.0 operator. The issue does not occur with version 1.9.0. The postgres / postgres-exporter container restarts during some of the operator sync intervals (not on every sync).

STS event :

Events:
  Type    Reason   Age                  From     Message
  ----    ------   ----                 ----     -------
  Normal  Killing  43m                  kubelet  Container postgres-exporter definition changed, will be restarted
  Normal  Pulling  43m                  kubelet  Pulling image "docker.com/wrouesnel/postgres_exporter:latest@sha256:54bd3ba6bc39a9da2bf382667db4dc249c96e4cfc837dafe91d6cc7d362829e0"
  Normal  Created  43m (x2 over 3d22h)  kubelet  Created container: postgres-exporter
  Normal  Started  43m (x2 over 3d22h)  kubelet  Started container postgres-exporter
  Normal  Pulled   43m                  kubelet  Successfully pulled image "docker.com/wrouesnel/postgres_exporter:latest@sha256:54bd3ba6bc39a9da2bf382667db4dc249c96e4cfc837dafe91d6cc7d362829e0" in 1.071s (1.071s including waiting). Image size: 33164884 bytes.

State:          Running
  Started:      Mon, 21 Apr 2025 10:37:10 +0530
Last State:     Terminated
  Reason:       Error
  Exit Code:    2
  Started:      Thu, 17 Apr 2025 13:09:10 +0530
  Finished:     Mon, 21 Apr 2025 10:37:09 +0530
Ready:          True

We are also noticing that pods are being recreated with the reason "pod not yet restarted due to lazy update".

Operator Log :

time="2025-04-21T06:43:04Z" level=debug msg="syncing pod disruption budgets" cluster-name=pg-pgspilotest3/pg-pgspilotest3 pkg=cluster worker=2 time="2025-04-21T06:43:04Z" level=debug msg="syncing roles" cluster-name=pg-pgspilotest3/pg-pgspilotest3 pkg=cluster worker=2 time="2025-04-21T06:43:11Z" level=debug msg="syncing Patroni config" cluster-name=pg-pgspilotest1/pg-pgspilotest1 pkg=cluster time="2025-04-21T06:43:11Z" level=debug msg="making GET http request: http://192.168.14.210:8008/config" cluster-name=pg-pgspilotest1/pg-pgspilotest1 pkg=cluster time="2025-04-21T06:43:11Z" level=debug msg="making GET http request: http://192.168.14.210:8008/patroni" cluster-name=pg-pgspilotest1/pg-pgspilotest1 pkg=cluster time="2025-04-21T06:43:11Z" level=debug msg="syncing pod disruption budgets" cluster-name=pg-pgspilotest1/pg-pgspilotest1 pkg=cluster time="2025-04-21T06:43:11Z" level=debug msg="syncing roles" cluster-name=pg-pgspilotest1/pg-pgspilotest1 pkg=cluster time="2025-04-21T06:43:11Z" level=info msg="mark rolling update annotation for pg-pgspilotest2-1: reason pod not yet restarted due to lazy update" cluster-name=pg-pgspilotest2/pg-pgspilotest2 pkg=cluster time="2025-04-21T06:43:11Z" level=debug msg="syncing Patroni config" cluster-name=pg-pgspilotest2/pg-pgspilotest2 pkg=cluster time="2025-04-21T06:43:11Z" level=debug msg="making GET http request: http://192.168.38.25:8008/config" cluster-name=pg-pgspilotest2/pg-pgspilotest2 pkg=cluster time="2025-04-21T06:43:11Z" level=debug msg="making GET http request: http://192.168.14.243:8008/config" cluster-name=pg-pgspilotest2/pg-pgspilotest2 pkg=cluster time="2025-04-21T06:43:11Z" level=debug msg="making GET http request: http://192.168.38.25:8008/patroni" cluster-name=pg-pgspilotest2/pg-pgspilotest2 pkg=cluster time="2025-04-21T06:43:11Z" level=debug msg="making GET http request: http://192.168.14.243:8008/patroni" cluster-name=pg-pgspilotest2/pg-pgspilotest2 pkg=cluster time="2025-04-21T06:43:11Z" level=info msg="performing rolling update" cluster-name=pg-pgspilotest2/pg-pgspilotest2 pkg=cluster time="2025-04-21T06:43:11Z" level=info msg="there are 2 pods in the cluster to recreate" cluster-name=pg-pgspilotest2/pg-pgspilotest2 pkg=cluster time="2025-04-21T06:43:11Z" level=debug msg="subscribing to pod "pg-pgspilotest2/pg-pgspilotest2-0"" cluster-name=pg-pgspilotest2/pg-pgspilotest2 pkg=cluster

Ajith61 avatar Apr 21 '25 06:04 Ajith61

@Ajith61 can you find the reason for the rolling update in the operator logs? There must be a diff logged somewhere above around syncing statefulset.

FxKu avatar Apr 23 '25 07:04 FxKu

@Ajith61 can you find the reason for the rolling update in the operator logs? There must be a diff logged somewhere above around syncing statefulset.

Thanks @FxKu for the response. I think this issue might be caused by Wiz (https://www.wiz.io/solutions/container-and-kubernetes-security) in my cluster: the problem only started after we enabled Wiz. I noticed that a Wiz-related annotation is added to the pods, and I'm not sure whether that is what triggers the container restart/rolling update during the operator's sync of the cluster.

apiVersion: v1
kind: Pod
metadata:
  annotations:
    cni.projectcalico.org/containerID: e29feb1637352db0e21600085f40d16eaf7dc094488846425d69bfec297c64c7
    cni.projectcalico.org/podIP: 192.168.35.167/32
    cni.projectcalico.org/podIPs: 192.168.35.167/32
    image-integrity-validator.wiz.io-0: docker.com/dev/platform/postgres/spilocustom/test/carbonspilo:1.16->sha256:6a5a3ad3b10c80dcba8a6a1df359d67d55fec24c4f183662bfa84e2e3ec9eee7
  creationTimestamp: "2025-04-23T06:43:10Z"
  generateName: pg-pgspilotest3-
  labels:
    application: spilo
    apps.kubernetes.io/pod-index: "0"

Observations

  1. During operator sync (intermittently, not on every sync), the Postgres container restart/rolling update happens after Wiz is enabled in the cluster.

  2. I do not face this issue on operator versions up to 1.12.2, even with Wiz enabled and running; it occurs only on 1.13.0 and 1.14.0.

  3. After disabling Wiz with the 1.13.0/1.14.0 operator, I don't see the issue anymore, so Wiz appears to be the trigger: it seems the operator performs a rolling update/container restart when the pod annotations change. Could you please let me know how to avoid this issue in the latest operator? Thanks in advance.

Ajith61 avatar Apr 23 '25 08:04 Ajith61

Sorry for the late reply. This might have to do with how we compare annotations starting from v1.13.0. If Wiz adds an extra annotation that you want to ignore in the diff, you have to add it to the ignored_annotations config option.
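
For illustration only, a minimal sketch of how that could look with the CRD-based OperatorConfiguration, assuming ignored_annotations sits under configuration.kubernetes and takes a list of annotation keys (the key below is the one from the pod manifest shown above; adjust names to your setup):

apiVersion: "acid.zalan.do/v1"
kind: OperatorConfiguration
metadata:
  name: postgresql-operator-configuration
configuration:
  kubernetes:
    # annotation keys to leave out when diffing running objects against desired ones
    # (placement under configuration.kubernetes is assumed here)
    ignored_annotations:
      - image-integrity-validator.wiz.io-0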

FxKu avatar May 13 '25 09:05 FxKu

@FxKu Thanks for the response.

I added the annotations under the ignored_annotations option in the operator config file as shown below, but it's not working.

I tried both the annotation key alone (image-integrity-validator.wiz.io-0) and the full annotation as well.

The annotation value has a "->" after the image tag, and the ">" gets converted to \u003e when the operator initializes. So I think the annotations are not being ignored because of this string mismatch. Any idea how we can fix this?

(U+003E is the greater-than sign ">".)

ignored_annotations:
  - image-integrity-validator.wiz.io-0
  - "image-integrity-validator.wiz.io-0: docker.com/dev/platform/postgres/spilocustom/test/carbonspilo:1.16->sha256:6a5a3ad3b10c80dcba8a6a1df359d67d55fec24c4f183662bfa84e2e3ec9eee7"

operator log :

time="2025-05-05T09:27:42Z" level=info msg=" "image-integrity-validator.wiz.io-0"," pkg=controller time="2025-05-05T09:27:42Z" level=info msg=" "docker.com/dev/platform/postgres/spilocustom/test/carbonspilo:1.16-\u003esha256:6a5a3ad3b10c80dcba8a6a1df359d67d55fec24c4f183662bfa84e2e3ec9eee7"," pkg=controller

Ajith61 avatar May 16 '25 07:05 Ajith61