Postgres pods restart with network delay
Overview
I am testing how Crunchy Postgres pods react to network delay, using the NetworkChaos tool from Chaos Mesh. The test result shows that the pods restart once the network delay reaches 3 seconds.
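For reference, a NetworkChaos manifest along these lines injects the delay. This is a minimal sketch: the metadata name, namespace, and label selector are assumptions to adapt to your environment (the cluster name pnst is inferred from the pod names below).

apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: pg-delay                  # hypothetical name
  namespace: postgres-operator    # assumption: namespace of the target pods
spec:
  action: delay                   # inject latency rather than loss/corruption
  mode: all                       # apply to every pod matched by the selector
  selector:
    labelSelectors:
      postgres-operator.crunchydata.com/cluster: pnst   # assumption: target pods by cluster label
  delay:
    latency: "3s"                 # the delay used in this test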
Environment
- Platform: OpenShift
- Platform Version: 4.10.32
- PGO Image Tag: ubi8-5.2.0-0
- Postgres Version: ubi8-14.4-0
Steps to reproduce
There is one master pod and one replica pod.
$ oc get pods --selector=postgres-operator.crunchydata.com/instance-set -L postgres-operator.crunchydata.com/role
NAME                    READY   STATUS    RESTARTS       AGE   ROLE
pnst-instance1-fpz6-0   5/5     Running   22 (61m ago)   24h   master
pnst-instance1-rpbq-0   5/5     Running   22 (62m ago)   24h   replica
$ 
Check the settings of the livenessProbe and readinessProbe:
$ oc describe pod pnst-instance1-fpz6-0 | grep Liveness
    Liveness:   http-get https://:8008/liveness delay=3s timeout=5s period=10s #success=1 #failure=3
    Liveness:       exec [pgbackrest server-ping] delay=0s timeout=1s period=10s #success=1 #failure=3
  Warning  Unhealthy  75m (x24 over 7h6m)   kubelet  Liveness probe failed: Get "https://172.17.27.164:8008/liveness": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
$ oc describe pod pnst-instance1-rpbq-0 | grep Liveness
    Liveness:   http-get https://:8008/liveness delay=3s timeout=5s period=10s #success=1 #failure=3
    Liveness:       exec [pgbackrest server-ping] delay=0s timeout=1s period=10s #success=1 #failure=3
  Warning  Unhealthy  68m (x27 over 7h7m)   kubelet  Liveness probe failed: Get "https://172.17.25.75:8008/liveness": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
$ oc describe pod pnst-instance1-fpz6-0 | grep Readiness
    Readiness:  http-get https://:8008/readiness delay=3s timeout=5s period=10s #success=1 #failure=3
  Warning  Unhealthy  142m                  kubelet  Readiness probe failed: Get "https://172.17.27.164:8008/readiness": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
  Warning  Unhealthy  69m (x51 over 7h8m)   kubelet  Readiness probe failed: Get "https://172.17.27.164:8008/readiness": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
$ oc describe pod pnst-instance1-rpbq-0 | grep Readiness
    Readiness:  http-get https://:8008/readiness delay=3s timeout=5s period=10s #success=1 #failure=3
  Warning  Unhealthy  74m (x41 over 7h8m)   kubelet  Readiness probe failed: Get "https://172.17.25.75:8008/readiness": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
$ 
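For reference, with the settings shown above the kubelet restarts the container only after failureThreshold consecutive failed checks, so the expected time from the first failed liveness check to a restart is roughly:

failureThreshold x periodSeconds = 3 x 10s = 30s

Each individual HTTP attempt is abandoned after the 5s timeout and counted as one failure.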
Because the probe timeout is 5s, I started the testing with a network latency of 3 seconds on the network connections of the target Pods. The result: the pods restart with a latency of 3 seconds.
My question is: why do the pods restart with a network latency that is less than the 5-second probe timeout?
Hello @Eric-zch. Thanks for submitting this question. It would also be helpful to see your PostgresCluster manifest.
Regarding your particular test, do I understand correctly that you've configured the NetworkChaos to have a 3s latency?
If so, I would expect some non-negligible amount of latency on top of that, which could be enough to trigger the liveness failures. Have you tested with a lower latency and observed the same issue? Alternatively, have you tried customizing your cluster's liveness probe settings to see if adjusting those would allow the test to pass?
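As a rough back-of-the-envelope (assuming the injected 3s delay is paid on every round trip of the HTTPS probe), the handshakes alone could exceed the 5s budget:

TCP handshake:       >= 1 round trip  >= 3s
TLS handshake:       >= 1 round trip  >= 3s
HTTP GET /liveness:  >= 1 round trip  >= 3s
Total:               >= 9s, well over the 5s probe timeout

On that reading, even a per-packet delay well below the timeout could be enough to fail the probe.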
Hello @tjmoore4
Thanks for your reply.
Yes, we configured NetworkChaos to have a 3s latency.
It seems the liveness probe settings cannot be updated in the Crunchy PostgresCluster configuration. The Kubernetes API documentation says:
"Periodic probe of container liveness. Container will be restarted if the probe fails. Cannot be updated."
Hi @Eric-zch We are currently reassessing older issues to determine their relevance. To better prioritize the features and fixes required by our users for upcoming CPK v5 releases, we are identifying which CPK v5 issues, use cases, and enhancement requests remain valid, especially in the context of the latest CPK v5 release.
As we haven't received any updates on this issue for some time, we are closing it now. If you require further assistance, or if this issue is still relevant to the latest CPK v5 release, please feel free to reopen it or ask a question in our Discord server.
For additional information about Crunchy Postgres for Kubernetes v5, including guidance for upgrading to the latest version of CPK v5, please refer to the latest documentation:
https://access.crunchydata.com/documentation/postgres-operator/latest/