Postgres pods restart with network delay
Overview
I am testing how Crunchy Postgres pods react to network delay, using the NetworkChaos tool from Chaos Mesh. The test result shows that the pods restart once the network delay reaches 3 seconds.
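For reference, a NetworkChaos manifest along these lines injects the delay. This is a minimal sketch: the metadata name, namespace, and label selector are assumptions to adapt to your environment (the cluster name pnst is inferred from the pod names below).

apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: pg-delay                  # hypothetical name
  namespace: postgres-operator    # assumption: namespace of the target pods
spec:
  action: delay                   # inject latency rather than loss/corruption
  mode: all                       # apply to every pod matched by the selector
  selector:
    labelSelectors:
      postgres-operator.crunchydata.com/cluster: pnst   # assumption: target pods by cluster label
  delay:
    latency: "3s"                 # the delay used in this test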
Environment
- Platform: OpenShift
- Platform Version: 4.10.32
- PGO Image Tag: ubi8-5.2.0-0
- Postgres Version: ubi8-14.4-0
Steps to reproduce
There is one master pod and one replica pod.
$ oc get pods --selector=postgres-operator.crunchydata.com/instance-set -L postgres-operator.crunchydata.com/role
NAME                    READY   STATUS    RESTARTS       AGE   ROLE
pnst-instance1-fpz6-0   5/5     Running   22 (61m ago)   24h   master
pnst-instance1-rpbq-0   5/5     Running   22 (62m ago)   24h   replica
$ 
Check the settings of the livenessProbe and readinessProbe:
$ oc describe pod pnst-instance1-fpz6-0 | grep Liveness
    Liveness:   http-get https://:8008/liveness delay=3s timeout=5s period=10s #success=1 #failure=3
    Liveness:       exec [pgbackrest server-ping] delay=0s timeout=1s period=10s #success=1 #failure=3
  Warning  Unhealthy  75m (x24 over 7h6m)   kubelet  Liveness probe failed: Get "https://172.17.27.164:8008/liveness": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
$ oc describe pod pnst-instance1-rpbq-0 | grep Liveness
    Liveness:   http-get https://:8008/liveness delay=3s timeout=5s period=10s #success=1 #failure=3
    Liveness:       exec [pgbackrest server-ping] delay=0s timeout=1s period=10s #success=1 #failure=3
  Warning  Unhealthy  68m (x27 over 7h7m)   kubelet  Liveness probe failed: Get "https://172.17.25.75:8008/liveness": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
$ oc describe pod pnst-instance1-fpz6-0 | grep Readiness
    Readiness:  http-get https://:8008/readiness delay=3s timeout=5s period=10s #success=1 #failure=3
  Warning  Unhealthy  142m                  kubelet  Readiness probe failed: Get "https://172.17.27.164:8008/readiness": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
  Warning  Unhealthy  69m (x51 over 7h8m)   kubelet  Readiness probe failed: Get "https://172.17.27.164:8008/readiness": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
$ oc describe pod pnst-instance1-rpbq-0 | grep Readiness
    Readiness:  http-get https://:8008/readiness delay=3s timeout=5s period=10s #success=1 #failure=3
  Warning  Unhealthy  74m (x41 over 7h8m)   kubelet  Readiness probe failed: Get "https://172.17.25.75:8008/readiness": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
$ 
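For reference, with the settings shown above the kubelet restarts the container only after failureThreshold consecutive failed checks, so the expected time from the first failed liveness check to a restart is roughly:

failureThreshold x periodSeconds = 3 x 10s = 30s

Each individual HTTP attempt is abandoned after the 5s timeout and counted as one failure.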
Because the probe timeout is 5s, I started the testing with a network latency of 3 seconds on the network connections of the target Pods. The result: the pods restart with a latency of 3 seconds.
My question is: why do the pods restart with a network latency that is less than the 5-second probe timeout?
Hello @Eric-zch. Thanks for submitting this question. It would also be helpful to see your PostgresCluster manifest.
Regarding your particular test, do I understand correctly that you've configured the NetworkChaos to have a 3s latency?
If so, I would expect some non-negligible amount of latency on top of that, which could be enough to trigger the liveness failures. Have you tested with a lower latency and observed the same issue? Alternatively, have you tried customizing your cluster's liveness probe settings to see if adjusting those would allow the test to pass?
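As a rough back-of-the-envelope (assuming the injected 3s delay is paid on every round trip of the HTTPS probe), the handshakes alone could exceed the 5s budget:

TCP handshake:       >= 1 round trip  >= 3s
TLS handshake:       >= 1 round trip  >= 3s
HTTP GET /liveness:  >= 1 round trip  >= 3s
Total:               >= 9s, well over the 5s probe timeout

On that reading, even a per-packet delay well below the timeout could be enough to fail the probe.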
Hello @tjmoore4
Thanks for your reply.
Yes, we configured NetworkChaos to have a 3s latency.
It seems the liveness probe settings cannot be updated in the Crunchy PostgresCluster configuration. The Kubernetes API documentation says:
"Periodic probe of container liveness. Container will be restarted if the probe fails. Cannot be updated."
Hi @Eric-zch We are currently reassessing older issues to determine their relevance. To better prioritize the features and fixes required by our users for upcoming CPK v5 releases, we are identifying which CPK v5 issues, use cases, and enhancement requests remain valid, especially in the context of the latest CPK v5 release.
As we haven't received any updates on this issue for some time, we are closing it now. If you require further assistance, or if this issue is still relevant to the latest CPK v5 release, please feel free to reopen it or ask a question in our Discord server.
For additional information about Crunchy Postgres for Kubernetes v5, including guidance for upgrading to the latest version of CPK v5, please refer to the latest documentation:
https://access.crunchydata.com/documentation/postgres-operator/latest/