noobaa-core
Failover of noobaa-db-pg0 works but fall back leaves pod in CrashLoopback
Environment info
INFO[0000] CLI version: 5.9.0
INFO[0000] noobaa-image: noobaa/noobaa-core:nsfs_backport_5.9-20220331
INFO[0000] operator-image: quay.io/rhceph-dev/odf4-mcg-rhel8-operator@sha256:eb56f52778b71a04b7df62db4ea241c65f84e2b80ddcc9b588414d173d7e086c
INFO[0000] noobaa-db-image: quay.io/rhceph-dev/rhel8-postgresql-12@sha256:be7212e938d1ef314a75aca070c28b6433cd0346704d0d3523c8ef403ff0c69e
OCP 4.9.29
ODF 4.9.6-1 w/ patch 3/31
Actual behavior
The node c83f1-dan3.ocp4.pokprv.stglabs.ibm.com was hosting the pods noobaa-db-pg-0 and noobaa-default-backing-store-noobaa-pod-6b3cc266.
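For reference, pod placement can be confirmed before the test with a wide pod listing (a sketch; the openshift-storage namespace is assumed from the outputs later in this report):

```bash
# Show which node hosts the NooBaa DB and backing-store pods (-o wide adds the NODE column)
oc get pods -n openshift-storage -o wide | grep -E 'noobaa-db-pg|noobaa-default-backing-store'
```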
I ssh'ed into node c83f1-dan3 and ran:
[core@c83f1-dan3 ~]$ date; sudo systemctl stop kubelet ; date
Wed May 4 17:33:25 UTC 2022
Wed May 4 17:33:25 UTC 2022
[core@c83f1-dan3 ~]$
All pods failed over to either dan2 or dan1 after about 7 minutes. The pod noobaa-db-pg-0 failed over to dan2 and noobaa-default-backing-store to dan1. They reach the Running state, and S3, mmdas, and I/O continue to run.
The fall back failed for noobaa-db-pg-0.
[core@c83f1-dan3 ~]$ date; sudo systemctl start kubelet ; date
Wed May 4 17:47:23 UTC 2022
Wed May 4 17:47:31 UTC 2022
The noobaa-default-backing-store pod stays on dan1. The noobaa-db-pg-0 pod goes to CrashLoopBackOff.
NAME                                               READY   STATUS             RESTARTS        AGE     IP             NODE                                     NOMINATED NODE   READINESS GATES
noobaa-core-0                                      1/1     Running            0               4h33m   10.128.2.178   c83f1-dan2.ocp4.pokprv.stglabs.ibm.com   <none>           <none>
noobaa-db-pg-0                                     0/1     CrashLoopBackOff   18 (4m8s ago)   85m     10.128.2.185   c83f1-dan2.ocp4.pokprv.stglabs.ibm.com   <none>           <none>
noobaa-default-backing-store-noobaa-pod-6b3cc266   1/1     Running            4 (72m ago)     85m     10.128.5.242   c83f1-dan1.ocp4.pokprv.stglabs.ibm.com   <none>           <none>
noobaa-endpoint-6546dfb677-4r6gq                   1/1     Running            0               4h33m   10.128.2.177   c83f1-dan2.ocp4.pokprv.stglabs.ibm.com   <none>           <none>
noobaa-endpoint-6546dfb677-8n7qp                   1/1     Running            0               4h33m   10.128.5.208   c83f1-dan1.ocp4.pokprv.stglabs.ibm.com   <none>           <none>
noobaa-endpoint-6546dfb677-9np5x                   1/1     Running            0               71m     10.128.1.47    c83f1-dan3.ocp4.pokprv.stglabs.ibm.com   <none>           <none>
noobaa-endpoint-6546dfb677-hh4wm                   1/1     Running            1 (71m ago)     71m     10.128.1.46    c83f1-dan3.ocp4.pokprv.stglabs.ibm.com   <none>           <none>
noobaa-endpoint-6546dfb677-jxt6q                   1/1     Running            0               71m     10.128.1.48    c83f1-dan3.ocp4.pokprv.stglabs.ibm.com   <none>           <none>
Describe of pod shows:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 24m default-scheduler Successfully assigned openshift-storage/noobaa-db-pg-0 to c83f1-dan2.ocp4.pokprv.stglabs.ibm.com
Warning FailedAttachVolume 24m attachdetach-controller Multi-Attach error for volume "pvc-985f5aa7-967b-4bcc-9269-05eb17240b80" Volume is already exclusively attached to one node and can't be attached to another
Warning FailedMount 22m kubelet Unable to attach or mount volumes: unmounted volumes=[db], unattached volumes=[noobaa-postgres-initdb-sh-volume noobaa-postgres-config-volume db kube-api-access-k6mpn]: timed out waiting for the condition
Warning FailedMount 20m kubelet Unable to attach or mount volumes: unmounted volumes=[db], unattached volumes=[db kube-api-access-k6mpn noobaa-postgres-initdb-sh-volume noobaa-postgres-config-volume]: timed out waiting for the condition
Normal SuccessfulAttachVolume 18m attachdetach-controller AttachVolume.Attach succeeded for volume "pvc-985f5aa7-967b-4bcc-9269-05eb17240b80"
Normal Pulled 18m kubelet Container image "noobaa/noobaa-core:nsfs_backport_5.9-20220331" already present on machine
Normal AddedInterface 18m multus Add eth0 [10.128.2.185/23] from openshift-sdn
Normal Created 18m kubelet Created container init
Normal Pulled 18m kubelet Container image "quay.io/rhceph-dev/rhel8-postgresql-12@sha256:be7212e938d1ef314a75aca070c28b6433cd0346704d0d3523c8ef403ff0c69e" already present on machine
Normal Started 18m kubelet Started container init
Normal Created 18m kubelet Created container initialize-database
Normal Started 18m kubelet Started container initialize-database
Normal Created 9m31s (x4 over 18m) kubelet Created container db
Normal Started 9m31s (x4 over 18m) kubelet Started container db
Normal Pulled 7m23s (x6 over 18m) kubelet Container image "quay.io/rhceph-dev/rhel8-postgresql-12@sha256:be7212e938d1ef314a75aca070c28b6433cd0346704d0d3523c8ef403ff0c69e" already present on machine
Warning BackOff 2m27s (x36 over 10m) kubelet Back-off restarting failed container
[root@c83f1-infa ~]#
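The events above and the logs below were collected with standard commands, roughly as follows (a sketch; `--previous` shows the log of the last crashed container instance, i.e. the noobaa-db-pg-0-db-previous.log referenced later in this thread):

```bash
# Events for the crash-looping DB pod
oc describe pod noobaa-db-pg-0 -n openshift-storage

# Logs of the current container and of the previously crashed instance
oc logs noobaa-db-pg-0 -n openshift-storage
oc logs noobaa-db-pg-0 -n openshift-storage --previous
```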
Logs on the failing pod:
oc logs noobaa-db-pg-0
pg_ctl: another server might be running; trying to start server anyway
waiting for server to start....2022-05-04 18:04:22.856 UTC [23] LOG: starting PostgreSQL 12.9 on x86_64-redhat-linux-gnu, compiled by gcc (GCC) 8.5.0 20210514 (Red Hat 8.5.0-4), 64-bit
2022-05-04 18:04:22.857 UTC [23] LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2022-05-04 18:04:22.859 UTC [23] LOG: listening on Unix socket "/tmp/.s.PGSQL.5432"
2022-05-04 18:04:23.043 UTC [23] LOG: redirecting log output to logging collector process
2022-05-04 18:04:23.043 UTC [23] HINT: Future log output will appear in directory "log".
done
server started
/var/run/postgresql:5432 - accepting connections
=> sourcing /usr/share/container-scripts/postgresql/start/set_passwords.sh ...
ERROR: tuple concurrently updated
It does not recover on its own. The only way to recover is to delete the s3 service. All accounts and exports are lost unless a backup was taken prior to this event.
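A DB dump taken before a failover test would at least preserve accounts and exports. A minimal sketch, assuming `pg_dump` is available in the DB container; the user and database names below are assumptions and should be checked against the pod's POSTGRESQL_* environment variables first:

```bash
# Check the configured user/database names first (assumption: they are exposed as env vars)
oc exec -n openshift-storage noobaa-db-pg-0 -- env | grep POSTGRESQL

# Dump the NooBaa database out of the running pod into a local file
# (user "postgres" and database "nbcore" are assumptions; adjust to your deployment)
oc exec -n openshift-storage noobaa-db-pg-0 -- pg_dump -U postgres nbcore > nbcore-backup.sql
```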
Noobaa diagnose and must-gather are in box note and labeled with this defect number https://ibm.ent.box.com/folder/145794528783?s=uueh7fp424vxs2bt4ndrnvh7uusgu6tocd
Not sure if the defect I am reporting is related in any way to https://github.com/sclorg/postgresql-container/issues/166.
Expected behavior
- Fall back should occur.
Steps to reproduce
Listed above
More information - Screenshots / Logs / Other output
This will be a blocker to our testing because I am assigned failover/failback testing.
Romy debugged this issue on another cluster and found a related link: https://github.com/sclorg/postgresql-container/issues/166
@MonicaLemay , we shouldn't do kubelet stop-and-start type of testing. There are known defects in CSI which are not fixed for our MVP.
Please read the E2E issue on this for reference --> https://github.com/IBM/ibm-spectrum-scale-csi/issues/563 Also look at the last 10-15 comments of Noobaa defect 6870.
Will share with you a list of defects which you can go through for this Failover/Failback.
The Failover/Failback scenario has to be done via the node power on/off method only.
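For reference, the power on/off method would be along these lines (a sketch; the BMC address and credentials are placeholders):

```bash
# Power-cycle the node hosting noobaa-db-pg-0 via its BMC
ipmitool -I lanplus -H <bmc-ip> -U <bmc-user> -P <bmc-password> chassis power off
# ...wait for the pods to fail over, then power the node back on
ipmitool -I lanplus -H <bmc-ip> -U <bmc-user> -P <bmc-password> chassis power on
```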
I was NOT able to recreate this with ipmitool power off. However, the same conditions did not exist, though that might not matter. In my test listed above, both noobaa-db-pg-0 and the noobaa-default-backing-store-noobaa-pod were hosted by the same node. I don't have that condition any more.
@MonicaLemay , I don't think the backing store will matter if this is done via ipmitool. Yes, it does matter if it is done with the kubectl command, as mentioned, because the CSI driver volume comes into play for the PVC to be Active. However, there is no real use for us w/r/t the backing store, as Nimrod did say it is planned not to be there in future ODF releases, per a discussion a few months back.
Also, if you want to get the pods on the same node, you could try deleting one of them a few times and they might land on the same node. This is what was tried to get the noobaa-core and noobaa-db pods on the same node for Failover/Failback testing.
If this is not recreatable when we do ipmitool power off then power on, AND is only recreatable with systemctl stop kubelet, then this might not be a sev1. Can someone weigh in on this so that we can better understand our priorities?
@MonicaLemay, how are you?
Regarding the noobaa-db-pg-0 db container producing ERROR: tuple concurrently updated
According to noobaa-db-pg-0-db.log and noobaa-db-pg-0-db-previous.log, the startup script warns about pg_ctl: another server might be running; trying to start server anyway.
To troubleshoot, could you please try the manual DB volume recovery procedure described here?
Could you let me know if you see the Postgres server starting normally, after scaling the DB stateful set replicas back to 1 at the end of that procedure?
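The referenced procedure is not reproduced here; the scale-down/scale-up step at its end would look roughly like this (a sketch, assuming the stateful set is named noobaa-db-pg, as suggested by the pod name):

```bash
# Scale the DB stateful set down, perform the manual volume recovery, then scale back up
oc scale statefulset noobaa-db-pg -n openshift-storage --replicas=0
# ...manual DB volume recovery happens here...
oc scale statefulset noobaa-db-pg -n openshift-storage --replicas=1

# Watch the DB pod come back up (compare with the transition shown below)
oc get pods -n openshift-storage -w | grep noobaa-db-pg
```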
noobaa-db-pg-0 0/1 Pending 0 0s
noobaa-db-pg-0 0/1 Init:0/2 0 0s
noobaa-db-pg-0 0/1 Init:1/2 0 1s
noobaa-db-pg-0 0/1 PodInitializing 0 4s
noobaa-db-pg-0 1/1 Running 0 5s
Hi @MonicaLemay , Please confirm if this is a Non-GA blocker. Thanks
@akmithal This is not a GA blocker.
@NeeshaPan , please provide the steps that were tried. If this is still a Postgres-related issue, the Noobaa team needs to analyze it.
We followed the link below to get out of the CrashLoopBackOff state: https://serverfault.com/questions/942743/postgres-crash-loop-caused-by-a-tuple-concurrently-updated-error
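For reference, the general shape of that kind of manual recovery is roughly the following (a sketch only, not the exact steps from the link; the patch just keeps the DB container alive without its normal startup script so Postgres can be started and inspected by hand, and the NooBaa operator may reconcile the patch back):

```bash
# Sketch only: keep the DB container alive without its normal entrypoint.
# If the container spec has no explicit "command", use the "add" op instead of "replace".
oc patch statefulset noobaa-db-pg -n openshift-storage --type=json \
  -p='[{"op":"replace","path":"/spec/template/spec/containers/0/command","value":["sleep","infinity"]}]'
oc delete pod noobaa-db-pg-0 -n openshift-storage    # the stateful set recreates it with the override

# Start and inspect Postgres by hand against the data directory on the PV
oc exec -it noobaa-db-pg-0 -n openshift-storage -- bash
#   pg_ctl start -D "$PGDATA"
#   ...inspect / repair, stop the server, then exit...

# Revert the override so the normal entrypoint runs again
# (if the original spec defined a command, restore that value instead of removing it)
oc patch statefulset noobaa-db-pg -n openshift-storage --type=json \
  -p='[{"op":"remove","path":"/spec/template/spec/containers/0/command"}]'
```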
After bringing the noobaa-db pod up using the above link, mmdas commands / REST APIs are not working as expected. We are hitting the error "something went wrong" when trying to run any of the mmdas commands.
@NeeshaPan , when you try next time, capture the steps relevant for the noobaa pods: on Fyre, bring the service down on the node running the noobaa-db pod, and update whether the same behavior is observed.
I tried failover/failback of the noobaa-core and noobaa-db pods on the same node, and also failover/failback of the noobaa-db pod only, on a Fyre cluster. However, I am not able to reproduce this issue on the Fyre cluster.
@baum , @NeeshaPan will try on the physical system and update with results, as per our discussion last week. The system has not been functioning as expected for a week or so.
Will try the scenario on BM once it comes up, as it is a compact cluster. FYI @rkomandu @romayalon
@romayalon I am able to reproduce this issue on BM. It is easily reproducible on BM systems. Attached a file that has the output of pods & logs and the procedure followed for FOFB: OutputsCapturedDuringFOFB.txt
FYI @rkomandu
@baum please see last comment
This was discussed with @baum and the pointer is towards access to the Postgres DB from two different nodes. Will need to discuss with the CSI team.
@MonicaLemay Could you please update the summary of the defect? We just need to replace "fall back" with "failback".
Replace
Failover of noobaa-db-pg0 works but fall back leaves pod in CrashLoopback
to
Failover of noobaa-db-pg0 works but failback leaves pod in CrashLoopback
FYI @rkomandu
@MonicaLemay , we just thought that "failback" is more appropriate than "fall back". Do change it only if you agree.
@baum, we had a discussion with Deepak (IBM CSI team); there are a few questions we need to get clarity on in order to make progress on this behavior:
- Why was noobaa db restarted when the kubelet on the hpo-app7 node was started back up?
- Why is noobaa db in CrashLoopBackOff? You mentioned that it is because of two DB writers on different nodes. Other than the noobaa db pod, no other entity such as the CSI driver or Spectrum Scale writes anything to the volume. So the question is: when the kubelet service is down, is the old noobaa db pod on the hpo-app7 node, which should have been cleaned up by k8s, causing that write when the kubelet on hpo-app7 is started back up? I think we need a kubernetes/kubelet expert to help here.
- I also noticed in one of the comments that this is only reproducible on a compact cluster, so we should check with an expert whether shutting down one master node causes any interference in the cleanup process.
@nimrod-becker , @baum we need the Noobaa team to check with the RH team (k8s experts), if required, on the above questions about what happens when the kubelet service is down on a compact cluster.
Note: I discussed the same with Shane Heroux earlier; posted for reference. Could Alex discuss with Shane and get the ball rolling?
Email reply from Shane around 16th June 2022
Hi Ravi,
How much impact and what side-effects are there if the "kubelet service" is stopped on a compact cluster?
At a high level, stopping the kubelet service on a compact cluster as you described has at least the same impact as taking a master out of the cluster with the addition of whatever workloads are scheduled.
Did any customer ever stop the kubelet service on a master node (traditional 3M+3W)? If yes, under what circumstances would the service be stopped?
I'm not sure how to answer the first part of this question, but the service is stopped and masters are rolled out and back into clusters (by way of a reboot) every time they update to a new minor version. Done properly, a cluster can tolerate a single master being removed at any one time, but you would only want to do this when strictly necessary.
There are many resources describing the exact process, but you'd likely want a backup of etcd and cordon / drain the node in question first.
If you need a more specific / deeper answer, I will need more context around what you're trying to accomplish so I can direct the question better.
Hello @rkomandu 🖖
- The DB pod is scheduled by the corresponding stateful set. Once a node becomes unavailable according to the K8S API server, the NooBaa operator HA controller explicitly deletes the pod on the failing node, causing the DB stateful set to reschedule the DB pod on a healthy node. The NooBaa operator is not involved in scheduling "when kubelet on hpo-app7 node was started back"; this is all normal stateful set behavior. Just a theory: restarting the kubelet would allow the API server to communicate again with the node previously considered unavailable, and a stateful set reconciliation takes place. It is worth considering that stopping the kubelet or explicitly deleting a pod does not actually stop the pod's container process, so once kubelet communication is restored, it is possible the stateful set discovers too many pods running (> replicas), causing reconciliation.
- The Postgres process crashes with "ERROR: tuple concurrently updated". This would happen if the DB persistent volume is accessed by several Postgres processes concurrently. Considering the test performed: after the kubelet is stopped, the Postgres on that node continues to run, potentially writing into the persistent volume. Then a new Postgres is scheduled on a healthy node, also potentially writing into the same persistent volume (see the sketch after this list for one way to check for that orphan process). Question: what happens to the persistent volume changes on the node where the kubelet was stopped, while the kubelet is not running? Are those changes "merged" with the changes on the healthy node? Or are those changes discarded once communication with the node is restored?
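One way to check for such an orphan Postgres process while the kubelet is stopped on the original node (a sketch; the first two commands run on that node over SSH):

```bash
# On the node where the kubelet was stopped: the container runtime keeps running,
# so the old DB container (and its postgres processes) may still be alive
sudo crictl ps | grep -i postgres
ps -ef | grep -i '[p]ostgres'

# From a working client: confirm where the API currently schedules the DB pod
oc get pods -n openshift-storage -o wide | grep noobaa-db-pg
```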
@baum , thank you for your explanation. As mentioned by the CSI team, the volume has an RWO access policy, so it can't be accessed from multiple nodes at a time.
Will ask Deepak from the CSI team to have a look at your post above and respond.
Hi @rkomandu, as I mentioned earlier, k8s enforces the RWO policy; it means there is a chance in such a scenario that an orphan pod on the node where the kubelet is being restarted can still access data, since kubelet communication was down and cleanup did not happen. I think here we need to understand the impact of shutting down/starting the kubelet on a master, the reconciliation process, and the possibility of an orphan pod causing data corruption. I think @baum is also hinting at the same.
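A way to see which node currently holds the RWO attachment is to check the VolumeAttachment object for the DB volume (a sketch; the PV name is taken from the Multi-Attach event above, and the PVC name db-noobaa-db-pg-0 is an assumption based on the stateful set's "db" volume claim):

```bash
# Find the PV bound to the DB PVC, then see which node it is currently attached to
oc get pvc db-noobaa-db-pg-0 -n openshift-storage
oc get volumeattachments | grep pvc-985f5aa7-967b-4bcc-9269-05eb17240b80
```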
Hi @baum , as per the email communication with Shane from RH, has any progress been made on further investigating this behavior?
@romayalon / @nimrod-becker Please add the NS-FS label here.
@baum
Please look at https://bugzilla.redhat.com/show_bug.cgi?id=2010702 (this is behavior similar to what we are discussing here on the noobaa db/PVC).
@NeeshaPan @romayalon FYI
The BZ stated above was closed as not a bug (not the tracker, but the actual BZ it tracked...)
Status: CLOSED → NEW
Resolution: NOTABUG → ---
CC: [email protected]
Keywords: Reopened