noobaa-core
Failover of noobaa-db-pg0 works but fall back leaves pod in CrashLoopback
Environment info
INFO[0000] CLI version: 5.9.0
INFO[0000] noobaa-image: noobaa/noobaa-core:nsfs_backport_5.9-20220331
INFO[0000] operator-image: quay.io/rhceph-dev/odf4-mcg-rhel8-operator@sha256:eb56f52778b71a04b7df62db4ea241c65f84e2b80ddcc9b588414d173d7e086c
INFO[0000] noobaa-db-image: quay.io/rhceph-dev/rhel8-postgresql-12@sha256:be7212e938d1ef314a75aca070c28b6433cd0346704d0d3523c8ef403ff0c69e
OCP 4.9.29
ODF 4.9.6-1 w/ patch 3/31
Actual behavior
The node c83f1-dan3.ocp4.pokprv.stglabs.ibm.com was hosting the pods noobaa-db-pg-0 and noobaa-default-backing-store-noobaa-pod-6b3cc266.
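For reference, pod placement can be confirmed before the test with a wide pod listing (a sketch; the openshift-storage namespace is assumed from the outputs later in this report):

```bash
# Show which node hosts the NooBaa DB and backing-store pods (-o wide adds the NODE column)
oc get pods -n openshift-storage -o wide | grep -E 'noobaa-db-pg|noobaa-default-backing-store'
```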
I ssh'ed into node c83f1-dan3 and ran:
[core@c83f1-dan3 ~]$ date; sudo systemctl stop kubelet ; date
Wed May 4 17:33:25 UTC 2022
Wed May 4 17:33:25 UTC 2022
[core@c83f1-dan3 ~]$
All pods failed over to either dan2 or dan1 after about 7 minutes. The pod noobaa-db-pg-0 failed over to dan2 and noobaa-default-backing-store to dan1. They reach the Running state, and S3, mmdas, and I/O continue to run.
The fall back failed for noobaa-db-pg-0.
[core@c83f1-dan3 ~]$ date; sudo systemctl start kubelet ; date
Wed May 4 17:47:23 UTC 2022
Wed May 4 17:47:31 UTC 2022
The noobaa-default-backing-store pod stays on dan1. The noobaa-db-pg-0 pod goes to CrashLoopBackOff.
NAME                                               READY   STATUS             RESTARTS        AGE     IP             NODE                                     NOMINATED NODE   READINESS GATES
noobaa-core-0                                      1/1     Running            0               4h33m   10.128.2.178   c83f1-dan2.ocp4.pokprv.stglabs.ibm.com   <none>           <none>
noobaa-db-pg-0                                     0/1     CrashLoopBackOff   18 (4m8s ago)   85m     10.128.2.185   c83f1-dan2.ocp4.pokprv.stglabs.ibm.com   <none>           <none>
noobaa-default-backing-store-noobaa-pod-6b3cc266   1/1     Running            4 (72m ago)     85m     10.128.5.242   c83f1-dan1.ocp4.pokprv.stglabs.ibm.com   <none>           <none>
noobaa-endpoint-6546dfb677-4r6gq                   1/1     Running            0               4h33m   10.128.2.177   c83f1-dan2.ocp4.pokprv.stglabs.ibm.com   <none>           <none>
noobaa-endpoint-6546dfb677-8n7qp                   1/1     Running            0               4h33m   10.128.5.208   c83f1-dan1.ocp4.pokprv.stglabs.ibm.com   <none>           <none>
noobaa-endpoint-6546dfb677-9np5x                   1/1     Running            0               71m     10.128.1.47    c83f1-dan3.ocp4.pokprv.stglabs.ibm.com   <none>           <none>
noobaa-endpoint-6546dfb677-hh4wm                   1/1     Running            1 (71m ago)     71m     10.128.1.46    c83f1-dan3.ocp4.pokprv.stglabs.ibm.com   <none>           <none>
noobaa-endpoint-6546dfb677-jxt6q                   1/1     Running            0               71m     10.128.1.48    c83f1-dan3.ocp4.pokprv.stglabs.ibm.com   <none>           <none>
Describe of pod shows:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 24m default-scheduler Successfully assigned openshift-storage/noobaa-db-pg-0 to c83f1-dan2.ocp4.pokprv.stglabs.ibm.com
Warning FailedAttachVolume 24m attachdetach-controller Multi-Attach error for volume "pvc-985f5aa7-967b-4bcc-9269-05eb17240b80" Volume is already exclusively attached to one node and can't be attached to another
Warning FailedMount 22m kubelet Unable to attach or mount volumes: unmounted volumes=[db], unattached volumes=[noobaa-postgres-initdb-sh-volume noobaa-postgres-config-volume db kube-api-access-k6mpn]: timed out waiting for the condition
Warning FailedMount 20m kubelet Unable to attach or mount volumes: unmounted volumes=[db], unattached volumes=[db kube-api-access-k6mpn noobaa-postgres-initdb-sh-volume noobaa-postgres-config-volume]: timed out waiting for the condition
Normal SuccessfulAttachVolume 18m attachdetach-controller AttachVolume.Attach succeeded for volume "pvc-985f5aa7-967b-4bcc-9269-05eb17240b80"
Normal Pulled 18m kubelet Container image "noobaa/noobaa-core:nsfs_backport_5.9-20220331" already present on machine
Normal AddedInterface 18m multus Add eth0 [10.128.2.185/23] from openshift-sdn
Normal Created 18m kubelet Created container init
Normal Pulled 18m kubelet Container image "quay.io/rhceph-dev/rhel8-postgresql-12@sha256:be7212e938d1ef314a75aca070c28b6433cd0346704d0d3523c8ef403ff0c69e" already present on machine
Normal Started 18m kubelet Started container init
Normal Created 18m kubelet Created container initialize-database
Normal Started 18m kubelet Started container initialize-database
Normal Created 9m31s (x4 over 18m) kubelet Created container db
Normal Started 9m31s (x4 over 18m) kubelet Started container db
Normal Pulled 7m23s (x6 over 18m) kubelet Container image "quay.io/rhceph-dev/rhel8-postgresql-12@sha256:be7212e938d1ef314a75aca070c28b6433cd0346704d0d3523c8ef403ff0c69e" already present on machine
Warning BackOff 2m27s (x36 over 10m) kubelet Back-off restarting failed container
[root@c83f1-infa ~]#
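The events above and the logs below were collected with standard commands, roughly as follows (a sketch; `--previous` shows the log of the last crashed container instance, i.e. the noobaa-db-pg-0-db-previous.log referenced later in this thread):

```bash
# Events for the crash-looping DB pod
oc describe pod noobaa-db-pg-0 -n openshift-storage

# Logs of the current container and of the previously crashed instance
oc logs noobaa-db-pg-0 -n openshift-storage
oc logs noobaa-db-pg-0 -n openshift-storage --previous
```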
Logs on the failing pod:
oc logs noobaa-db-pg-0
pg_ctl: another server might be running; trying to start server anyway
waiting for server to start....2022-05-04 18:04:22.856 UTC [23] LOG: starting PostgreSQL 12.9 on x86_64-redhat-linux-gnu, compiled by gcc (GCC) 8.5.0 20210514 (Red Hat 8.5.0-4), 64-bit
2022-05-04 18:04:22.857 UTC [23] LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2022-05-04 18:04:22.859 UTC [23] LOG: listening on Unix socket "/tmp/.s.PGSQL.5432"
2022-05-04 18:04:23.043 UTC [23] LOG: redirecting log output to logging collector process
2022-05-04 18:04:23.043 UTC [23] HINT: Future log output will appear in directory "log".
done
server started
/var/run/postgresql:5432 - accepting connections
=> sourcing /usr/share/container-scripts/postgresql/start/set_passwords.sh ...
ERROR: tuple concurrently updated
It does not recover on its own. The only way to recover is to delete the s3 service. All accounts and exports are lost unless a backup was taken prior to this event.
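A DB dump taken before a failover test would at least preserve accounts and exports. A minimal sketch, assuming `pg_dump` is available in the DB container; the user and database names below are assumptions and should be checked against the pod's POSTGRESQL_* environment variables first:

```bash
# Check the configured user/database names first (assumption: they are exposed as env vars)
oc exec -n openshift-storage noobaa-db-pg-0 -- env | grep POSTGRESQL

# Dump the NooBaa database out of the running pod into a local file
# (user "postgres" and database "nbcore" are assumptions; adjust to your deployment)
oc exec -n openshift-storage noobaa-db-pg-0 -- pg_dump -U postgres nbcore > nbcore-backup.sql
```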
Noobaa diagnose and must-gather are in box note and labeled with this defect number https://ibm.ent.box.com/folder/145794528783?s=uueh7fp424vxs2bt4ndrnvh7uusgu6tocd
Not sure if the defect I am reporting is related in any way to https://github.com/sclorg/postgresql-container/issues/166.
Expected behavior
- Fall back should occur.
Steps to reproduce
Listed above
More information - Screenshots / Logs / Other output
This will be a blocker to our testing because I am assigned failover/failback testing.
Romy debugged this issue on another cluster and found a related link: https://github.com/sclorg/postgresql-container/issues/166
@MonicaLemay , we shouldn't do kubelet stop-and-start type of testing. There are known defects in CSI which are not fixed for our MVP.
Please read the E2E issue on this for reference --> https://github.com/IBM/ibm-spectrum-scale-csi/issues/563 Also look at the last 10-15 comments of Noobaa defect 6870.
Will share with you a list of defects which you can go through for this Failover/Failback.
The Failover/Failback scenario has to be done via the node power on/off method only.
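For reference, the power on/off method would be along these lines (a sketch; the BMC address and credentials are placeholders):

```bash
# Power-cycle the node hosting noobaa-db-pg-0 via its BMC
ipmitool -I lanplus -H <bmc-ip> -U <bmc-user> -P <bmc-password> chassis power off
# ...wait for the pods to fail over, then power the node back on
ipmitool -I lanplus -H <bmc-ip> -U <bmc-user> -P <bmc-password> chassis power on
```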
I was NOT able to recreate this with ipmitool power off. However, the same conditions did not exist, though that might not matter. In my test listed above, both noobaa-db-pg-0 and the noobaa-default-backing-store-noobaa-pod were hosted by the same node. I don't have that condition any more.
@MonicaLemay , I don't think the backing store will matter if this is done via ipmitool. Yes, it does matter if it is done with the kubectl command, as mentioned, because the CSI driver volume comes into play for the PVC to be Active. However, there is no real use for us w/r/t the backing store, as Nimrod did say it is planned not to be there in future ODF releases, per a discussion a few months back.
Also, if you want to get the pods on the same node, you could try deleting one of them a few times and they might land on the same node. This is what was tried to get the noobaa-core and noobaa-db pods on the same node for Failover/Failback testing.
If this is not recreatable when we do ipmitool power off then power on, AND is only recreatable with systemctl stop kubelet, then this might not be a sev1. Can someone weigh in on this so that we can better understand our priorities?
@MonicaLemay, how are you?
Regarding the noobaa-db-pg-0 db container producing ERROR: tuple concurrently updated
According to noobaa-db-pg-0-db.log and noobaa-db-pg-0-db-previous.log, the startup script warns about pg_ctl: another server might be running; trying to start server anyway.
To troubleshoot, could you please try the manual DB volume recovery procedure described here?
Could you let me know if you see the Postgres server starting normally, after scaling the DB stateful set replicas back to 1 at the end of that procedure?
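The referenced procedure is not reproduced here; the scale-down/scale-up step at its end would look roughly like this (a sketch, assuming the stateful set is named noobaa-db-pg, as suggested by the pod name):

```bash
# Scale the DB stateful set down, perform the manual volume recovery, then scale back up
oc scale statefulset noobaa-db-pg -n openshift-storage --replicas=0
# ...manual DB volume recovery happens here...
oc scale statefulset noobaa-db-pg -n openshift-storage --replicas=1

# Watch the DB pod come back up (compare with the transition shown below)
oc get pods -n openshift-storage -w | grep noobaa-db-pg
```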
noobaa-db-pg-0 0/1 Pending 0 0s
noobaa-db-pg-0 0/1 Init:0/2 0 0s
noobaa-db-pg-0 0/1 Init:1/2 0 1s
noobaa-db-pg-0 0/1 PodInitializing 0 4s
noobaa-db-pg-0 1/1 Running 0 5s
Hi @MonicaLemay , Please confirm if this is a Non-GA blocker. Thanks
@akmithal This is not a GA blocker.
@NeeshaPan , please provide the steps that were tried. If this is still a Postgres-related issue, the Noobaa team needs to analyze it.
We followed the link below to get out of the CrashLoopBackOff state: https://serverfault.com/questions/942743/postgres-crash-loop-caused-by-a-tuple-concurrently-updated-error
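For reference, the general shape of that kind of manual recovery is roughly the following (a sketch only, not the exact steps from the link; the patch just keeps the DB container alive without its normal startup script so Postgres can be started and inspected by hand, and the NooBaa operator may reconcile the patch back):

```bash
# Sketch only: keep the DB container alive without its normal entrypoint.
# If the container spec has no explicit "command", use the "add" op instead of "replace".
oc patch statefulset noobaa-db-pg -n openshift-storage --type=json \
  -p='[{"op":"replace","path":"/spec/template/spec/containers/0/command","value":["sleep","infinity"]}]'
oc delete pod noobaa-db-pg-0 -n openshift-storage    # the stateful set recreates it with the override

# Start and inspect Postgres by hand against the data directory on the PV
oc exec -it noobaa-db-pg-0 -n openshift-storage -- bash
#   pg_ctl start -D "$PGDATA"
#   ...inspect / repair, stop the server, then exit...

# Revert the override so the normal entrypoint runs again
# (if the original spec defined a command, restore that value instead of removing it)
oc patch statefulset noobaa-db-pg -n openshift-storage --type=json \
  -p='[{"op":"remove","path":"/spec/template/spec/containers/0/command"}]'
```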
After bringing the noobaa-db pod up using the above link, mmdas commands / REST APIs are not working as expected. We are hitting the error "something went wrong" when trying to run any of the mmdas commands.
@NeeshaPan , when you try next time, capture the steps relevant for the noobaa pods: on Fyre, bring the service down on the node running the noobaa-db pod, and update whether the same behavior is observed.
I tried failover/failback of the noobaa-core and noobaa-db pods on the same node, and also failover/failback of the noobaa-db pod only, on a Fyre cluster. However, I am not able to reproduce this issue on the Fyre cluster.
@baum , @NeeshaPan will try on the physical system and update with results, as per our discussion last week. The system has not been functioning as expected for a week or so.
Will try the scenario on BM once it comes up, as it is a compact cluster. FYI @rkomandu @romayalon
@romayalon I am able to reproduce this issue on BM. It is easily reproducible on BM systems. Attached a file that has the output of pods & logs and the procedure followed for FOFB: OutputsCapturedDuringFOFB.txt
FYI @rkomandu
@baum please see last comment
This was discussed with @baum and the pointer is towards access to the Postgres DB from two different nodes. Will need to discuss with the CSI team.
@MonicaLemay Could you please update the summary of the defect? We just need to replace "fall back" with "failback".
Replace
Failover of noobaa-db-pg0 works but fall back leaves pod in CrashLoopback
to
Failover of noobaa-db-pg0 works but failback leaves pod in CrashLoopback
FYI @rkomandu
@MonicaLemay , we just thought that "failback" is more appropriate than "fall back". Do change it only if you agree.
@baum, we had a discussion with Deepak (IBM CSI team); there are a few questions we need to get clarity on in order to make progress on this behavior:
- Why was noobaa db restarted when the kubelet on the hpo-app7 node was started back up?
- Why is noobaa db in CrashLoopBackOff? You mentioned that it is because of two DB writers on different nodes. Other than the noobaa db pod, no other entity such as the CSI driver or Spectrum Scale writes anything to the volume. So the question is: when the kubelet service is down, is the old noobaa db pod on the hpo-app7 node, which should have been cleaned up by k8s, causing that write when the kubelet on hpo-app7 is started back up? I think we need a kubernetes/kubelet expert to help here.
- I also noticed in one of the comments that this is only reproducible on a compact cluster, so we should check with an expert whether shutting down one master node causes any interference in the cleanup process.
@nimrod-becker , @baum we need the Noobaa team to check with the RH team (k8s experts), if required, on the above questions about what happens when the kubelet service is down on a compact cluster.
Note: I discussed the same with Shane Heroux earlier; posted for reference. Could Alex discuss with Shane and get the ball rolling?
Email reply from Shane around 16th June 2022
Hi Ravi,
How much impact and what side-effects are there if the "kubelet service" is stopped on a compact cluster?
At a high level, stopping the kubelet service on a compact cluster as you described has at least the same impact as taking a master out of the cluster with the addition of whatever workloads are scheduled.
Did any customer ever stop the kubelet service on a master node (traditional 3M+3W)? If yes, under what circumstances would the service be stopped?
I'm not sure how to answer the first part of this question, but the service is stopped and masters are rolled out and back into clusters (by way of a reboot) every time they update to a new minor version. Done properly, a cluster can tolerate a single master being removed at any one time, but you would only want to do this when strictly necessary.
There are many resources describing the exact process, but you'd likely want a backup of etcd and cordon / drain the node in question first.
If you need a more specific / deeper answer, I will need more context around what you're trying to accomplish so I can direct the question better.
Hello @rkomandu 🖖
- The DB pod is scheduled by the corresponding stateful set. Once a node becomes unavailable according to the K8S API server, the NooBaa operator HA controller explicitly deletes the pod on the failing node, causing the DB stateful set to reschedule the DB pod on a healthy node. The NooBaa operator is not involved in scheduling "when kubelet on hpo-app7 node was started back"; this is all normal stateful set behavior. Just a theory: restarting the kubelet would allow the API server to communicate again with the node previously considered unavailable, and a stateful set reconciliation takes place. It is worth considering that stopping the kubelet or explicitly deleting a pod does not actually stop the pod's container process, so once kubelet communication is restored, it is possible the stateful set discovers too many pods running (> replicas), causing reconciliation.
- The Postgres process crashes with "ERROR: tuple concurrently updated". This would happen if the DB persistent volume is accessed by several Postgres processes concurrently. Considering the test performed: after the kubelet is stopped, the Postgres on that node continues to run, potentially writing into the persistent volume. Then a new Postgres is scheduled on a healthy node, also potentially writing into the same persistent volume (see the sketch after this list for one way to check for that orphan process). Question: what happens to the persistent volume changes on the node where the kubelet was stopped, while the kubelet is not running? Are those changes "merged" with the changes on the healthy node? Or are those changes discarded once communication with the node is restored?
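One way to check for such an orphan Postgres process while the kubelet is stopped on the original node (a sketch; the first two commands run on that node over SSH):

```bash
# On the node where the kubelet was stopped: the container runtime keeps running,
# so the old DB container (and its postgres processes) may still be alive
sudo crictl ps | grep -i postgres
ps -ef | grep -i '[p]ostgres'

# From a working client: confirm where the API currently schedules the DB pod
oc get pods -n openshift-storage -o wide | grep noobaa-db-pg
```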
@baum , thank you for your explanation. As mentioned by the CSI team, the volume has an RWO access policy, so it can't be accessed from multiple nodes at a time.
Will ask Deepak from the CSI team to have a look at your post above and respond.
Hi @rkomandu, as I mentioned earlier, k8s enforces the RWO policy; it means there is a chance in such a scenario that an orphan pod on the node where the kubelet is being restarted can still access data, since kubelet communication was down and cleanup did not happen. I think here we need to understand the impact of shutting down/starting the kubelet on a master, the reconciliation process, and the possibility of an orphan pod causing data corruption. I think @baum is also hinting at the same.
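A way to see which node currently holds the RWO attachment is to check the VolumeAttachment object for the DB volume (a sketch; the PV name is taken from the Multi-Attach event above, and the PVC name db-noobaa-db-pg-0 is an assumption based on the stateful set's "db" volume claim):

```bash
# Find the PV bound to the DB PVC, then see which node it is currently attached to
oc get pvc db-noobaa-db-pg-0 -n openshift-storage
oc get volumeattachments | grep pvc-985f5aa7-967b-4bcc-9269-05eb17240b80
```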
Hi @baum , as per the email communication with Shane from RH, has any progress been made on further investigating this behavior?
@romayalon / @nimrod-becker Please add the NS-FS label here.
@baum
Please look at https://bugzilla.redhat.com/show_bug.cgi?id=2010702 (this is behavior similar to what we are discussing here on the noobaa db/PVC).
@NeeshaPan @romayalon FYI
The BZ stated above was closed as not a bug (not the tracker, but the actual BZ it tracked...)
Status: CLOSED → NEW
Resolution: NOTABUG → ---
CC: [email protected]
Keywords: Reopened