noobaa-core
HPO 590 follow-up for discussion: noobaa-db-pg-0 does not failover if its host worker node has a network bond that goes down
Environment info
- NooBaa Version:
  INFO[0000] CLI version: 5.9.0
  INFO[0000] noobaa-image: quay.io/rhceph-dev/odf4-mcg-core-rhel8@sha256:ef1dc9679ba33ad449f29ab4930bd8b1e3d717ebb29cca855dab3749dbb6d8e4
  INFO[0000] operator-image: quay.io/rhceph-dev/odf4-mcg-rhel8-operator@sha256:01a31a47a43f01c333981056526317dfec70d1072dbd335c8386e0b3f63ef052
  INFO[0000] noobaa-db-image: quay.io/rhceph-dev/rhel8-postgresql-12@sha256:98990a28bec6aa05b70411ea5bd9c332939aea02d9d61eedf7422a32cfa0be54
  INFO[0000] Namespace: openshift-storage
- Platform:
[root@c83f1-infa ~]# oc get csv
NAME                  DISPLAY                       VERSION   REPLACES              PHASE
mcg-operator.v4.9.5   NooBaa Operator               4.9.5     mcg-operator.v4.9.4   Succeeded
ocs-operator.v4.9.5   OpenShift Container Storage   4.9.5     ocs-operator.v4.9.4   Succeeded
odf-operator.v4.9.5   OpenShift Data Foundation     4.9.5     odf-operator.v4.9.4   Succeeded
Actual behavior
Bug Description
This is a follow-up in noobaa to this HPO defect: https://github.ibm.com/IBMSpectrumScale/hpo-core/issues/590
I am pasting 590 below, but to orient the reader of this noobaa defect, I am first pasting the last comment, from Ulf and Nimrod, who requested that this defect be opened as a place to discuss this type of outage as well as loss of the PVC.
TROPPENS commented 2 days ago
I had a call with Nimrod to discuss resiliency against loss of the high-speed network. Current NooBaa does not have any resiliency against loss of the high-speed network or loss of the PVC. He suggested creating a bug in the NooBaa GitHub repo to start a discussion on potential enhancements.
This is a paste from HPO 590:
Bug Description
Our bare metal cluster in the POK lab has a problem where one of the network bonds goes into the down state.
The node is dan1, which hosts noobaa-db-pg-0.
When the bond goes down, the noobaa-db-pg-0 pod does not fail over to another node. The database is unavailable while this condition persists, which is an I/O outage.
Logs are in https://ibm.ent.box.com/folder/145794528783?s=uueh7fp424vxs2bt4ndrnvh7uusgu6tocd They are labeled as HPO 590.
Steps to reproduce
Disable a network bond
Expected behaviour
The noobaa-db pod should fail over to another node, but not to the CSI attacher node.
nilesh-bhosale commented 14 days ago
With the bond going down, does the OCP node status become 'NotReady'? I believe k8s will trigger a pod failover to one of the available (Ready) nodes only when the node state becomes NotReady.
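The mechanism Nilesh describes can be sketched with a small helper (a hypothetical illustration, not part of any NooBaa or OpenShift tooling): Kubernetes applies the `node.kubernetes.io/not-ready` and `node.kubernetes.io/unreachable` NoExecute taints only when the node's Ready condition is False or Unknown, so a node that keeps its kubelet heartbeats flowing over another network never triggers eviction.

```shell
# Hypothetical helper: classify whether Kubernetes would evict pods from a
# node, given its Ready condition status. The status can be obtained with:
#   oc get node <node> -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'
node_would_evict() {
  case "$1" in
    False)   echo "evicts: node.kubernetes.io/not-ready NoExecute taint applies" ;;
    Unknown) echo "evicts: node.kubernetes.io/unreachable NoExecute taint applies" ;;
    True)    echo "stays: node reports Ready, no eviction taint is applied" ;;
    *)       echo "unexpected status: $1" ;;
  esac
}

# In this outage the bond is down but the kubelet heartbeats still flow over
# the provisioning network, so the node reports Ready and the pod stays put:
node_would_evict True
```

Even when eviction does trigger, pods by default tolerate both taints for 300 seconds (`tolerationSeconds`), so a failover would still take several minutes.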
Monica commented below: As I noticed, the OCP node is still in the Ready state; the OCP nodes use the provisioning network, and those interfaces are up and running.
# oc get nodes -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
c83f1-dan1.ocp4.pokprv.stglabs.ibm.com Ready master,worker 140d v1.22.3+e790d7f 10.28.20.45 <none> Red Hat Enterprise Linux CoreOS 49.84.202201042103-0 (Ootpa) 4.18.0-305.30.1.el8_4.x86_64 cri-o://1.22.1-10.rhaos4.9.gitf1d2c6e.el8
c83f1-dan2.ocp4.pokprv.stglabs.ibm.com Ready master,worker 140d v1.22.3+e790d7f 10.28.20.46 <none> Red Hat Enterprise Linux CoreOS 49.84.202201042103-0 (Ootpa) 4.18.0-305.30.1.el8_4.x86_64 cri-o://1.22.1-10.rhaos4.9.gitf1d2c6e.el8
c83f1-dan3.ocp4.pokprv.stglabs.ibm.com Ready master,worker 140d v1.22.3+e790d7f 10.28.20.47 <none> Red Hat Enterprise Linux CoreOS 49.84.202201042103-0 (Ootpa) 4.18.0-305.30.1.el8_4.x86_64 cri-o://1.22.1-10.rhaos4.9.gitf1d2c6e.el8
We are using the high-speed network to deploy CNSA/CSI/DAS; if one of the high-speed interfaces goes down, we are not sure how we can catch it.
TROPPENS commented 2 days ago
The bond should provide protection against single network link failures. Given that we have lost a whole bond, I assume that there have been multiple network link failures or that there is an issue in the underlying OpenShift network configuration. The current test systems are configured using the NMState Operator, which is in Tech Preview mode for OCP 4.9. This could explain the network glitch.
TROPPENS commented 2 days ago
I had a call with Nimrod to discuss resiliency against loss of the high-speed network. Current NooBaa does not have any resiliency against loss of the high-speed network or loss of the PVC. He suggested creating a bug in the NooBaa GitHub repo to start a discussion on potential enhancements.
For MVP we need to document this limitation.
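Until automated failover for this case exists, one possible manual workaround is to cordon the degraded node and delete the pod so the StatefulSet controller reschedules it elsewhere. This is only a sketch under assumptions from the report above (node name and `openshift-storage` namespace); with `DRY_RUN=1` (the default here) the commands are only printed, so the sketch is safe to exercise anywhere:

```shell
# Hypothetical manual-failover sketch; with DRY_RUN=1 commands are echoed,
# not executed. Set DRY_RUN=0 on a real cluster to actually run them.
DRY_RUN=${DRY_RUN:-1}
run() { if [ "$DRY_RUN" = 1 ]; then echo "would run: $*"; else "$@"; fi; }

NODE=c83f1-dan1.ocp4.pokprv.stglabs.ibm.com   # degraded node from the report

run oc adm cordon "$NODE"                              # keep new pods off the node
run oc delete pod noobaa-db-pg-0 -n openshift-storage  # force a reschedule
run oc adm uncordon "$NODE"                            # after the bond is restored
```

Note that rescheduling only helps if the db PVC can be attached on another node; with node-local storage the new pod would stay Pending, which ties back into the loss-of-PVC part of this discussion.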