
HPO 590 follow-up for discussion: noobaa-db-pg-0 does not failover if its host worker node has a network bond that goes down

Open · MonicaLemay opened this issue on Mar 31, 2022 · 0 comments

Environment info

  • NooBaa Version:
    INFO[0000] CLI version: 5.9.0
    INFO[0000] noobaa-image: quay.io/rhceph-dev/odf4-mcg-core-rhel8@sha256:ef1dc9679ba33ad449f29ab4930bd8b1e3d717ebb29cca855dab3749dbb6d8e4
    INFO[0000] operator-image: quay.io/rhceph-dev/odf4-mcg-rhel8-operator@sha256:01a31a47a43f01c333981056526317dfec70d1072dbd335c8386e0b3f63ef052
    INFO[0000] noobaa-db-image: quay.io/rhceph-dev/rhel8-postgresql-12@sha256:98990a28bec6aa05b70411ea5bd9c332939aea02d9d61eedf7422a32cfa0be54
    INFO[0000] Namespace: openshift-storage
  • Platform:

[root@c83f1-infa ~]# oc get csv
NAME                  DISPLAY                       VERSION   REPLACES              PHASE
mcg-operator.v4.9.5   NooBaa Operator               4.9.5     mcg-operator.v4.9.4   Succeeded
ocs-operator.v4.9.5   OpenShift Container Storage   4.9.5     ocs-operator.v4.9.4   Succeeded
odf-operator.v4.9.5   OpenShift Data Foundation     4.9.5     odf-operator.v4.9.4   Succeeded
[root@c83f1-infa ~]#

Actual behavior

Bug Description

This is a NooBaa follow-up to this HPO defect: https://github.ibm.com/IBMSpectrumScale/hpo-core/issues/590

I am pasting 590 below, but to help the reader of this NooBaa defect, I am first pasting the last comment from Ulf and Nimrod, who requested that this defect be opened as a place to discuss this type of outage as well as loss of the PVC.


TROPPENS commented 2 days ago

I had a call with Nimrod to discuss resiliency against loss of the high-speed network. Current NooBaa does not have any resiliency against the loss of the high-speed network or the loss of the PVC. He suggested creating a bug in the NooBaa GitHub repository to start a discussion on potential enhancements.

This is a paste from HPO 590:

Bug Description

Our bare-metal cluster in the POK lab has a problem where one of the network bonds goes to state down. (Screenshot from 2022-03-31 showing the bond in the down state.)

The affected node is dan1, which hosts the noobaa-db-pg-0 pod.

When the bond goes down, the noobaa-db-pg-0 pod does not fail over to another node. The database is unavailable while this condition persists, which results in an I/O outage.
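
For reference, a minimal way to confirm where the DB pod is running and whether it ever gets rescheduled (the openshift-storage namespace is taken from the environment info above):

```
# Show which node currently hosts the NooBaa DB pod
oc get pod noobaa-db-pg-0 -n openshift-storage -o wide

# Watch for (the absence of) rescheduling or eviction events while the bond is down
oc get events -n openshift-storage --field-selector involvedObject.name=noobaa-db-pg-0 -w
```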

Logs are in https://ibm.ent.box.com/folder/145794528783?s=uueh7fp424vxs2bt4ndrnvh7uusgu6tocd
They are labeled as hpo 590.

Steps to reproduce

Disable a network bond
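
A minimal sketch of how this can be simulated from the cluster, assuming the bonded interface is named bond1 (the actual bond name must be taken from the host configuration):

```
# Bring the bond down on the worker node hosting noobaa-db-pg-0 (bond1 is hypothetical)
oc debug node/c83f1-dan1.ocp4.pokprv.stglabs.ibm.com -- chroot /host ip link set bond1 down
```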

Expected behaviour

The noobaa-db-pg-0 pod should fail over to another node, but not to the CSI attacher node.
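
To illustrate the "not the CSI attacher node" part of this expectation, one hypothetical direction is a scheduling constraint on the DB StatefulSet. The sketch below assumes a node label csi-attacher=true on the attacher node; it is not an existing NooBaa mechanism, and the operator may reconcile such a patch away:

```
# Hypothetical sketch: keep the DB pod off nodes labeled csi-attacher=true
oc -n openshift-storage patch statefulset noobaa-db-pg --type merge -p '
spec:
  template:
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: csi-attacher
                operator: NotIn
                values: ["true"]'
```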

nilesh-bhosale commented 14 days ago

With the bond going down, does the OCP node status become 'NotReady'? I believe k8s will trigger a pod failover to one of the available (Ready) nodes in the cluster only when the node state becomes NotReady.
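
For context, whether and when such a failover happens is governed by the node's Ready condition and the pod's default tolerations; both can be inspected directly (pod name and namespace as above):

```
# Ready condition of the affected node as seen by the control plane
oc get node c83f1-dan1.ocp4.pokprv.stglabs.ibm.com \
  -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}{"\n"}'

# Default tolerations on the DB pod: node.kubernetes.io/not-ready and
# node.kubernetes.io/unreachable with tolerationSeconds: 300 mean eviction
# would only start ~5 minutes after the node actually goes NotReady
oc get pod noobaa-db-pg-0 -n openshift-storage -o jsonpath='{.spec.tolerations}{"\n"}'
```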

Monica commented below: As I noticed, the OCP node is still in the Ready state. The OCP nodes use the provisioning network, and those interfaces are up and running.

# oc get nodes -o wide
NAME                                     STATUS   ROLES           AGE    VERSION           INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                                                       KERNEL-VERSION                 CONTAINER-RUNTIME
c83f1-dan1.ocp4.pokprv.stglabs.ibm.com   Ready    master,worker   140d   v1.22.3+e790d7f   10.28.20.45   <none>        Red Hat Enterprise Linux CoreOS 49.84.202201042103-0 (Ootpa)   4.18.0-305.30.1.el8_4.x86_64   cri-o://1.22.1-10.rhaos4.9.gitf1d2c6e.el8
c83f1-dan2.ocp4.pokprv.stglabs.ibm.com   Ready    master,worker   140d   v1.22.3+e790d7f   10.28.20.46   <none>        Red Hat Enterprise Linux CoreOS 49.84.202201042103-0 (Ootpa)   4.18.0-305.30.1.el8_4.x86_64   cri-o://1.22.1-10.rhaos4.9.gitf1d2c6e.el8
c83f1-dan3.ocp4.pokprv.stglabs.ibm.com   Ready    master,worker   140d   v1.22.3+e790d7f   10.28.20.47   <none>        Red Hat Enterprise Linux CoreOS 49.84.202201042103-0 (Ootpa)   4.18.0-305.30.1.el8_4.x86_64   cri-o://1.22.1-10.rhaos4.9.gitf1d2c6e.el8

We are using the high-speed network to deploy CNSA/CSI/DAS. If one of the high-speed interfaces goes down, we are not sure how we can catch it.
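
Since the kubelet's heartbeats apparently travel over the provisioning network (which stays up), the node remains Ready and the bond failure is only visible on the host itself. A possible way to check it, again assuming the bonded interface is named bond1:

```
# Bond status as reported by the host kernel (bond1 is an assumption)
oc debug node/c83f1-dan1.ocp4.pokprv.stglabs.ibm.com -- chroot /host cat /proc/net/bonding/bond1

# Brief link-state summary of all interfaces on the node
oc debug node/c83f1-dan1.ocp4.pokprv.stglabs.ibm.com -- chroot /host ip -br link show
```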

TROPPENS commented 2 days ago

The bond should provide protection against single network link failures. Given that we have lost a whole bond, I assume that there have been multiple network link failures or that there is an issue in the underlying OpenShift network configuration. The current test systems are configured using the NMState Operator, which is in Tech Preview mode for OCP 4.9. This could explain the network glitch.
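
If an NMState misconfiguration is suspected, the applied policies and the network state reported for the affected node can be inspected directly on the cluster:

```
# Policies applied by the NMState Operator and their status
oc get nodenetworkconfigurationpolicy

# Network state reported for the affected node (includes bond and member link states)
oc get nodenetworkstate c83f1-dan1.ocp4.pokprv.stglabs.ibm.com -o yaml
```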

TROPPENS commented 2 days ago

I had a call with Nimrod to discuss resiliency against loss of the high-speed network. Current NooBaa does not have any resiliency against the loss of the high-speed network or the loss of the PVC. He suggested creating a bug in the NooBaa GitHub repository to start a discussion on potential enhancements.

For the MVP we need to document a limitation.

