[BUG]: CSM CR report failed when deploy CSM-replication via CSM operator

Open LinZhWei opened this issue 10 months ago • 6 comments

Bug Description

In my environment, I have two OCP clusters (the source and target clusters for replication) and two PowerFlex systems connected as storage. I deployed the replication module according to "Installation using repctl | Dell Technologies". In step 7, the CSI driver was deployed by the CSM operator on the source cluster with replication enabled (the CSM YAML file is attached; the CSI drivers on both the source and target OCP clusters are deployed by the CSM operator from the source), but the CSM CR was in a failed state.

Despite this failed status, the CSM replication modules are installed and running, and replication is still functional. I went through some of the related code in https://github.com/dell/csm-operator/blob/addc485560bfa810c00cdd28eca75133eaec9722/pkg/utils/status.go#L193. It looks like, in the function getDaemonSetStatus, the totalRunning variable is not reset to 0 inside the cluster loop, which leads to an incorrect totalAvailable calculation. Take my environment as an example: in the 1st loop iteration (cluster1), totalRunning=3 and totalAvailable += totalRunning (0+3=3). In the 2nd iteration (cluster2), totalRunning++ runs for 4 pods (it starts from 3 and ends at 7), then totalAvailable += totalRunning (3+7=10). The total available should actually be 7 here. The comparison between totalAvailable (10) and expected (7) then fails.

Logs

The csm-operator-controller keeps reporting that the daemonset is not healthy because the pod-count check fails:

daemonset status for cluster: cluster-2 {"TraceId": "vxflexos-3013"}
2024-04-19T06:25:04.309Z INFO utils/status.go:212 nodeName is vxflexos-node {"TraceId": "vxflexos-3013"}
2024-04-19T06:25:04.344Z INFO utils/status.go:237 Label is vxflexos-node {"TraceId": "vxflexos-3013"}
2024-04-19T06:25:04.359Z INFO utils/status.go:245 daemonset pod vxflexos-node-4lm5m : Running {"TraceId": "vxflexos-3013"}
2024-04-19T06:25:04.359Z INFO utils/status.go:245 daemonset pod vxflexos-node-sbx2m : Running {"TraceId": "vxflexos-3013"}
2024-04-19T06:25:04.359Z INFO utils/status.go:245 daemonset pod vxflexos-node-tk98c : Running {"TraceId": "vxflexos-3013"}
2024-04-19T06:25:04.359Z INFO utils/status.go:245 daemonset pod vxflexos-node-wfmz9 : Running {"TraceId": "vxflexos-3013"}
2024-04-19T06:25:04.359Z INFO utils/status.go:282 daemonset status available pods 7 {"TraceId": "vxflexos-3013"}
2024-04-19T06:25:04.359Z INFO utils/status.go:283 daemonset status failedCount pods 0 {"TraceId": "vxflexos-3013"}
2024-04-19T06:25:04.359Z INFO utils/status.go:284 daemonset status desired pods 4 {"TraceId": "vxflexos-3013"}
2024-04-19T06:25:04.359Z INFO utils/status.go:324 daemonset expected [7] {"TraceId": "vxflexos-3013"}
2024-04-19T06:25:04.359Z INFO utils/status.go:325 daemonset nodeStatus.Available [10] {"TraceId": "vxflexos-3013"}
2024-04-19T06:25:04.359Z INFO utils/status.go:331 deployment controllerStatus.Desired [1] {"TraceId": "vxflexos-3013"}
2024-04-19T06:25:04.359Z INFO utils/status.go:332 deployment controllerStatus.Available [1] {"TraceId": "vxflexos-3013"}
2024-04-19T06:25:04.359Z INFO utils/status.go:353 deployment or daemonset did not have enough available pods {"TraceId": "vxflexos-3013"}
2024-04-19T06:25:04.359Z INFO utils/status.go:354 deployment controllerStatus.Desired [1] {"TraceId": "vxflexos-3013"}
2024-04-19T06:25:04.359Z INFO utils/status.go:355 deployment controllerStatus.Available [1] {"TraceId": "vxflexos-3013"}
2024-04-19T06:25:04.359Z INFO utils/status.go:356 daemonset healthy: %!(EXTRA bool=false) {"TraceId": "vxflexos-3013"}
2024-04-19T06:25:04.359Z INFO utils/status.go:361 setting status to %!(EXTRA string=newStatus, *v1.ContainerStorageModuleStatus=&{{1 1 0} {10 7 0} Failed}) {"TraceId": "vxflexos-3013"}
2024-04-19T06:25:04.359Z INFO utils/status.go:394 Driver State {"TraceId": "vxflexos-3013", "Controller": {"available":"1","desired":"1","failed":"0"}, "Node": {"available":"10","desired":"7","failed":"0"}}
2024-04-19T06:25:04.359Z INFO utils/status.go:538 calculateState returns running: false {"TraceId": "vxflexos-3013"}
2024-04-19T06:25:04.359Z INFO utils/status.go:551 CSM state is failed, will requeue {"TraceId": "vxflexos-3013"}
2024-04-19T06:25:04.359Z INFO utils/status.go:553 HandleSuccess Driver state {"TraceId": "vxflexos-3013", "newStatus.State": "Failed"}

Screenshots

No response

Additional Environment Information

No response

Steps to Reproduce

Install CSM operator on OCP operator hub. Install CSM replication with repctl:

  1. Prepare admin Kubernetes clusters configs
  2. Add admin configs as clusters to repctl: ./repctl cluster add -f "/root/.kube/config-1","/root/.kube/config-2" -n "cluster-2"
  3. Install replication controller and CRDs: ./repctl create -f ../deploy/replicationcrds.all.yaml &&./repctl create -f ../deploy/controller.yaml
  4. Inject service accounts’ configs into clusters: ./repctl cluster inject --use-sa
  5. Create replication storage classes using config: ./repctl create sc --from-config ./sc_values.yaml
  6. Install the CSI driver for your chosen storage in the source cluster with the CSM operator; the replication module is enabled in the ContainerStorageModule CR YAML file:
modules:
  # Replication: allows to configure replication
  # Replication CRDs must be installed before installing driver
  - name: replication
    # enabled: Enable/Disable replication feature
    # Allowed values:
    #   true: enable replication feature(install dell-csi-replicator sidecar)
    #   false: disable replication feature(do not install dell-csi-replicator sidecar)
    # Default value: false
    enabled: true
    configVersion: v1.7.0
    components:
    - name: dell-csi-replicator
      # image: Image to use for dell-csi-replicator. This shouldn't be changed
      # Allowed values: string
      # Default value: None
      image: dellemc/dell-csi-replicator:v1.7.0
      envs:
        # replicationPrefix: prefix to prepend to storage classes parameters
        # Allowed values: string
        # Default value: replication.storage.dell.com
        - name: "X_CSI_REPLICATION_PREFIX"
          value: "replication.storage.dell.com"
        # replicationContextPrefix: prefix to use for naming of resources created by replication feature
        # Allowed values: string
        - name: "X_CSI_REPLICATION_CONTEXT_PREFIX"
          value: "powerflex"

    - name: dell-replication-controller-manager
      # image: Defines controller image. This shouldn't be changed
      # Allowed values: string
      image: dellemc/dell-replication-controller:v1.7.0
      envs:
        # TARGET_CLUSTERS_IDS: comma separated list of cluster IDs of the targets clusters. DO NOT include the source(wherever CSM Operator is deployed) cluster ID
        # Set the value to "self" in case of stretched/single cluster configuration
        # Allowed values: string
        - name: "TARGET_CLUSTERS_IDS"
          value: "cluster-2"

Expected Behavior

CSM CR resource is not in failed state.

CSM Driver(s)

CSM operator: 1.4.4, csi: v2.9.0, replication: v1.7.0

Installation Type

No response

Container Storage Modules Enabled

No response

Container Orchestrator

OCP

Operating System

ocp 4.13.34

LinZhWei avatar Apr 22 '24 09:04 LinZhWei

In 1.4.4, this is a known issue. Could you try using v1.5.0 of the Operator?

atye avatar Apr 22 '24 13:04 atye

Hi @LinZhWei It also appears that you have mixed up the steps from Helm/repctl and CSM-Operator. To deploy replication module via CSM-Operator, please follow the steps in https://dell.github.io/csm-docs/docs/deployment/csmoperator/modules/replication/ where repctl preconfigures the clusters and lets Operator install the driver and module.

santhoshatdell avatar Apr 22 '24 13:04 santhoshatdell

@atye Thanks for this info, but the issue still exists with the new operator v1.5.0 (Community) from the OCP OperatorHub.

@santhoshatdell My bad, I pasted the wrong steps. The biggest difference between the Helm steps and the CSM operator steps is that in step 3 the replication controller is not deployed by repctl, right? I've also done the installation with the CSM operator following the steps in https://dell.github.io/csm-docs/docs/deployment/csmoperator/modules/replication/, but I get the same error. So the installation steps have nothing to do with this bug, am I right?

LinZhWei avatar Apr 23 '24 11:04 LinZhWei

@LinZhWei thanks for the update on this -- it does look like a bug in our status reporting function. For now, I would go ahead and verify the status of the deployment by checking the status of the pods, and we will work on getting this fixed asap.

jooseppi-luna avatar Apr 23 '24 12:04 jooseppi-luna

/sync

jooseppi-luna avatar Apr 24 '24 17:04 jooseppi-luna

link: 23603

csmbot avatar Apr 25 '24 00:04 csmbot

The fix is already present in main: https://github.com/dell/csm-operator/blob/main/pkg/utils/status.go#L201. The change was introduced as part of the 1.10.2 patch release with CSM operator v1.4.2: https://github.com/dell/csm-operator/blob/v1.4.2/pkg/utils/status.go

nitesh3108 avatar May 21 '24 05:05 nitesh3108