csm [BUG]: CSM CR report failed when deploy CSM-replication via CSM operator

Bug Description

In my environment I have two OCP clusters (the source and target clusters for replication) with two PowerFlex systems connected as storage. I deployed the replication module following "Installation using repctl | Dell Technologies". In Step 7, the CSI driver was deployed by the CSM operator on the source cluster with replication enabled (the CSM YAML file is attached); the CSI drivers on both the source and target OCP clusters are deployed by the CSM operator from the source. After deployment, however, the csm CR was in a failed state.

Despite the failed status, the CSM replication modules are installed and running, and replication still works. I went through the related code at https://github.com/dell/csm-operator/blob/addc485560bfa810c00cdd28eca75133eaec9722/pkg/utils/status.go#L193. It looks like, in the function getDaemonSetStatus, the totalRunning variable is not reset to 0 inside the cluster loop, which leads to an incorrect totalAvialable calculation. Taking my environment as an example: in the 1st loop, on cluster1, totalRunning=3 and totalAvialable += totalRunning (0+3=3). In the 2nd loop, on cluster2, totalRunning++ runs for 4 pods (starting from 3 and ending at 7), then totalAvailable += totalRunning (3+7=10). The total available should actually be 7 here, so the comparison between totalAvailable (10) and expected (7) fails.
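The accumulation described above can be reproduced in isolation. The following is a minimal sketch, not the operator's actual code: the type clusterPods, the functions countBuggy, countFixed and sampleClusters, and the cluster name "cluster-1" are all hypothetical, and the pod counts (3 and 4) are taken from the example in this report.

```go
package main

import "fmt"

// clusterPods is a hypothetical stand-in for the pod phases listed per
// cluster; it is not a type from csm-operator.
type clusterPods struct {
	name   string
	phases []string
}

// countBuggy keeps the running counter outside the cluster loop, so the
// count from one cluster carries over into the next iteration.
func countBuggy(clusters []clusterPods) int {
	totalAvailable := 0
	totalRunning := 0 // never reset between clusters
	for _, c := range clusters {
		for _, phase := range c.phases {
			if phase == "Running" {
				totalRunning++
			}
		}
		totalAvailable += totalRunning
	}
	return totalAvailable
}

// countFixed resets the running counter at the start of each cluster.
func countFixed(clusters []clusterPods) int {
	totalAvailable := 0
	for _, c := range clusters {
		totalRunning := 0 // reset per cluster
		for _, phase := range c.phases {
			if phase == "Running" {
				totalRunning++
			}
		}
		totalAvailable += totalRunning
	}
	return totalAvailable
}

// sampleClusters mirrors the environment in this report: 3 running node
// pods on the source cluster and 4 on cluster-2.
func sampleClusters() []clusterPods {
	return []clusterPods{
		{name: "cluster-1", phases: []string{"Running", "Running", "Running"}},
		{name: "cluster-2", phases: []string{"Running", "Running", "Running", "Running"}},
	}
}

func main() {
	clusters := sampleClusters()
	fmt.Println("buggy:", countBuggy(clusters)) // 3 + (3+4) = 10
	fmt.Println("fixed:", countFixed(clusters)) // 3 + 4 = 7
}
```

With the counter reset per cluster, the total matches the expected value of 7 instead of 10.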

Logs

The csm-operator-controller always complains that the daemonset is not in a healthy state because the check on the pod count fails. The log timestamps run from 2024-04-19T06:25:04.309Z to 2024-04-19T06:25:04.359Z:

daemonset status for cluster: cluster-2 {"TraceId": "vxflexos-3013"}
INFO utils/status.go:212 nodeName is vxflexos-node {"TraceId": "vxflexos-3013"}
INFO utils/status.go:237 Label is vxflexos-node {"TraceId": "vxflexos-3013"}
INFO utils/status.go:245 daemonset pod vxflexos-node-4lm5m : Running {"TraceId": "vxflexos-3013"}
INFO utils/status.go:245 daemonset pod vxflexos-node-sbx2m : Running {"TraceId": "vxflexos-3013"}
INFO utils/status.go:245 daemonset pod vxflexos-node-tk98c : Running {"TraceId": "vxflexos-3013"}
INFO utils/status.go:245 daemonset pod vxflexos-node-wfmz9 : Running {"TraceId": "vxflexos-3013"}
INFO utils/status.go:282 daemonset status available pods 7 {"TraceId": "vxflexos-3013"}
INFO utils/status.go:283 daemonset status failedCount pods 0 {"TraceId": "vxflexos-3013"}
INFO utils/status.go:284 daemonset status desired pods 4 {"TraceId": "vxflexos-3013"}
INFO utils/status.go:324 daemonset expected [7] {"TraceId": "vxflexos-3013"}
INFO utils/status.go:325 daemonset nodeStatus.Available [10] {"TraceId": "vxflexos-3013"}
INFO utils/status.go:331 deployment controllerStatus.Desired [1] {"TraceId": "vxflexos-3013"}
INFO utils/status.go:332 deployment controllerStatus.Available [1] {"TraceId": "vxflexos-3013"}
INFO utils/status.go:353 deployment or daemonset did not have enough available pods {"TraceId": "vxflexos-3013"}
INFO utils/status.go:354 deployment controllerStatus.Desired [1] {"TraceId": "vxflexos-3013"}
INFO utils/status.go:355 deployment controllerStatus.Available [1] {"TraceId": "vxflexos-3013"}
INFO utils/status.go:356 daemonset healthy: %!(EXTRA bool=false) {"TraceId": "vxflexos-3013"}
INFO utils/status.go:361 setting status to %!(EXTRA string=newStatus, *v1.ContainerStorageModuleStatus=&{{1 1 0} {10 7 0} Failed}) {"TraceId": "vxflexos-3013"}
INFO utils/status.go:394 Driver State {"TraceId": "vxflexos-3013", "Controller": {"available":"1","desired":"1","failed":"0"}, "Node": {"available":"10","desired":"7","failed":"0"}}
INFO utils/status.go:538 calculateState returns running: false {"TraceId": "vxflexos-3013"}
INFO utils/status.go:551 CSM state is failed, will requeue {"TraceId": "vxflexos-3013"}
INFO utils/status.go:553 HandleSuccess Driver state {"TraceId": "vxflexos-3013", "newStatus.State": "Failed"}
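The telltale sign in the "Driver State" entry is that the Node status reports more available pods (10) than desired (7), which should not happen for a healthy DaemonSet count. As a rough sketch of how that entry could be checked when scanning logs, the snippet below parses the JSON payload copied from the log line above. The names podStatus, driverState, parseDriverState, overcounted and samplePayload are hypothetical helpers for illustration, not part of csm-operator.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
)

// podStatus mirrors the string-valued counters seen in the log payload.
type podStatus struct {
	Available string `json:"available"`
	Desired   string `json:"desired"`
	Failed    string `json:"failed"`
}

// driverState holds the fields of the "Driver State" log payload.
type driverState struct {
	TraceID    string    `json:"TraceId"`
	Controller podStatus `json:"Controller"`
	Node       podStatus `json:"Node"`
}

// samplePayload is the JSON part of the "Driver State" entry in this report.
const samplePayload = `{"TraceId": "vxflexos-3013", "Controller": {"available":"1","desired":"1","failed":"0"}, "Node": {"available":"10","desired":"7","failed":"0"}}`

// parseDriverState decodes the JSON payload of a "Driver State" log entry.
func parseDriverState(payload string) (driverState, error) {
	var st driverState
	err := json.Unmarshal([]byte(payload), &st)
	return st, err
}

// overcounted reports whether available exceeds desired, which points at a
// counting problem rather than at unhealthy pods.
func overcounted(s podStatus) (bool, error) {
	available, err := strconv.Atoi(s.Available)
	if err != nil {
		return false, err
	}
	desired, err := strconv.Atoi(s.Desired)
	if err != nil {
		return false, err
	}
	return available > desired, nil
}

func main() {
	st, err := parseDriverState(samplePayload)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	nodeOver, _ := overcounted(st.Node)
	ctrlOver, _ := overcounted(st.Controller)
	fmt.Println("node overcounted:", nodeOver)
	fmt.Println("controller overcounted:", ctrlOver)
}
```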

Screenshots

No response

Additional Environment Information

No response

Steps to Reproduce

Install the CSM operator from the OCP operator hub. Install CSM replication with repctl:

Prepare admin Kubernetes clusters configs
Add admin configs as clusters to repctl: ./repctl cluster add -f "/root/.kube/config-1","/root/.kube/config-2" -n "cluster-2"
Install replication controller and CRDs: ./repctl create -f ../deploy/replicationcrds.all.yaml && ./repctl create -f ../deploy/controller.yaml
Inject service accounts’ configs into clusters: ./repctl cluster inject --use-sa
Create replication storage classes using config: ./repctl create sc --from-config ./sc_values.yaml
Install the CSI driver for your chosen storage in the source cluster with the CSM operator, with the replication module enabled in the ContainerStorageModule CR YAML file:

modules:
  # Replication: allows to configure replication
  # Replication CRDs must be installed before installing driver
  - name: replication
    # enabled: Enable/Disable replication feature
    # Allowed values:
    #   true: enable replication feature(install dell-csi-replicator sidecar)
    #   false: disable replication feature(do not install dell-csi-replicator sidecar)
    # Default value: false
    enabled: true
    configVersion: v1.7.0
    components:
    - name: dell-csi-replicator
      # image: Image to use for dell-csi-replicator. This shouldn't be changed
      # Allowed values: string
      # Default value: None
      image: dellemc/dell-csi-replicator:v1.7.0
      envs:
        # replicationPrefix: prefix to prepend to storage classes parameters
        # Allowed values: string
        # Default value: replication.storage.dell.com
        - name: "X_CSI_REPLICATION_PREFIX"
          value: "replication.storage.dell.com"
        # replicationContextPrefix: prefix to use for naming of resources created by replication feature
        # Allowed values: string
        - name: "X_CSI_REPLICATION_CONTEXT_PREFIX"
          value: "powerflex"

    - name: dell-replication-controller-manager
      # image: Defines controller image. This shouldn't be changed
      # Allowed values: string
      image: dellemc/dell-replication-controller:v1.7.0
      envs:
        # TARGET_CLUSTERS_IDS: comma separated list of cluster IDs of the targets clusters. DO NOT include the source(wherever CSM Operator is deployed) cluster ID
        # Set the value to "self" in case of stretched/single cluster configuration
        # Allowed values: string
        - name: "TARGET_CLUSTERS_IDS"
          value: "cluster-2"

Expected Behavior

CSM CR resource is not in failed state.

CSM Driver(s)

CSM operator: 1.4.4, CSI: v2.9.0, replication: v1.7.0

Installation Type

No response

Container Storage Modules Enabled

No response

Container Orchestrator

OCP

Operating System

ocp 4.13.34

Apr 22 '24 09:04 LinZhWei

In 1.4.4, this is a known issue. Could you try using v1.5.0 of the Operator?

Apr 22 '24 13:04 atye

Hi @LinZhWei, it also appears that you have mixed up the steps from Helm/repctl and CSM-Operator. To deploy the replication module via CSM-Operator, please follow the steps in https://dell.github.io/csm-docs/docs/deployment/csmoperator/modules/replication/ where repctl preconfigures the clusters and lets the Operator install the driver and module.

Apr 22 '24 13:04 santhoshatdell

@atye Thanks for this info, but the issue still exists when using the new operator v1.5.0 (Community) from the OCP operator hub.

@santhoshatdell My bad, I pasted the wrong steps. The biggest difference between the steps for Helm and for the CSM operator is that in step 3 the replication-controller is not deployed by repctl, right? I've also done the installation with the CSM operator following the steps in https://dell.github.io/csm-docs/docs/deployment/csmoperator/modules/replication/, but got the same error. So the steps have nothing to do with this bug, am I right?

Apr 23 '24 11:04 LinZhWei

@LinZhWei thanks for the update on this -- it does look like a bug in our status reporting function. For now, I would go ahead and verify the status of the deployment by checking the status of the pods, and we will work on getting this fixed asap.

Apr 23 '24 12:04 jooseppi-luna

/sync

Apr 24 '24 17:04 jooseppi-luna

link: 23603

Apr 25 '24 00:04 csmbot

The fix is already present in main: https://github.com/dell/csm-operator/blob/main/pkg/utils/status.go#L201. The change was introduced as part of the 1.10.2 patch release with CSM operator v1.4.2: https://github.com/dell/csm-operator/blob/v1.4.2/pkg/utils/status.go

May 21 '24 05:05 nitesh3108
