ceph-csi

Enhancement: Multi-Ceph Cluster Support for Topology-Aware Volume Provisioning with ConfigMap-Based Management

ahnuyh opened this issue 10 months ago · 6 comments

Describe the feature you'd like to have

I would like to propose an enhancement to the existing topology-aware volume provisioning in ceph-csi to support multi-ceph cluster environments. Currently, the topology-aware provisioning assumes volume creation within a single Ceph cluster. I'd like to extend this functionality to allow mapping of specific zones to different Ceph clusters, enabling the provisioning system to select the appropriate Ceph cluster based on the zone where a pod is scheduled.

What is the value to the end user? (why is it a priority?)

This feature would provide several benefits to end users:

  1. Elimination of single-point-of-failure: By distributing storage across multiple Ceph clusters aligned with Kubernetes zones, we can avoid having the entire regional Kubernetes setup dependent on a single Ceph cluster.

  2. Improved data locality: Volumes would be created in the Ceph cluster that corresponds to the zone where pods are running, potentially reducing network latency.

  3. Better isolation and fault tolerance: Storage failures would be contained within specific zones/clusters rather than affecting the entire environment.

  4. Enhanced scalability: Organizations can scale their storage infrastructure horizontally by adding new Ceph clusters for new zones.

How will we know we have a good solution? (acceptance criteria)

The solution should meet the following criteria:

  1. StorageClass should support specifying multiple Ceph clusters with their corresponding topology information (zones).

  2. When a PVC is created, the provisioner should be able to identify the appropriate Ceph cluster based on the pod's scheduling constraints or node affinity rules.

  3. The solution should seamlessly integrate with existing topology-aware scheduling in Kubernetes.

  4. No changes should be required in applications using the PVCs.

  5. The feature should include documentation on how to configure and use multi-cluster topology-aware provisioning.

  6. Existing deployments using single-cluster topology should continue to work without modification.

  7. The solution should provide clear error messages when no suitable Ceph cluster can be found for a given topology constraint.

Additional context

Here's a sequence diagram showing the proposed workflow:

sequenceDiagram
    participant User
    participant K8s as Kubernetes API
    participant CM as ConfigMap
    participant SC as StorageClass
    participant CSI as CSI Controller
    participant Scheduler as K8s Scheduler
    participant Node as K8s Node
    participant CephA as Ceph Cluster A (Zone A)
    participant CephB as Ceph Cluster B (Zone B)

    User->>K8s: Create cluster topology ConfigMap
    K8s-->>User: ConfigMap created
    User->>K8s: Create StorageClass with volumeBindingMode: WaitForFirstConsumer
    K8s-->>User: StorageClass created
    
    User->>K8s: Create StatefulSet with PVCs using StorageClass
    K8s-->>User: StatefulSet created
    K8s->>K8s: Create unbound PVCs
    
    Note over K8s, Scheduler: For each pod in StatefulSet
    K8s->>Scheduler: Schedule pod
    Scheduler->>K8s: Pod assigned to specific node in Zone A
    K8s->>CSI: CreateVolumeRequest with selected-node and zone info
    CSI->>CSI: pickZoneFromNode() extracts zone from node
    CSI->>CM: Get cluster topology configuration
    CM-->>CSI: Return topology mapping
    CSI->>CSI: Match zone to appropriate Ceph cluster
    Note over CSI: Determine that Zone A maps to Ceph Cluster A
    CSI->>CephA: Create volume
    CephA-->>CSI: Volume created
    CSI->>K8s: Create PV with node affinity for Zone A
    K8s->>K8s: Bind PVC to PV
    K8s->>Node: Start pod with bound volume
    Node->>CSI: Stage and publish volume
    CSI->>CM: Get cluster info for volume
    CM-->>CSI: Return cluster A connection details
    CSI->>CephA: Connect to volume
    CephA-->>Node: Volume mounted
    
    Note over User,K8s: Later - Update topology (no disruption to existing volumes)
    User->>K8s: Update cluster topology ConfigMap
    K8s-->>CM: ConfigMap updated
    Note over CSI: New volumes use updated topology mapping
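
To make the "match zone to appropriate Ceph cluster" step above concrete, here is a minimal Go sketch of how the controller could pick a cluster at CreateVolume time and return a clear error when no cluster serves the zone (acceptance criterion 7). The topology key, the clusterForZone map, and the function names are illustrative assumptions for this proposal, not existing ceph-csi code; only the generic CSI types and gRPC status helpers are real.

package main

import (
	"fmt"

	"github.com/container-storage-interface/spec/lib/go/csi"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// zoneTopologyKey is an assumption; the actual key depends on how the
// provisioner and driver are deployed (e.g. topology.kubernetes.io/zone
// or a driver-specific domain label).
const zoneTopologyKey = "topology.kubernetes.io/zone"

// pickZone returns the zone from the accessibility requirements that the
// external-provisioner derives from the selected node.
func pickZone(req *csi.TopologyRequirement) string {
	if req == nil {
		return ""
	}
	for _, list := range [][]*csi.Topology{req.GetPreferred(), req.GetRequisite()} {
		for _, t := range list {
			if zone, ok := t.GetSegments()[zoneTopologyKey]; ok {
				return zone
			}
		}
	}
	return ""
}

// selectCluster maps the zone to a clusterID using an index built from the
// proposed cluster topology ConfigMap (see the example below). It returns a
// clear error when no cluster is configured for the zone.
func selectCluster(zone string, clusterForZone map[string]string) (string, error) {
	if id, ok := clusterForZone[zone]; ok {
		return id, nil
	}
	return "", status.Errorf(codes.InvalidArgument,
		"no Ceph cluster configured for zone %q in cluster topology ConfigMap", zone)
}

func main() {
	req := &csi.TopologyRequirement{
		Preferred: []*csi.Topology{
			{Segments: map[string]string{zoneTopologyKey: "us-east-1a"}},
		},
	}
	clusterForZone := map[string]string{"us-east-1a": "cluster-a", "us-east-1c": "cluster-b"}
	id, err := selectCluster(pickZone(req), clusterForZone)
	fmt.Println(id, err) // cluster-a <nil>
}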

  • storage-class
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: csi-rbd-multi-cluster
provisioner: rbd.csi.ceph.com
parameters:
  clusterTopologyConfigMap: ceph-cluster-topology
  • configmap
apiVersion: v1
kind: ConfigMap
metadata:
  name: ceph-cluster-topology
  namespace: ceph-csi
data:
  config.json: |
    {
      "clusterTopology": [
        {
          "clusterID": "cluster-a",
          "monitors": "mon1:port,mon2:port,mon3:port",
          "zones": ["us-east-1a", "us-east-1b"],
          "pool": "replicapool",
          "cephfs": {
            "subvolumePath": "/volumes"
          }
        },
        {
          "clusterID": "cluster-b",
          "monitors": "mon4:port,mon5:port,mon6:port",
          "zones": ["us-east-1c", "us-east-1d"],
          "pool": "replicapool",
          "cephfs": {
            "subvolumePath": "/volumes"
          }
        }
      ]
    }
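
For reference, a minimal Go sketch of structs that mirror this proposed config.json, and of building the zone-to-cluster index the controller could consult on each CreateVolume call. The type and field names are illustrative assumptions, not part of the current ceph-csi code base.

package main

import (
	"encoding/json"
	"fmt"
)

// ClusterTopologyEntry mirrors one entry of the proposed config.json above.
type ClusterTopologyEntry struct {
	ClusterID string   `json:"clusterID"`
	Monitors  string   `json:"monitors"`
	Zones     []string `json:"zones"`
	Pool      string   `json:"pool"`
	CephFS    struct {
		SubvolumePath string `json:"subvolumePath"`
	} `json:"cephfs"`
}

// ClusterTopologyConfig is the top-level shape of config.json.
type ClusterTopologyConfig struct {
	ClusterTopology []ClusterTopologyEntry `json:"clusterTopology"`
}

// buildZoneIndex turns the topology list into a zone -> clusterID lookup.
func buildZoneIndex(raw []byte) (map[string]string, error) {
	var cfg ClusterTopologyConfig
	if err := json.Unmarshal(raw, &cfg); err != nil {
		return nil, fmt.Errorf("parsing cluster topology config: %w", err)
	}
	index := make(map[string]string)
	for _, c := range cfg.ClusterTopology {
		for _, z := range c.Zones {
			index[z] = c.ClusterID
		}
	}
	return index, nil
}

func main() {
	raw := []byte(`{"clusterTopology":[{"clusterID":"cluster-a","zones":["us-east-1a","us-east-1b"]}]}`)
	index, err := buildZoneIndex(raw)
	fmt.Println(index, err) // map[us-east-1a:cluster-a us-east-1b:cluster-a] <nil>
}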

ahnuyh · Feb 25 '25 14:02

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in a week if no further activity occurs. Thank you for your contributions.

github-actions[bot] · Mar 27 '25 21:03

This issue has been automatically closed due to inactivity. Please re-open if this still requires investigation.

github-actions[bot] · Apr 03 '25 21:04

Hey! This looks like an amazing feature that would be useful for some k8s users. Might be worth taking a look at?

@Madhu-1 Don't you think it could be implemented in future versions of the plugin?

lechugaletal · Jun 09 '25 10:06

@lechugaletal Yes, we could have it. If someone is planning to work on it, it would be a great feature to have in cephcsi.

Madhu-1 · Jun 11 '25 08:06

I think the idea would be to have a mapping between the existing Kubernetes node label for the failure zone, topology.kubernetes.io/zone, and a new field introduced in the Ceph cluster definitions on the ceph-csi side. You proposed the ceph-csi field to be named "zones". I would suggest the default value be "*" for no restriction. I like the idea of being able to specify multiple zones per Ceph cluster.

The PV mount should be impossible if the zone of the parent Ceph cluster and the zone of the Kubernetes worker node do not match.

However, PV creation should not fail. The way to guarantee that the PV and its pods land in the same zone is to set volumeBindingMode: WaitForFirstConsumer on the StorageClass. That way the PV is created only after the pod is scheduled into a zone.

https://kubernetes.io/docs/reference/labels-annotations-taints/#topologykubernetesiozone
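
A minimal Go sketch of the wildcard behaviour described above, assuming a per-cluster "zones" list where "*" (or an empty list) means no restriction; the helper name is illustrative, not existing ceph-csi code.

package main

import "fmt"

// zoneAllowed reports whether a node's topology.kubernetes.io/zone value is
// served by a cluster whose "zones" list may contain "*" for no restriction.
func zoneAllowed(nodeZone string, clusterZones []string) bool {
	if len(clusterZones) == 0 {
		return true // assumed default: unrestricted when no zones are declared
	}
	for _, z := range clusterZones {
		if z == "*" || z == nodeZone {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(zoneAllowed("us-east-1a", []string{"*"}))                        // true
	fmt.Println(zoneAllowed("us-east-1a", []string{"us-east-1c", "us-east-1d"})) // false
}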

julienlau · Jun 13 '25 15:06

This feature exists in some closed-source storage solutions such as Portworx (https://docs.portworx.com/portworx-enterprise/operations/operate-kubernetes/cluster-topology), but that solution is not open source.

julienlau · Jun 13 '25 16:06

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in a week if no further activity occurs. Thank you for your contributions.

github-actions[bot] · Jul 13 '25 21:07

This issue has been automatically closed due to inactivity. Please re-open if this still requires investigation.

github-actions[bot] · Jul 21 '25 21:07