Enhancement: Multi-Ceph Cluster Support for Topology-Aware Volume Provisioning with ConfigMap-Based Management
Describe the feature you'd like to have
I would like to propose an enhancement to the existing topology-aware volume provisioning in ceph-csi to support multi-Ceph-cluster environments. Currently, topology-aware provisioning assumes volumes are created within a single Ceph cluster. I'd like to extend this functionality so that specific zones can be mapped to different Ceph clusters, enabling the provisioner to select the appropriate Ceph cluster based on the zone where a pod is scheduled.
What is the value to the end user? (why is it a priority?)
This feature would provide several benefits to end users:
- Elimination of a single point of failure: by distributing storage across multiple Ceph clusters aligned with Kubernetes zones, the entire regional Kubernetes setup no longer depends on a single Ceph cluster.
- Improved data locality: volumes are created in the Ceph cluster that corresponds to the zone where pods are running, potentially reducing network latency.
- Better isolation and fault tolerance: storage failures are contained within specific zones/clusters rather than affecting the entire environment.
- Enhanced scalability: organizations can scale their storage infrastructure horizontally by adding new Ceph clusters for new zones.
How will we know we have a good solution? (acceptance criteria)
The solution should meet the following criteria:
- StorageClass should support specifying multiple Ceph clusters with their corresponding topology information (zones).
- When a PVC is created, the provisioner should be able to identify the appropriate Ceph cluster based on the pod's scheduling constraints or node affinity rules.
- The solution should integrate seamlessly with existing topology-aware scheduling in Kubernetes.
- No changes should be required in applications using the PVCs.
- The feature should include documentation on how to configure and use multi-cluster topology-aware provisioning.
- Existing deployments using single-cluster topology should continue to work without modification.
- The solution should provide clear error messages when no suitable Ceph cluster can be found for a given topology constraint (see the sketch after this list).
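To make the last criterion concrete, here is a minimal sketch (in Go, since ceph-csi is written in Go) of the zone-to-cluster lookup the provisioner could perform, returning a descriptive error when nothing matches. `ClusterMapping` and `PickClusterForZone` are hypothetical names used only for illustration, not existing ceph-csi code:

```go
// Illustrative only: ClusterMapping and PickClusterForZone are hypothetical
// names, not existing ceph-csi types or functions.
package topology

import "fmt"

// ClusterMapping associates one Ceph cluster with the zones it serves.
type ClusterMapping struct {
	ClusterID string
	Zones     []string
}

// PickClusterForZone returns the cluster that serves the given zone, or a
// descriptive error so the failure surfaces clearly in PVC events.
func PickClusterForZone(mappings []ClusterMapping, zone string) (string, error) {
	for _, m := range mappings {
		for _, z := range m.Zones {
			if z == zone {
				return m.ClusterID, nil
			}
		}
	}
	return "", fmt.Errorf("no Ceph cluster configured for zone %q", zone)
}
```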
Additional context
Here's a sequence diagram showing the proposed workflow:
```mermaid
sequenceDiagram
participant User
participant K8s as Kubernetes API
participant CM as ConfigMap
participant SC as StorageClass
participant CSI as CSI Controller
participant Scheduler as K8s Scheduler
participant Node as K8s Node
participant CephA as Ceph Cluster A (Zone A)
participant CephB as Ceph Cluster B (Zone B)
User->>K8s: Create cluster topology ConfigMap
K8s-->>User: ConfigMap created
User->>K8s: Create StorageClass with volumeBindingMode: WaitForFirstConsumer
K8s-->>User: StorageClass created
User->>K8s: Create StatefulSet with PVCs using StorageClass
K8s-->>User: StatefulSet created
K8s->>K8s: Create unbound PVCs
Note over K8s, Scheduler: For each pod in StatefulSet
K8s->>Scheduler: Schedule pod
Scheduler->>K8s: Pod assigned to specific node in Zone A
K8s->>CSI: CreateVolumeRequest with selected-node and zone info
CSI->>CSI: pickZoneFromNode() extracts zone from node
CSI->>CM: Get cluster topology configuration
CM-->>CSI: Return topology mapping
CSI->>CSI: Match zone to appropriate Ceph cluster
Note over CSI: Determine that Zone A maps to Ceph Cluster A
CSI->>CephA: Create volume
CephA-->>CSI: Volume created
CSI->>K8s: Create PV with node affinity for Zone A
K8s->>K8s: Bind PVC to PV
K8s->>Node: Start pod with bound volume
Node->>CSI: Stage and publish volume
CSI->>CM: Get cluster info for volume
CM-->>CSI: Return cluster A connection details
CSI->>CephA: Connect to volume
CephA-->>Node: Volume mounted
Note over User,K8s: Later - Update topology (no disruption to existing volumes)
User->>K8s: Update cluster topology ConfigMap
K8s-->>CM: ConfigMap updated
Note over CSI: New volumes use updated topology mapping
```
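For the pickZoneFromNode() step in the diagram: with volumeBindingMode: WaitForFirstConsumer, the external-provisioner passes the selected node's topology labels in the CreateVolumeRequest's accessibility requirements, so the zone can be read from the topology segments. A minimal sketch follows; `pickZoneFromRequest` is a hypothetical helper, not an existing ceph-csi function:

```go
// Sketch of the pickZoneFromNode() step. With WaitForFirstConsumer, the
// external-provisioner places the selected node's topology labels into the
// CreateVolumeRequest, so the zone can be read from the topology segments.
// pickZoneFromRequest is a hypothetical helper, not an existing ceph-csi function.
package topology

import (
	"fmt"

	"github.com/container-storage-interface/spec/lib/go/csi"
)

const zoneLabel = "topology.kubernetes.io/zone"

func pickZoneFromRequest(req *csi.CreateVolumeRequest) (string, error) {
	top := req.GetAccessibilityRequirements()
	if top == nil {
		return "", fmt.Errorf("no accessibility requirements in CreateVolumeRequest")
	}
	// Preferred segments reflect the node the pod was scheduled to; fall back
	// to the requisite segments if no preference is present.
	candidates := append([]*csi.Topology{}, top.GetPreferred()...)
	candidates = append(candidates, top.GetRequisite()...)
	for _, t := range candidates {
		if zone, ok := t.GetSegments()[zoneLabel]; ok {
			return zone, nil
		}
	}
	return "", fmt.Errorf("no %s segment found in topology requirements", zoneLabel)
}
```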
StorageClass (with WaitForFirstConsumer binding, as shown in the diagram):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: csi-rbd-multi-cluster
provisioner: rbd.csi.ceph.com
parameters:
  clusterTopologyConfigMap: ceph-cluster-topology
volumeBindingMode: WaitForFirstConsumer
```
ConfigMap:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: ceph-cluster-topology
  namespace: ceph-csi
data:
  config.json: |
    {
      "clusterTopology": [
        {
          "clusterID": "cluster-a",
          "monitors": "mon1:port,mon2:port,mon3:port",
          "zones": ["us-east-1a", "us-east-1b"],
          "pool": "replicapool",
          "cephfs": {
            "subvolumePath": "/volumes"
          }
        },
        {
          "clusterID": "cluster-b",
          "monitors": "mon4:port,mon5:port,mon6:port",
          "zones": ["us-east-1c", "us-east-1d"],
          "pool": "replicapool",
          "cephfs": {
            "subvolumePath": "/volumes"
          }
        }
      ]
    }
```
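For illustration only, the config.json above could be decoded with structs that mirror the proposed schema; none of these types exist in ceph-csi today, the field tags simply follow the JSON keys shown above:

```go
// Sketch of parsing the proposed config.json; structs mirror the JSON keys
// shown above and are not existing ceph-csi types.
package topology

import (
	"encoding/json"
	"fmt"
)

type CephFSOptions struct {
	SubvolumePath string `json:"subvolumePath"`
}

type ClusterTopology struct {
	ClusterID string        `json:"clusterID"`
	Monitors  string        `json:"monitors"`
	Zones     []string      `json:"zones"`
	Pool      string        `json:"pool"`
	CephFS    CephFSOptions `json:"cephfs"`
}

type TopologyConfig struct {
	ClusterTopology []ClusterTopology `json:"clusterTopology"`
}

// ParseTopologyConfig decodes the config.json key of the ConfigMap.
func ParseTopologyConfig(raw []byte) (*TopologyConfig, error) {
	var cfg TopologyConfig
	if err := json.Unmarshal(raw, &cfg); err != nil {
		return nil, fmt.Errorf("failed to parse cluster topology config: %w", err)
	}
	return &cfg, nil
}
```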
Hey! This looks like an amazing feature that would be useful for some k8s users. Might be worth taking a look at?
@Madhu-1 Don't you think it could be implemented in future versions of the plugin?
@lechugaletal Yes, we could have it. If someone is planning to work on it, it would be a great feature to have in cephcsi.
I think the idea would be to have a mapping between the existing Kubernetes node label for the failure zone, topology.kubernetes.io/zone, and a new label introduced in the definition of Ceph clusters on the ceph-csi side. You proposed naming the ceph-csi label "zones". I would suggest a default value of "*" for no restriction. I like the idea of being able to specify multiple zones per Ceph cluster.
Mounting the PV should be impossible if the zone of the parent Ceph cluster and the zone of the Kubernetes worker node do not match.
However, PV creation should not fail. The way to guarantee that the PV and the pod are located in the same zone is to set volumeBindingMode: WaitForFirstConsumer on the StorageClass; this way the PV is created only after the pod has been scheduled into a zone.
https://kubernetes.io/docs/reference/labels-annotations-taints/#topologykubernetesiozone
This feature exists in some closed-source storage solutions like Portworx (https://docs.portworx.com/portworx-enterprise/operations/operate-kubernetes/cluster-topology), but that is not open source.
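A rough sketch of the mount-time guard described above, assuming the node plugin knows its own zone (e.g. from the topology.kubernetes.io/zone node label) and the zones allowed for the volume's Ceph cluster; all names are illustrative, and "*" follows the no-restriction default suggested above:

```go
// Illustrative mount-time guard: reject staging when the node's zone is not
// one of the zones configured for the volume's Ceph cluster. "*" means no
// restriction. Names are hypothetical, not existing ceph-csi code.
package topology

import "fmt"

func validateZoneForMount(nodeZone string, clusterZones []string) error {
	for _, z := range clusterZones {
		if z == "*" || z == nodeZone {
			return nil
		}
	}
	return fmt.Errorf(
		"node zone %q is not served by this volume's Ceph cluster (allowed zones: %v)",
		nodeZone, clusterZones)
}
```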