
K8SSAND-1180 ⁃ How do we gracefully increase storage capacity via cass-operator while Cass Datacenter, Statefulset etc are in service with incoming workloads

Open mparikhcloudbeds opened this issue 3 years ago • 26 comments

Following the below thread, wanted to get an update: https://community.datastax.com/questions/12269/index.html

Environment:

  • AWS EKS and AWS EBS
  • Cass-Operator : 1.9
  • Server Image : DSE 6.8.18 and/or OSS 3.11.11

┆Issue is synchronized with this Jira Task by Unito ┆friendlyId: K8SSAND-1180 ┆priority: Medium

mparikhcloudbeds avatar Jan 20 '22 12:01 mparikhcloudbeds

Hi, does your PV provider support PVC volume expansion?

burmanm avatar Jan 21 '22 15:01 burmanm

Hi, does your PV provider support PVC volume expansion?

@burmanm - Yes, the storage class that we are using has the following property: allowVolumeExpansion: true
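
For anyone checking the same thing on their own cluster, here is a minimal sketch that reads the allowVolumeExpansion field from a StorageClass. The function name sc_allows_expansion is hypothetical and the storage class name is a placeholder; only standard kubectl get with a jsonpath output is assumed.

```shell
# Sketch: report whether a StorageClass permits volume expansion.
# sc_allows_expansion is a hypothetical helper, not part of cass-operator.
sc_allows_expansion() {
  sc_name="$1"
  # jsonpath prints an empty string when the field is not set.
  sc_allowed="$(kubectl get storageclass "$sc_name" -o jsonpath='{.allowVolumeExpansion}')"
  if [ "$sc_allowed" = "true" ]; then
    echo "StorageClass $sc_name allows volume expansion"
  else
    echo "StorageClass $sc_name does NOT allow volume expansion"
    return 1
  fi
}

# Usage (requires access to a cluster):
#   sc_allows_expansion <storage-class-name>
```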

mparikhcloudbeds avatar Jan 22 '22 20:01 mparikhcloudbeds

@burmanm - Following up to see if there's an update on this?

mparikhcloudbeds avatar Jan 28 '22 12:01 mparikhcloudbeds

Hey, sorry. The process of expanding a PVC with StatefulSets is a bit tricky and involves manual operations (restriction of Kubernetes). Sadly my local instance did not support the feature, but I'll try to create an example shortly with documented steps.

burmanm avatar Jan 28 '22 13:01 burmanm

Thanks @burmanm.

Is this something on the roadmap of the cass-operator project?

mparikhcloudbeds avatar Jan 28 '22 14:01 mparikhcloudbeds

It's a feature we would like to see, but unfortunately it has not been scheduled yet. We have identified the steps to resolve the issue, but it will require a bit of time to implement.

bradfordcp avatar Apr 12 '22 05:04 bradfordcp

Hey, sorry. The process of expanding a PVC with StatefulSets is a bit tricky and involves manual operations (restriction of Kubernetes). Sadly my local instance did not support the feature, but I'll try to create an example shortly with documented steps.

@burmanm Could you provide more details about this? I have a 4-node cluster and its disks are almost full. A workaround is to add nodes to the cluster, after which the data is rebalanced and cleaned up automatically. But that is a waste of CPU and memory resources.

counter2015 avatar May 06 '22 06:05 counter2015

@counter2015 you can easily upgrade your storage manually:

  • set new storage capacity in your PVC
  • restart the cassandra pods one by one

Then your PVCs should automatically get resized by your storage CSI driver.
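
To confirm that the resize actually happened, one option is to compare the requested and actual capacity of each PVC. This is a minimal sketch: the function name pvc_capacity_report is hypothetical, and the label selector is the one used later in this thread for selecting a datacenter's PVCs.

```shell
# Sketch: list requested vs. actual capacity for every PVC of a datacenter.
# pvc_capacity_report is a hypothetical helper name.
# $1 = datacenter name, $2 = namespace
pvc_capacity_report() {
  kubectl get pvc -n "$2" \
    -l "cassandra.datastax.com/datacenter=$1" \
    -o custom-columns='NAME:.metadata.name,REQUESTED:.spec.resources.requests.storage,ACTUAL:.status.capacity.storage'
}

# Usage (requires access to a cluster):
#   pvc_capacity_report dc1 <namespace>
```

If ACTUAL still shows the old size after the restart, the PVC's status conditions (kubectl describe pvc) are the place to look.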

discostur avatar Jun 09 '22 10:06 discostur

@discostur I am not sure whether the PVC capacity will be changed by the operator after I edit the datacenter YAML file. In the end, I increased storage capacity by creating a new datacenter and migrating data from the old dc1 to the new dc2.

counter2015 avatar Jun 09 '22 10:06 counter2015

@counter2015 No, it does not! I edited my datacenter YAML file and nothing changed in the PVC / PV. So I edited the PVC manually and the storage was resized ...

discostur avatar Jun 09 '22 15:06 discostur

The process is actually a bit more involved to do it safely.

First, we need to delete the StatefulSet without deleting the pods. This can be done, for example, with kubectl delete statefulset <name> --cascade=orphan (--cascade=false on older kubectl versions).

Next, make sure that persistentVolumeReclaimPolicy on the PV is set to Retain. Remove the claim reference. Then delete the PVC.

Now go ahead and expand the volume, then update the capacity in the PV spec.

Create a new PVC that will bind to the PV. The name of the PVC needs to be the same as the name of the old one.

Lastly, recreate the StatefulSet. The StatefulSet controller will find the existing PVCs and pods. The StatefulSet will immediately move into the ready state (assuming the pods are ready).
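
The kubectl side of the first few steps can be sketched roughly as below. This is an illustration only and untested against a live cluster: the helper names are hypothetical, all arguments are placeholders, and expanding the volume itself, creating the new PVC, and recreating the StatefulSet are left out because they depend on the storage provider and the manifests in use.

```shell
# Sketch of the preparatory steps; all function names are hypothetical.

# Delete the StatefulSet but leave its pods running.
# $1 = StatefulSet name, $2 = namespace
orphan_delete_statefulset() {
  kubectl delete statefulset "$1" -n "$2" --cascade=orphan
}

# Find the PV that backs a PVC.
# $1 = PVC name, $2 = namespace
pv_for_pvc() {
  kubectl get pvc "$1" -n "$2" -o jsonpath='{.spec.volumeName}'
}

# Make sure the PV is kept when its PVC is deleted.
# $1 = PV name
retain_pv() {
  kubectl patch pv "$1" -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'
}

# Remove the claim reference so a new PVC can bind to the PV.
# $1 = PV name
clear_pv_claim_ref() {
  kubectl patch pv "$1" --type json -p '[{"op":"remove","path":"/spec/claimRef"}]'
}

# Usage (requires access to a cluster):
#   orphan_delete_statefulset <sts-name> <namespace>
#   pv="$(pv_for_pvc <pvc-name> <namespace>)"
#   retain_pv "$pv"
#   clear_pv_claim_ref "$pv"
```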

jsanda avatar Jun 17 '22 03:06 jsanda

@jsanda Is there any risk in editing the PVC size directly?

counter2015 avatar Jun 17 '22 04:06 counter2015

That may work and might be easier than what I prescribed. I would need to do some testing/investigation to be certain.

jsanda avatar Jun 17 '22 04:06 jsanda

prometheus-operator (which also uses a StatefulSet for its Prometheus pods) documents this approach. But it does not work for k8ssandra because of the admission webhook:

admission webhook "vcassandradatacenter.kb.io" denied the request: CassandraDatacenter write rejected, attempted to change storageConfig

My example:

k8ssandracluster:

apiVersion: k8ssandra.io/v1alpha1
kind: K8ssandraCluster
metadata:
  name: demo
spec:
  cassandra:
    serverVersion: "4.0.3"
    serverImage: k8ssandra/cass-management-api:4.0.3
    telemetry:
      prometheus:
        enabled: true
    storageConfig:
      cassandraDataVolumeClaimSpec:
        storageClassName: gp3-multizone
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: 50Gi
    config:
      jvmOptions:
        heapSize: 512M
    datacenters:
      - metadata:
          name: dc1
        size: 9
        racks:
          - name: r1
            nodeAffinityLabels:
              onairent.live/node-type: cassandra-node
              topology.kubernetes.io/zone: eu-north-1a
          - name: r2
            nodeAffinityLabels:
              onairent.live/node-type: cassandra-node
              topology.kubernetes.io/zone: eu-north-1b
          - name: r3
            nodeAffinityLabels:
              onairent.live/node-type: cassandra-node
              topology.kubernetes.io/zone: eu-north-1c
  • Change storage to 150Gi
  • Apply changed manifest
  • Patch PVCs
for p in $(kubectl get pvc -l cassandra.datastax.com/datacenter=dc1 -o jsonpath='{range .items[*]}{.metadata.name} {end}'); do \
  kubectl patch pvc/${p} --patch '{"spec": {"resources": {"requests": {"storage":"150Gi"}}}}'; \
done
  • Delete statefulsets
kubectl delete statefulset -l cassandra.datastax.com/datacenter=dc1 --cascade=orphan

After that, no changes are applied to the Cassandra cluster because of the error mentioned above. Even if I try to resize my cluster, I get the error and nothing happens.

okgolove avatar Jan 05 '23 10:01 okgolove

I'm having the same issue described in the previous comment. What is the procedure for increasing storage capacity in this case?

adziura-ledger avatar May 30 '23 13:05 adziura-ledger

I have directly edited the PVCs and restarted the pods in my test environment. Nothing is broken: I can see the new size reflected in the PVCs and can access the test data.

chandapukiran avatar May 31 '23 05:05 chandapukiran

Check the prometheus-operator resizing manual. It works fine for k8ssandra as well.

okgolove avatar May 31 '23 07:05 okgolove

@okgolove I tried the steps from the prometheus-operator resizing manual and they worked for me, but when I deleted the cluster and tried again on a new cluster, it threw the error you mentioned above. Is it still working for you?

Error from server (CassandraDatacenter write rejected, attempted to change storageConfig.CassandraDataVolumeClaimSpec): admission webhook "vcassandradatacenter.kb.io" denied the request: CassandraDatacenter write rejected, attempted to change storageConfig.CassandraDataVolumeClaimSpec

chandapukiran avatar Jun 07 '23 11:06 chandapukiran

@chandapukiran Did you change the storage size in the cluster manifest before recreating?

okgolove avatar Jun 07 '23 11:06 okgolove

@okgolove No. Basically, I created a cluster with the default size and later tried to change the size by modifying the cass object.

chandapukiran avatar Jun 07 '23 12:06 chandapukiran

@chandapukiran Ahh, yes. The admission webhook won't let you make this change. I disabled it temporarily and then made the modification.

okgolove avatar Jun 07 '23 12:06 okgolove

@okgolove Oh, OK. Could you share the commands to disable/enable the admission webhook?

chandapukiran avatar Jun 07 '23 12:06 chandapukiran

@chandapukiran How did you install the operator? If via the Helm chart, then just set

cass-operator:
  admissionWebhooks:
    enabled: false

Or just delete the admission webhook via kubectl.
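
A rough sketch of the kubectl route follows. The helper names are hypothetical, and the name of the ValidatingWebhookConfiguration object depends on how the operator was installed, so it is looked up first rather than assumed. The webhook named in the error above (vcassandradatacenter.kb.io) should appear in the WEBHOOKS column of the matching entry.

```shell
# Sketch; both function names are hypothetical helpers.

# List validating webhook configurations together with the webhooks they contain.
list_validating_webhooks() {
  kubectl get validatingwebhookconfigurations \
    -o custom-columns='NAME:.metadata.name,WEBHOOKS:.webhooks[*].name'
}

# Delete one validating webhook configuration by name.
# $1 = name taken from the NAME column above
delete_validating_webhook() {
  kubectl delete validatingwebhookconfiguration "$1"
}

# Usage (requires access to a cluster):
#   list_validating_webhooks
#   delete_validating_webhook <webhook-configuration-name>
```

Deleting the object removes the validation until it is installed again, for example by redeploying the operator with the webhook enabled.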

okgolove avatar Jun 07 '23 12:06 okgolove

Thanks @okgolove, I see it is already disabled in my Helm chart, and I now understand why it worked for me before but not now. I was playing with k8ssandra-operator in another namespace and that was causing the issue. Now I am good.

chandapukiran avatar Jun 07 '23 13:06 chandapukiran

Adding the exact steps to be followed for quick reference:

  • Disable admissionWebhooks in the operator and redeploy it. With the Helm chart, this means setting cass-operator.admissionWebhooks.enabled to false.
  • Stop the required datacenters and set the new volume size in the K8ssandraCluster: set the stopped: true flag on each of the required datacenters in the datacenters list, then apply the YAML file with kubectl apply -f <file>.
  • Manually edit the PVC of each node in the cluster to the required size, for example with kubectl edit pvc <pvc-name> -n <namespace>, changing the size in the spec section.
  • Delete the underlying StatefulSet using the orphan deletion strategy: kubectl delete statefulset <sts-name> -n <namespace> --cascade=orphan
  • Remove the stopped flag from the K8ssandraCluster YAML file and apply the changes to restart the stopped datacenters.
  • Re-enable admissionWebhooks in the operator and redeploy it.
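
Once the datacenters are running again, it may be worth confirming that each PVC reports the new size. Below is a minimal polling sketch; wait_for_pvc_capacity is a hypothetical helper, the retry count and interval are arbitrary defaults, and it assumes the cluster reports capacity in the same form as the requested value (for example 150Gi).

```shell
# Sketch: poll a PVC until its reported capacity matches the expected size.
# wait_for_pvc_capacity is a hypothetical helper name.
# $1 = PVC name, $2 = namespace, $3 = expected size,
# $4 = number of polls (default 30), $5 = seconds between polls (default 10)
wait_for_pvc_capacity() {
  pvc_name="$1"; pvc_ns="$2"; pvc_want="$3"
  pvc_tries="${4:-30}"
  pvc_interval="${5:-10}"
  pvc_i=0
  while [ "$pvc_i" -lt "$pvc_tries" ]; do
    pvc_have="$(kubectl get pvc "$pvc_name" -n "$pvc_ns" -o jsonpath='{.status.capacity.storage}')"
    if [ "$pvc_have" = "$pvc_want" ]; then
      echo "$pvc_name is now $pvc_have"
      return 0
    fi
    pvc_i=$((pvc_i + 1))
    sleep "$pvc_interval"
  done
  echo "$pvc_name still reports ${pvc_have:-unknown}, expected $pvc_want" >&2
  return 1
}

# Usage (requires access to a cluster):
#   wait_for_pvc_capacity <pvc-name> <namespace> 150Gi
```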

surajk94 avatar Jun 13 '23 12:06 surajk94

Implementation ticket: #602

burmanm avatar Dec 19 '23 08:12 burmanm