cstor-operators

feature request: Automatic Node Deletion Detection and Migration of Volume Replicas

Open jayheinlein opened this issue 3 years ago • 0 comments

Ticket requested by @niladrih via Slack

When using the OpenEBS cStor operators, it would be very useful if, after an unexpected node failure, the operators identified the failed node and automatically moved the replicas attached to its CSPI to other, valid nodes. The cluster could then keep functioning on its own, and an engineer could remove and replace the failed node later. This likely involves several steps:

  1. Identification of failed node(s)
  2. Modification of cStor objects to remove and replace bad CSPI references
  3. Preparation of CSPI to be deleted in the future

From my testing, manual recovery from this case is complicated by the fact that removing the old CSPI references does not update the CSPI object, because the pool pod is no longer running once its node is gone. I had to manually remove some finalizers and objects to confirm that all references to the dead CSPI were gone. Then I had to modify the CSPI itself to remove its replica counts so that the CSPC would allow the CSPI to be deleted from the object. Ultimately this all worked, but it was certainly not an ideal situation.

Environment: Kubernetes 1.17.12, OpenEBS 2.7.0, cStor CSI/CSPC Operators

jayheinlein, Mar 18 '21 20:03