scylla-operator Automatically replace scylla node that loses storage

Open tnozicka opened this issue 3 years ago • 3 comments

Is this a bug report or feature request?

Feature Request

What should the feature do: When a node is decommissioned, its local storage is lost as well. Recovering currently requires a manual action, annotating the service to trigger replacement; otherwise the new pod gets stuck joining the cluster because it doesn't have replace-address-first-boot set.

What is the use case behind this feature: Stability - the operator should be able to run Scylla without any user intervention.

Additional Information: One option is to write a file from an init container, check for its presence on later starts, and generate replace-address-first-boot when it is missing. Maybe there is a more sophisticated way to get the same information directly from Scylla.
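A rough sketch of the marker-file idea, in Python for brevity. Everything named here is an assumption made for illustration: the marker file name, the `was_bootstrapped` signal (which would have to be recorded somewhere that survives the loss of the local disk), and the way the old address is passed in. Only the replace-address-first-boot option itself comes from the discussion above.

```python
from pathlib import Path
from typing import Optional

# Hypothetical marker file name; it would live on the node's local storage.
MARKER_NAME = ".scylla-bootstrapped"


def replace_flag(data_dir: Path, was_bootstrapped: bool, old_address: str) -> Optional[str]:
    """Return a replace flag if the local storage appears to be lost, else None.

    `was_bootstrapped` is assumed to come from a record kept outside the
    local disk, so it is still available after the disk is wiped.
    """
    marker = data_dir / MARKER_NAME
    if marker.exists():
        # The marker survived, so the local storage is intact: start normally.
        return None
    if was_bootstrapped:
        # The member joined before but its marker is gone: storage was lost.
        return f"--replace-address-first-boot={old_address}"
    # First boot of a brand-new member: there is nothing to replace.
    return None


def write_marker(data_dir: Path) -> None:
    """Create the marker file, as an init container might do."""
    data_dir.mkdir(parents=True, exist_ok=True)
    (data_dir / MARKER_NAME).touch()
```

The marker alone cannot tell a first boot apart from lost storage, which is why the sketch needs the extra `was_bootstrapped` input. When to write the marker (before or after a successful join) is left open here and would need care to avoid the race with restarts during bootstrap.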

Mar 17 '21 11:03 tnozicka

The case of a Kubernetes node being decommissioned is covered by AutomaticOrphanedNodeCleanup, although there is a race: if the Scylla node hasn't been bootstrapped yet, the replacement gets stuck because Scylla won't replace a node that's not in gossip.

There are other cases, though. AWS EC2 instances lose local disks when the instance is stopped or hibernated, or when a local drive fails. The same goes for GCP. Those cases won't result in a Kubernetes node removal, so AutomaticOrphanedNodeCleanup won't help here and the pod can get stuck, needing manual intervention to trigger a node replacement.

Jul 21 '21 10:07 tnozicka

The Scylla Operator project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 30d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle stale
Close this issue with /close
Offer to help out

/lifecycle stale

Jun 26 '24 10:06 scylla-operator-bot[bot]

/remove-lifecycle stale
/triage accepted

Jun 26 '24 15:06 tnozicka
