scylla-operator
Automatically replace Scylla node that loses storage
Is this a bug report or feature request?
- Feature Request
What should the feature do: When a node is decommissioned, its local storage is lost as well. Recovery currently requires a manual action: annotating the service to trigger a replacement. Otherwise the new pod gets stuck on join, because it doesn't have replace-address-first-boot set.
What is the use case behind this feature: Stability. The operator should be able to run Scylla without any user intervention.
Additional Information: One option is to write a file from an init container, check for its presence, and generate replace-address-first-boot accordingly. There may be a more sophisticated way to get the same information directly from Scylla.
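The marker-file option could look roughly like the sketch below. It is only an illustration under stated assumptions, not how the operator works today: the marker file name, the function names, and the `previously_joined_ip` input are all hypothetical. In particular, it assumes the operator can tell from outside the volume (for example, from state it records on the member service) whether this member has already joined the cluster and under which address, because a missing marker alone cannot distinguish a first boot from lost storage.

```python
from pathlib import Path
from typing import Optional

# Hypothetical marker file name; nothing in the operator defines this.
MARKER_NAME = ".scylla-operator-bootstrapped"


def mark_bootstrapped(data_dir: Path) -> None:
    """Record on the data volume that this member has bootstrapped."""
    (data_dir / MARKER_NAME).touch()


def replace_address_arg(data_dir: Path,
                        previously_joined_ip: Optional[str]) -> Optional[str]:
    """Return a replace-address-first-boot argument if storage looks lost.

    previously_joined_ip is an assumed input: the address this member used
    when it last joined the cluster, or None if it never joined.
    """
    if (data_dir / MARKER_NAME).exists():
        # The data volume survived the restart; start normally.
        return None
    if previously_joined_ip is None:
        # No marker and no prior membership: a genuine first boot.
        return None
    # No marker, but the member joined before: the volume was lost,
    # so the new pod has to replace the old node.
    return f"--replace-address-first-boot={previously_joined_ip}"
```

An init container would call `replace_address_arg` before Scylla starts, and `mark_bootstrapped` would run once the node has finished joining. Requiring a previously joined address before emitting the argument is deliberate: it avoids asking Scylla to replace a node that never made it into gossip.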
The case of a Kubernetes node being decommissioned is covered by AutomaticOrphanedNodeCleanup, although there is a race: if the Scylla node hasn't been bootstrapped yet, the replacement gets stuck because Scylla won't replace a node that's not in gossip.
There are other cases, though. AWS EC2 instances lose their local disks when the instance is stopped or hibernated, or when a local drive fails. The same goes for GCP. Those cases won't result in a Kubernetes node removal, so AutomaticOrphanedNodeCleanup won't help here, and the node can get stuck, needing manual intervention to trigger a node replacement.
The Scylla Operator project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 30d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out
/lifecycle stale
/remove-lifecycle stale
/triage accepted