
Automatic replacement of failed nodes

Open · JohnStrunk opened this issue 6 years ago · 0 comments

Describe the feature you'd like to have. When a Gluster pod fails, Kubernetes will attempt to restart it; if the failure was a simple crash or other transient problem, that restart (plus the automatic heal) should be sufficient to repair the system. However, if the node's state becomes corrupt or is lost, it may be necessary to remove the failed node from the cluster and potentially spawn a new one to take its place.
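A rough sketch of what this remediation flow could look like in the operator's reconcile logic is below. Every name here (GlusterNode, gd2Client, MigrateBricks, PeerDetach, permissibleDowntime, destroyNode) is a hypothetical placeholder rather than an actual anthill or GD2 API; the point is only to illustrate the intended sequence of wait, migrate, detach, destroy.

```go
package anthill

import "time"

// Hypothetical types for illustration; the real anthill CRDs and GD2 client differ.
type GlusterNode struct {
	Name                   string
	ExcludeFromReplacement bool // per-node opt-out set via the CR (see acceptance criteria)
}

// gd2Client stands in for the (not yet implemented) GD2 automigration plugin.
type gd2Client interface {
	MigrateBricks(node string) error // move the node's bricks to healthy peers
	PeerDetach(node string) error    // remove the node from the trusted storage pool
}

type Reconciler struct {
	gd2                 gd2Client
	permissibleDowntime time.Duration // configurable grace period before replacement
}

// handleOfflineNode decides whether an offline Gluster node should be replaced.
func (r *Reconciler) handleOfflineNode(node GlusterNode, offline time.Duration) error {
	// Kubernetes restarting the pod is the first line of defense; do nothing
	// until the node has been offline longer than the permissible downtime.
	if offline < r.permissibleDowntime {
		return nil
	}
	// Honor the per-node "do not replace" marker (e.g., scheduled maintenance).
	if node.ExcludeFromReplacement {
		return nil
	}
	// Have GD2 migrate the bricks, then detach the abandoned node from the
	// TSP and destroy its backing resources so a replacement can be spawned.
	if err := r.gd2.MigrateBricks(node.Name); err != nil {
		return err
	}
	if err := r.gd2.PeerDetach(node.Name); err != nil {
		return err
	}
	return r.destroyNode(node)
}

// destroyNode is a placeholder for deleting the pod and its per-node resources.
func (r *Reconciler) destroyNode(node GlusterNode) error {
	return nil
}
```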

What is the value to the end user? (why is it a priority?) If a Gluster node (pod) remains offline, the associated bricks will have reduced availability and reliability. Being able to automatically repair such failures will help increase system availability and protect users' data.

How will we know we have a good solution? (acceptance criteria)

  • Kubernetes will act as the first line of defense, restarting failed Gluster pods.
  • A Gluster pod that remains offline from the Gluster cluster for an extended period of time will have its bricks moved to other Gluster nodes (by GD2). The permissible downtime should be configurable.
  • Gluster nodes that have been "abandoned" by GD2 should be removed from the trusted storage pool (TSP) and destroyed by the operator.
  • Ability to mark a node via the CR such that it will not be subject to replacement (neither abandonment by GD2 nor destruction by the operator). This is necessary in cases where a Gluster node is expected to be temporarily unavailable (e.g., scheduled downtime or other maintenance).
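As a strawman for the configurable downtime and the per-node "do not replace" marker above, the CR spec could grow fields along these lines. The field names and layout are purely illustrative and not a committed anthill API.

```go
package anthill

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// GlusterClusterSpec sketches the knobs implied by the acceptance criteria.
type GlusterClusterSpec struct {
	// PermissibleDowntime is how long a node may remain offline before its
	// bricks are migrated and the node is replaced (e.g., "30m").
	PermissibleDowntime metav1.Duration `json:"permissibleDowntime,omitempty"`

	// Nodes carries per-node overrides.
	Nodes []GlusterNodeSpec `json:"nodes,omitempty"`
}

type GlusterNodeSpec struct {
	Name string `json:"name"`

	// ExcludeFromReplacement marks a node that must not be abandoned by GD2
	// nor destroyed by the operator (e.g., during scheduled maintenance).
	ExcludeFromReplacement bool `json:"excludeFromReplacement,omitempty"`
}
```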

Additional context This relies on the node state machine (#17) and an as-yet-unimplemented GD2 automigration plugin.

JohnStrunk · Jun 27 '18 18:06