kubedirector
support "pause" of kdcluster
We should support a boolean "switch" in the kdcluster spec that can be used to pause/stop the kdcluster. The effect of stopping would be:
- all role statefulsets scaled down to 0, regardless of what the kdapp says is legal
- member PVCs are NOT deleted
And the effect of restarting would be:
- all roles scaled back up to their spec'd statefulset count
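The stop/restart behavior above comes down to picking a replica target for each role's statefulset. Here is a minimal Go sketch of that rule. It assumes a hypothetical `Paused` field on the kdcluster spec and pared-down `Role`/`ClusterSpec` types; these are illustrative only and are not kubedirector's actual API types.

```go
package main

import "fmt"

// Role is a hypothetical, pared-down view of a kdcluster role spec.
type Role struct {
	ID      string
	Members int32 // member count requested in the kdcluster spec
}

// ClusterSpec is hypothetical; Paused stands in for the proposed boolean switch.
type ClusterSpec struct {
	Paused bool
	Roles  []Role
}

// desiredReplicas returns the replica count each role's statefulset should be
// reconciled to. When paused, every role goes to 0 regardless of what the
// kdapp says is legal; when not paused, each role returns to its spec'd count.
// It only computes numbers and deletes nothing, so PVCs are left in place.
func desiredReplicas(spec ClusterSpec) map[string]int32 {
	out := make(map[string]int32, len(spec.Roles))
	for _, r := range spec.Roles {
		if spec.Paused {
			out[r.ID] = 0
		} else {
			out[r.ID] = r.Members
		}
	}
	return out
}

func main() {
	spec := ClusterSpec{Roles: []Role{{ID: "controller", Members: 1}, {ID: "worker", Members: 3}}}
	fmt.Println(desiredReplicas(spec)) // map[controller:1 worker:3]
	spec.Paused = true
	fmt.Println(desiredReplicas(spec)) // map[controller:0 worker:0]
}
```

Keeping the spec'd member counts untouched while paused is what makes restart trivial: unpausing simply reconciles back to the numbers that were there all along. Kubernetes also retains PVCs created from a StatefulSet's volumeClaimTemplates when the StatefulSet scales down, unless a PVC retention policy says otherwise.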
While stopped, some things to consider:
- How should this affect the reported member states, member rollup status, and/or overall kdcluster state?
- Having 0-size statefulsets while members still exist will violate some assumptions in the code, so we'll need to handle that with care.
- We will also need to have some extra validation and/or behavior on kdcluster spec edits. For example it would simplify things to disallow changes in role member counts while stopped.
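One way to express the "no member count changes while stopped" rule is as an admission-style check comparing the old and new spec. This Go sketch reuses the same hypothetical `Role`/`ClusterSpec` shapes (not kubedirector's real types) and treats an edit as "while paused" if either the old or the new spec is paused. That is one possible policy choice, and it also rejects flipping the switch and resizing in the same update.

```go
package main

import (
	"errors"
	"fmt"
)

// Role and ClusterSpec are hypothetical, pared-down kdcluster types.
type Role struct {
	ID      string
	Members int32
}

type ClusterSpec struct {
	Paused bool
	Roles  []Role
}

// validateMemberCounts rejects any change to the set of roles or to a role's
// member count when the cluster is paused before or after the edit.
func validateMemberCounts(oldSpec, newSpec ClusterSpec) error {
	if !oldSpec.Paused && !newSpec.Paused {
		return nil // normal (unpaused) edits are validated elsewhere
	}
	if len(newSpec.Roles) != len(oldSpec.Roles) {
		return errors.New("roles cannot be added or removed while paused")
	}
	oldCounts := make(map[string]int32, len(oldSpec.Roles))
	for _, r := range oldSpec.Roles {
		oldCounts[r.ID] = r.Members
	}
	for _, r := range newSpec.Roles {
		prev, ok := oldCounts[r.ID]
		if !ok {
			return fmt.Errorf("role %q cannot be added while paused", r.ID)
		}
		if prev != r.Members {
			return fmt.Errorf("role %q member count cannot change while paused (%d -> %d)", r.ID, prev, r.Members)
		}
	}
	return nil
}

func main() {
	paused := ClusterSpec{Paused: true, Roles: []Role{{ID: "worker", Members: 3}}}
	resized := ClusterSpec{Paused: true, Roles: []Role{{ID: "worker", Members: 5}}}
	fmt.Println(validateMemberCounts(paused, resized))
	// role "worker" member count cannot change while paused (3 -> 5)
}
```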
Of course, if none of the roles in a kdcluster use PVCs, this feature isn't terribly useful, since you could just delete the kdcluster and then re-create it later. But there's no particular reason to block this feature on such kdclusters.
Other considerations:
- Should we allow the kdcluster to be initially created in a paused state? What about pausing it while it is in the middle of some other reconfiguration? It would simplify things if we could say that a pause is not allowed until/unless the kdcluster is in a configured state. (This is similar to the discussion about when live upgrades are allowed.)
- Do we need a lifecycle event for this in the app startscripts? Basically, "I just woke you back up." Maybe not, since it should effectively be the same as any pod restart.
- Speaking of which, how concerned do we need to be about having a graceful/coordinated pause? Some apps could get quite upset if all their pods go down at once (and they may have specific scheduling affinities to try to avoid this). Does an app need to declare whether or not it is pause-able?
I think we can actually lift a lot of the ideas/decisions from the live-upgrade feature to answer those questions. I.e., only allow pause for a stable, configured kdcluster and don't allow other changes while paused; let kdapps declare whether they are pausable, but old kdapps (which haven't declared this) should still be editable to set "pausable=true" even while in use.
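Those two decisions could be sketched as a pair of validation checks. In this Go sketch, the `AppSpec.Pausable` field, the `"configured"` state string, and both function names are assumptions for illustration, not existing kubedirector API. `Pausable` is a pointer so that an old kdapp that never declared the field (`nil`) can be told apart from one that explicitly said `false`. The second check also assumes, as the proposal implies, that in-use kdapps otherwise reject changes to this field.

```go
package main

import (
	"errors"
	"fmt"
)

// AppSpec is a hypothetical kdapp view; nil Pausable means "never declared".
type AppSpec struct {
	Pausable *bool
}

// validatePauseRequest is a hypothetical check for a kdcluster update that
// flips the pause switch on: the cluster must be in a stable configured state
// and its kdapp must explicitly declare pausable=true.
func validatePauseRequest(clusterState string, app AppSpec) error {
	if clusterState != "configured" { // assumed name of the stable state
		return fmt.Errorf("pause not allowed while kdcluster state is %q", clusterState)
	}
	if app.Pausable == nil || !*app.Pausable {
		return errors.New("kdapp does not declare pausable=true")
	}
	return nil
}

// validatePausableEdit is a hypothetical check for kdapp edits. While the
// kdapp is in use, the only allowed change to Pausable is undeclared -> true.
func validatePausableEdit(oldApp, newApp AppSpec, inUse bool) error {
	if !inUse {
		return nil
	}
	switch {
	case oldApp.Pausable == nil && newApp.Pausable == nil:
		return nil // unchanged, still undeclared
	case oldApp.Pausable == nil && *newApp.Pausable:
		return nil // the "old kdapp opts in" case
	case oldApp.Pausable != nil && newApp.Pausable != nil && *oldApp.Pausable == *newApp.Pausable:
		return nil // unchanged declared value
	default:
		return errors.New("pausable can only change from undeclared to true while the kdapp is in use")
	}
}

func boolPtr(b bool) *bool { return &b }

func main() {
	app := AppSpec{Pausable: boolPtr(true)}
	fmt.Println(validatePauseRequest("configured", app))     // <nil>
	fmt.Println(validatePauseRequest("updating", app) != nil) // true
	fmt.Println(validatePausableEdit(AppSpec{}, app, true))   // <nil>
}
```

Treating "undeclared" as not pausable keeps existing kdapps safe by default, while the one-way opt-in edit lets an app author bless an in-use kdapp without recreating it.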
Not sure about the lifecycle event but I'm inclined to not have that for now.
Going to see if I can look at this for a near-term release like 0.11.0.