kubedirector
setup sequencing enhancements
Currently, if multiple roles are deployed/resized at the same time, we will handle the roles (run startscripts in their members) in some arbitrary order.
The app might have a stronger requirement like "members in the controller role must be set up before members in the worker role".
This sort of setup ordering can be handled in different ways, especially when we provide an optional in-container agent as per issue #35. However, there could be some simple role-setup-sequencing rules expressed in the app CR so that the agent (or other in-container mechanisms) doesn't have to get involved.
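As a rough sketch of what such rules could reduce to on the operator side: if the app CR carried, per role, a list of roles that must be set up first, the handler could derive a setup order with a topological sort. The `after` map and the `setupOrder` function are hypothetical names for illustration; nothing like them exists in the app CR today.

```go
package main

import (
	"fmt"
	"sort"
)

// setupOrder returns role IDs in an order that respects the given rules.
// The rules map is an assumption for this sketch: it maps a role ID to the
// role IDs that must finish setup before it. Role IDs are assumed unique.
func setupOrder(roles []string, after map[string][]string) ([]string, error) {
	known := make(map[string]bool, len(roles))
	for _, r := range roles {
		known[r] = true
	}
	pending := make(map[string]int, len(roles)) // unmet prerequisites per role
	dependents := make(map[string][]string)     // prerequisite -> roles waiting on it
	for _, r := range roles {
		for _, dep := range after[r] {
			if !known[dep] {
				return nil, fmt.Errorf("role %q depends on unknown role %q", r, dep)
			}
			pending[r]++
			dependents[dep] = append(dependents[dep], r)
		}
	}
	var ready []string
	for _, r := range roles {
		if pending[r] == 0 {
			ready = append(ready, r)
		}
	}
	sort.Strings(ready) // deterministic order among unconstrained roles
	var order []string
	for len(ready) > 0 {
		r := ready[0]
		ready = ready[1:]
		order = append(order, r)
		for _, d := range dependents[r] {
			pending[d]--
			if pending[d] == 0 {
				ready = append(ready, d)
			}
		}
		sort.Strings(ready)
	}
	if len(order) != len(roles) {
		return nil, fmt.Errorf("setup rules contain a cycle")
	}
	return order, nil
}

func main() {
	order, err := setupOrder(
		[]string{"worker", "controller", "gateway"},
		map[string][]string{"worker": {"controller"}},
	)
	fmt.Println(order, err)
}
```

Rejecting cycles up front matters here, since a cyclic rule set would be the declarative version of the startscript deadlock discussed below.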
Also, one thing we need to be aware of with in-container mechanisms is that if a startscript waits on a startscript in some other role, that can deadlock. We wait on a script's response when we execute it, and syncMembers is not parallel at the role level (issue #41). We need to do one of the following:
- Declare that startscripts should never wait, and treat them more like a reconciliation handler that can be called multiple times until it succeeds. This diverges pretty significantly from the current model and is harder for the app author to think about.
- Parallelize syncMembers more (and mutex protect calls to notifyReadyNodes) so that different roles can be handled at the same time. This one looks more promising.
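To make the second option concrete, here is a minimal sketch of per-role parallelism with a mutex around the notification step. It is not the actual syncMembers code: `notifier`, `syncRolesInParallel`, and `runDemo` are hypothetical stand-ins, and the `setup` callback just simulates a worker startscript that blocks until the controller's startscript has run.

```go
package main

import (
	"fmt"
	"sync"
)

// notifier stands in for whatever shared state notifyReadyNodes touches
// (hypothetical; the real function is not shown in this issue).
type notifier struct {
	mu    sync.Mutex
	ready []string
}

// notifyReady is safe to call from several role goroutines at once.
func (n *notifier) notifyReady(member string) {
	n.mu.Lock()
	defer n.mu.Unlock()
	n.ready = append(n.ready, member)
}

// syncRolesInParallel handles each role in its own goroutine, so a script
// in one role that waits on another role cannot block that other role.
func syncRolesInParallel(roles map[string][]string,
	setup func(role, member string) error, n *notifier) map[string]error {
	var wg sync.WaitGroup
	var errMu sync.Mutex
	errs := make(map[string]error)
	for role, members := range roles {
		wg.Add(1)
		go func(role string, members []string) {
			defer wg.Done()
			for _, m := range members {
				if err := setup(role, m); err != nil {
					errMu.Lock()
					errs[role] = err
					errMu.Unlock()
					return
				}
				n.notifyReady(m)
			}
		}(role, members)
	}
	wg.Wait()
	return errs
}

// runDemo simulates worker scripts that wait for the controller script.
func runDemo() ([]string, map[string]error) {
	controllerUp := make(chan struct{})
	var once sync.Once
	setup := func(role, member string) error {
		if role == "controller" {
			once.Do(func() { close(controllerUp) })
			return nil
		}
		<-controllerUp // would block forever if roles were handled serially, worker first
		return nil
	}
	n := &notifier{}
	errs := syncRolesInParallel(map[string][]string{
		"controller": {"c0"},
		"worker":     {"w0", "w1"},
	}, setup, n)
	return n.ready, errs
}

func main() {
	ready, errs := runDemo()
	fmt.Println(len(ready), len(errs))
}
```

Members within a role are still handled one after another in this sketch, so a wait between two members of the same role could still hang; only cross-role waits are covered.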
We don't necessarily need to tackle the app-defined rules in 0.2.0, but the deadlock potential does need to be addressed. It seems very unlikely to happen until we introduce the agent and inter-member setup sync facilities, but let's head it off at the pass.
Rather than parallelize syncMembers yet, we could instead move to using fire-and-forget startscripts. Start them running nohup'ed inside the container, then just come back around to check on their results on later handler invocations. Along with solving any potential startscript sequencing deadlocks, this would also allow users to request a resize during a long-running script.
Edit: can't actually request a resize for reasons described in PR #62. However, fairly importantly, you can delete the virtual cluster.
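A rough sketch of the fire-and-forget flow, assuming a small per-member state that the handler persists between invocations. The `scriptRunner` interface, the state names, and `fakeRunner` are all hypothetical; a real runner would exec into the container, start the script with something like nohup in the background, and later read back a recorded exit status.

```go
package main

import "fmt"

type scriptState int

const (
	notStarted scriptState = iota
	running
	succeeded
	failed
)

// scriptRunner is a hypothetical abstraction over "exec in the container".
type scriptRunner interface {
	// Start launches the script detached and returns immediately.
	Start(member string) error
	// Poll reports whether the script has exited and, if so, its exit code.
	Poll(member string) (done bool, exitCode int, err error)
}

// reconcileMember performs one non-blocking step per handler invocation.
func reconcileMember(r scriptRunner, member string, state scriptState) (scriptState, error) {
	switch state {
	case notStarted:
		if err := r.Start(member); err != nil {
			return notStarted, err // try again on the next invocation
		}
		return running, nil
	case running:
		done, code, err := r.Poll(member)
		if err != nil || !done {
			return running, err // check again on the next invocation
		}
		if code != 0 {
			return failed, nil
		}
		return succeeded, nil
	default:
		return state, nil // terminal states are left alone
	}
}

// fakeRunner pretends each script needs two polls before it has exited.
type fakeRunner struct {
	pollsLeft map[string]int
	exitCode  int
}

func (f *fakeRunner) Start(member string) error {
	f.pollsLeft[member] = 2
	return nil
}

func (f *fakeRunner) Poll(member string) (bool, int, error) {
	if f.pollsLeft[member] > 0 {
		f.pollsLeft[member]--
		return false, 0, nil
	}
	return true, f.exitCode, nil
}

// passesUntilDone simulates repeated handler invocations, up to maxPasses.
func passesUntilDone(r scriptRunner, member string, maxPasses int) (int, scriptState) {
	state := notStarted
	passes := 0
	for passes < maxPasses && (state == notStarted || state == running) {
		state, _ = reconcileMember(r, member, state)
		passes++
	}
	return passes, state
}

func main() {
	r := &fakeRunner{pollsLeft: map[string]int{}}
	passes, state := passesUntilDone(r, "member-0", 10)
	fmt.Println(passes, state == succeeded)
}
```

Because each pass returns right away, the handler is free between passes to act on other requests, such as deleting the virtual cluster while a long script is still running.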
Going to split the asynch-startscript stuff into a couple of independent issues.
OK the remaining mission for this issue is:
Currently, if multiple roles are deployed/resized at the same time, we will handle the roles (run startscripts in their members) in some arbitrary order.
The app might have a stronger requirement like "members in the controller role must be set up before members in the worker role".
This sort of setup ordering can be handled in different ways, especially when we provide an optional in-container agent as per issue #35. However, there could be some simple role-setup-sequencing rules expressed in the app CR so that the agent (or other in-container mechanisms) doesn't have to get involved.
I don't think that part is a blocker for the 0.2.0 milestone.