setup sequencing enhancements

Open joel-bluedata opened this issue 6 years ago • 3 comments

Currently, if multiple roles are deployed/resized at the same time, we will handle the roles (run startscripts in their members) in some arbitrary order.

The app might have a stronger requirement like "members in the controller role must be set up before members in the worker role".

This sort of setup ordering can be handled in different ways, especially once we provide an optional in-container agent as per issue #35. However, there could be some simple role-setup-sequencing rules expressed in the app CR so that the agent (or any other in-container mechanism) doesn't have to get involved.
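As a rough illustration of what "simple role-setup-sequencing rules" could reduce to, here is a minimal sketch that turns per-role "set up after" rules into a setup order with a topological sort. The `RoleSetupRule` type, its `After` field, and `setupOrder` are hypothetical names invented for this example; nothing like them is defined in the app CR today.

```go
package main

import (
	"fmt"
	"sort"
)

// RoleSetupRule is a hypothetical app-CR fragment: role Name must not begin
// setup until every role listed in After has finished setup.
type RoleSetupRule struct {
	Name  string
	After []string
}

// setupOrder returns the role names in an order that satisfies every rule.
// It returns an error if a rule names an unknown role or the rules form a cycle.
func setupOrder(rules []RoleSetupRule) ([]string, error) {
	pending := make(map[string]int, len(rules)) // role -> unmet prerequisites
	dependents := make(map[string][]string)     // role -> roles waiting on it
	for _, r := range rules {
		pending[r.Name] = 0
	}
	for _, r := range rules {
		for _, dep := range r.After {
			if _, ok := pending[dep]; !ok {
				return nil, fmt.Errorf("role %q depends on unknown role %q", r.Name, dep)
			}
			pending[r.Name]++
			dependents[dep] = append(dependents[dep], r.Name)
		}
	}

	// Start with the roles that have no prerequisites; sort for a stable result.
	var ready []string
	for name, n := range pending {
		if n == 0 {
			ready = append(ready, name)
		}
	}
	sort.Strings(ready)

	order := make([]string, 0, len(pending))
	for len(ready) > 0 {
		cur := ready[0]
		ready = ready[1:]
		order = append(order, cur)
		for _, d := range dependents[cur] {
			pending[d]--
			if pending[d] == 0 {
				ready = append(ready, d)
			}
		}
		sort.Strings(ready)
	}
	if len(order) != len(pending) {
		return nil, fmt.Errorf("role setup rules contain a cycle")
	}
	return order, nil
}

func main() {
	order, err := setupOrder([]RoleSetupRule{
		{Name: "worker", After: []string{"controller"}},
		{Name: "controller"},
	})
	fmt.Println(order, err)
}
```

Rejecting cycles at validation time would keep an app author from declaring an ordering that can never be satisfied.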

Also, one thing to be aware of with the in-container mechanisms: if a startscript waits on a startscript in some other role, that can deadlock. We wait on a script's response when we execute it, and syncMembers is not parallel at the role level (issue #41). We need to do one of the following:

  • Declare that startscripts should never wait, and treat them more like a reconciliation handler that can be called multiple times until it succeeds. This diverges pretty significantly from the current model and is harder for the app author to think about.

  • Parallelize syncMembers more (and mutex protect calls to notifyReadyNodes) so that different roles can be handled at the same time. This one looks more promising.
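The second option can be sketched roughly as follows: one goroutine per role, with a mutex serializing the notification step. This is only an illustration under assumptions. The `role` type, the `syncRolesInParallel` function, and the callback signatures are made up for the example and are not the actual signatures of syncMembers or notifyReadyNodes.

```go
package main

import (
	"fmt"
	"sync"
)

// role is a hypothetical stand-in for a role and its member names.
type role struct {
	name    string
	members []string
}

// syncRolesInParallel runs startscripts for each role in its own goroutine,
// so a script in one role that waits on another role cannot block the whole
// pass. Calls to notify (standing in for notifyReadyNodes) are serialized
// with a mutex. It returns the last startscript error seen per role.
func syncRolesInParallel(
	roles []role,
	runStartScript func(roleName, member string) error,
	notify func(ready []string),
) map[string]error {
	var (
		wg       sync.WaitGroup
		notifyMu sync.Mutex // protects calls to notify
		errMu    sync.Mutex // protects errs
	)
	errs := make(map[string]error)

	for _, r := range roles {
		wg.Add(1)
		go func(r role) {
			defer wg.Done()
			var ready []string
			for _, m := range r.members {
				if err := runStartScript(r.name, m); err != nil {
					errMu.Lock()
					errs[r.name] = err
					errMu.Unlock()
					continue
				}
				ready = append(ready, m)
			}
			if len(ready) > 0 {
				notifyMu.Lock()
				notify(ready)
				notifyMu.Unlock()
			}
		}(r)
	}
	wg.Wait()
	return errs
}

func main() {
	controllerDone := make(chan struct{})
	roles := []role{
		{name: "worker", members: []string{"w0", "w1"}},
		{name: "controller", members: []string{"c0"}}, // single member, so close runs once
	}
	// Worker scripts wait for the controller. Handled serially in this order,
	// the first worker script would never return.
	run := func(roleName, member string) error {
		if roleName == "controller" {
			close(controllerDone)
			return nil
		}
		<-controllerDone
		return nil
	}
	errs := syncRolesInParallel(roles, run, func(ready []string) {
		fmt.Println("ready:", ready)
	})
	fmt.Println("errors:", len(errs))
}
```

Members within a role are still handled one at a time here, so a wait between members of the same role could still block; that case is outside what this sketch covers.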

We don't necessarily need to tackle the app-defined rules in 0.2.0, but the deadlock potential does need to be addressed. It seems very unlikely to happen until we introduce the agent and inter-member setup sync facilities, but let's head it off at the pass.

joel-bluedata, Oct 03 '18 17:10

Rather than parallelizing syncMembers right away, we could instead move to fire-and-forget startscripts. Start them running nohup'ed inside the container, then just come back around to check on their results on later handler invocations. Along with solving any potential startscript sequencing deadlocks, this would also allow users to request a resize during a long-running script.

Edit: a resize can't actually be requested during a running script, for reasons described in PR #62. However, fairly importantly, you can delete the virtual cluster.
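A minimal sketch of the fire-and-forget idea, assuming the handler keeps a per-member script state and can run a shell command inside the member's container. Everything named here is hypothetical: the `execFunc` type, the `reconcileMember` function, the state names, and the `/opt/startscript` and `/tmp/startscript.*` paths are placeholders for the example, not existing KubeDirector code.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

type scriptState int

const (
	notStarted scriptState = iota
	running
	succeeded
	failed
)

// execFunc stands in for whatever runs a shell command in a member's
// container and returns its stdout (hypothetical).
type execFunc func(cmd string) (string, error)

const (
	// Launch the script detached and record its exit code in a status file.
	launchCmd = "nohup sh -c '/opt/startscript; echo $? > /tmp/startscript.status' " +
		">/tmp/startscript.log 2>&1 &"
	// Print the recorded exit code, or nothing if the script has not finished.
	checkCmd = "cat /tmp/startscript.status 2>/dev/null || true"
)

// reconcileMember is called once per handler invocation. It never blocks on
// the script itself: it either launches it or polls for the recorded result.
func reconcileMember(state scriptState, exec execFunc) (scriptState, error) {
	switch state {
	case notStarted:
		if _, err := exec(launchCmd); err != nil {
			return notStarted, err
		}
		return running, nil
	case running:
		out, err := exec(checkCmd)
		if err != nil {
			return running, err
		}
		out = strings.TrimSpace(out)
		if out == "" {
			return running, nil // no result yet; check again next time
		}
		code, err := strconv.Atoi(out)
		if err != nil {
			return running, fmt.Errorf("unexpected status %q: %w", out, err)
		}
		if code == 0 {
			return succeeded, nil
		}
		return failed, nil
	}
	return state, nil // succeeded and failed are terminal
}

func main() {
	// Fake container: the script "finishes" on the second status check.
	checks := 0
	fake := func(cmd string) (string, error) {
		if cmd == checkCmd {
			checks++
			if checks >= 2 {
				return "0\n", nil
			}
		}
		return "", nil
	}
	state := notStarted
	for state != succeeded && state != failed {
		var err error
		if state, err = reconcileMember(state, fake); err != nil {
			fmt.Println("error:", err)
			return
		}
		fmt.Println("state:", state)
	}
}
```

Because each invocation returns immediately, the handler stays free to process other events, such as deleting the virtual cluster, while a long script is still running.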

joel-bluedata, Oct 04 '18 17:10

Going to split the async-startscript work into a couple of independent issues.

joel-bluedata, Oct 05 '18 16:10

OK, the remaining mission for this issue is:

Currently, if multiple roles are deployed/resized at the same time, we will handle the roles (run startscripts in their members) in some arbitrary order.

The app might have a stronger requirement like "members in the controller role must be set up before members in the worker role".

This sort of setup ordering can be handled in different ways, especially once we provide an optional in-container agent as per issue #35. However, there could be some simple role-setup-sequencing rules expressed in the app CR so that the agent (or any other in-container mechanism) doesn't have to get involved.

I don't think that part is a blocker for the 0.2.0 milestone.

joel-bluedata, Oct 11 '18 20:10