kubedirector
startscript invocation is synchronous now?
I noticed this when trying to launch a cdh5142cm cluster.
If the controller member gets its initial startscript invocation done on a reconciliation pass while some worker members are not yet ready to run their startscript, then the cluster never finishes configuration.
This seems to be because the controller startscript invocation never returns. Since KD can't run any new handlers for the kdcluster while a reconciliation handler is active, it can't do anything else in the cluster in the meantime, including running startscripts for workers when they become ready. This is an issue for cdh5142cm because the controller startscript is waiting on results from the worker startscripts (registering workers with the controller), so the result is a deadlock.
This used to work!
I tried the previous 2 minor releases and observed this issue. It is probably rare in the currently-used kdclusters, since it would only be obvious in this pattern where one startscript waits on the result of another one... but we gotta fix it.
In general we seem to have recently-ish had some issues making sure that all of the following are true about startscript invocations:
- they are backgrounded/async
- they write stdout and stderr to the appropriate files
- they write the exit status value to the appropriate file when finished
This is not rocket science (one would think) but seems to be easily broken, in not-always-immediately-visible ways. (We need better regression testing on the existing example cluster images, but that's for discussing in a separate issue.) It would also be good to solve the old hairy issue #55 ... but anyway, for this specific issue we need to make sure startscript invocation has the above properties and then make sure it stays that way.
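For reference, here is a minimal sketch of what those three properties look like together. This is not KD's actual code (KD is written in Go and runs the script inside the member's container); it is a standalone Python illustration. The function names and the output file names (`startscript.stdout`, `startscript.stderr`, `startscript.status`) are made up for the example, and it assumes a POSIX `sh` and `nohup` are available.

```python
import shlex
import subprocess
import time
from pathlib import Path
from typing import Optional, Sequence


def build_background_command(script: str, args: Sequence[str], out_dir: str) -> str:
    """Build a shell line that runs the script detached and records its results.

    File names below are hypothetical, chosen only for this sketch.
    """
    out = Path(out_dir)
    invocation = " ".join(shlex.quote(part) for part in [script, *args])
    stdout_f = shlex.quote(str(out / "startscript.stdout"))
    stderr_f = shlex.quote(str(out / "startscript.stderr"))
    status_f = shlex.quote(str(out / "startscript.status"))
    # Run the script with stdout/stderr redirected, then record its exit status.
    inner = f"{invocation} > {stdout_f} 2> {stderr_f}; printf %s \"$?\" > {status_f}"
    # nohup plus the trailing '&' detaches the work, so the caller is not blocked.
    return f"nohup sh -c {shlex.quote(inner)} > /dev/null 2>&1 &"


def launch_startscript(script: str, args: Sequence[str], out_dir: str) -> None:
    """Start the script in the background; returns as soon as it is launched."""
    subprocess.run(["sh", "-c", build_background_command(script, args, out_dir)], check=True)


def wait_for_status(out_dir: str, timeout: float = 10.0) -> Optional[int]:
    """Poll for the status file; return the exit code, or None on timeout."""
    status_path = Path(out_dir) / "startscript.status"
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if status_path.exists():
            text = status_path.read_text().strip()
            if text:  # the file is created empty just before the code is written
                return int(text)
        time.sleep(0.1)
    return None
```

The point of the sketch is that `launch_startscript` returns right away, so whatever called it (in KD's case, the reconciliation handler) is free to go launch other members' startscripts, and the result is picked up later by checking the status file. A regression test for this issue could assert exactly that: the launch call returns before the status file exists.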