littlechef
Add a way to resume fix runs for multiple nodes
Fix has the ability to bulk-configure a huge number of nodes (e.g. by using `nodes_with_role:*`).
If there is any kind of error during the run (be it an error in the Chef code, a timeout, or something else), fix will simply stop.
It would be nice if something like a `chef-resume` file could be created locally, containing all nodes which have not been configured yet. This file could then be used to start fix again with the remaining nodes, without needing to reconfigure all the other, already configured nodes.
As a side note: I don't think simply catching the fabric error and resuming with the next node is an option, because skipping nodes programmatically is not a good idea...
Mmm, I cannot think of a non-hacky way of implementing a chef-resume. An alternative would be to catch the fabric error, then ask whether it should continue or not...
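The catch-and-ask alternative could look roughly like the sketch below. It is plain Python 3 and does not touch Fabric or littlechef internals; `fix_with_prompt`, the `configure` callback and the `ask` parameter are hypothetical names used only for illustration.

```python
def fix_with_prompt(nodes, configure, ask=input):
    """Configure nodes in order; after a failure, ask whether to go on.

    `configure` is a stand-in for whatever deploys a single node.
    Returns the list of nodes whose configuration failed.
    """
    failed = []
    for i, node in enumerate(nodes):
        try:
            configure(node)
        except Exception as exc:
            failed.append(node)
            remaining = len(nodes) - i - 1
            if remaining == 0:
                break  # nothing left to continue with
            answer = ask(
                "%s failed (%s). Continue with %d remaining node(s)? [y/N] "
                % (node, exc, remaining)
            )
            if answer.strip().lower() != "y":
                break
    return failed
```

The node is never skipped silently: the user decides on every failure, and the returned list says which nodes still need attention.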
Maybe a more generic solution is the way to go:
- add a new command `nodes_from_file:<filename>` which reads hostnames from a given file and deploys them. The format of this file could be as simple as one line per host
- whenever a crash happens in a bulk fixing command, all nodes which have not been deployed yet are written to a generic file called `chef-resume`
- now the user could use `nodes_from_file` to resume the previous fix
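The steps above could be sketched roughly as follows, in plain Python without Fabric or littlechef internals. Only the `chef-resume` filename and the one-host-per-line format come from the proposal; `read_nodes_file`, `fix_nodes`, the `configure` callback and `RESUME_FILE` are hypothetical names.

```python
import os

RESUME_FILE = "chef-resume"  # filename taken from the proposal


def read_nodes_file(path):
    """Return hostnames from a file with one host per line, skipping blanks."""
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]


def fix_nodes(nodes, configure, resume_file=RESUME_FILE):
    """Configure nodes in order; on failure, save the rest and re-raise.

    `configure` is a stand-in for whatever deploys a single node.
    """
    for i, node in enumerate(nodes):
        try:
            configure(node)
        except (Exception, SystemExit):
            # SystemExit also covers helpers that abort via sys.exit().
            # The failed node and everything after it are not configured yet.
            with open(resume_file, "w") as f:
                f.write("\n".join(nodes[i:]) + "\n")
            raise
    # Full success: remove a stale resume file left by an earlier failed run.
    if os.path.exists(resume_file):
        os.remove(resume_file)
```

Resuming would then amount to `fix_nodes(read_nodes_file("chef-resume"), configure)`. The failed node is kept at the top of the file, so it is retried rather than skipped.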