littlechef add way to resume fix runs for multiple nodes

Open thekorn opened this issue 12 years ago • 3 comments

Fix has the ability to bulk configure a huge number of nodes (e.g. by using nodes_with_role:*). If there is any kind of error during the run (be it an error in the Chef code, a timeout or something else), fix will simply stop. It would be nice if something like a chef-resume file could be created locally, containing all nodes that have not been configured yet. This file could then be used to start fix again with the remaining nodes, without needing to reconfigure all the other, already configured nodes.

Nov 06 '12 14:11 thekorn

As a side note: I don't think simply catching the Fabric error and resuming with the next node is an option, because skipping nodes programmatically is not a good idea.

Nov 06 '12 14:11 thekorn

Mmm, I cannot think of a non-hacky way of implementing a chef-resume. An alternative would be to catch the fabric error, then ask whether it should continue or not...
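
A rough sketch of that "catch and ask" alternative, in plain Python. The names `fix_nodes_interactively`, `configure` and `ask` are hypothetical and not part of littlechef; `configure` stands in for whatever configures a single node, and the sketch assumes it signals failure by raising an ordinary exception:

```python
def fix_nodes_interactively(nodes, configure, ask=input):
    """Configure each node in order; after a failure, ask whether to carry on.

    `configure` is a hypothetical callable taking one hostname. Assumption:
    it raises a regular Exception subclass when the node cannot be configured.
    Returns the list of nodes that failed.
    """
    failed = []
    for position, node in enumerate(nodes):
        try:
            configure(node)
        except Exception as error:
            failed.append(node)
            remaining = len(nodes) - position - 1
            if remaining == 0:
                break  # nothing left to ask about
            answer = ask(
                "%s failed (%s). Continue with %d remaining node(s)? [y/N] "
                % (node, error, remaining)
            )
            if answer.strip().lower() not in ("y", "yes"):
                break
    return failed
```

Nothing is skipped silently here: the run only moves on to the next node if the user explicitly answers yes.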

Nov 06 '12 19:11 tobami

Maybe a more generic solution is the way to go:

- Add a new command nodes_from_file:<filename> which reads hostnames from a given file and deploys them. The format of this file could be as simple as one line per host.
- Whenever a crash happens in a bulk fixing command, all nodes which have not been deployed yet are written to a generic file called chef-resume.
- Now the user could use nodes_from_file to resume the previous fix.
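
A minimal sketch of how these steps might fit together, in plain Python. None of this is existing littlechef code: `read_nodes_from_file`, `fix_nodes` and the `configure` callable are hypothetical names, and writing the failing node itself into the resume file (so that it gets retried) is an assumption, not something the proposal specifies:

```python
import os

RESUME_FILE = "chef-resume"  # file name taken from the proposal above


def read_nodes_from_file(path):
    """Return hostnames from a file with one host per line, skipping blanks."""
    with open(path) as handle:
        return [line.strip() for line in handle if line.strip()]


def fix_nodes(nodes, configure, resume_path=RESUME_FILE):
    """Configure nodes in order; on failure, save the unfinished ones and re-raise.

    `configure` is a hypothetical callable that configures one node and
    raises an exception on error.
    """
    for index, node in enumerate(nodes):
        try:
            configure(node)
        except Exception:
            # Assumption: the failing node is written too, so it is retried.
            with open(resume_path, "w") as handle:
                handle.write("\n".join(nodes[index:]) + "\n")
            raise
    # Everything succeeded, so a stale resume file would only mislead.
    if os.path.exists(resume_path):
        os.remove(resume_path)
```

Resuming would then amount to something like `fix_nodes(read_nodes_from_file("chef-resume"), configure)`, which only touches the nodes left over from the interrupted run.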

Nov 06 '12 21:11 thekorn
