
add way to resume fix runs for multiple nodes

Open thekorn opened this issue 12 years ago • 3 comments

Fix can bulk-configure a large number of nodes (e.g. by using nodes_with_role:*). If any error occurs during the run (be it an error in the Chef code, a timeout, or something else), fix simply stops. It would be nice if something like a chef-resume file could be created locally, containing all nodes which have not been configured yet. This file could then be used to start fix again with the remaining nodes, without needing to reconfigure all the other, already configured nodes.

thekorn avatar Nov 06 '12 14:11 thekorn

As a side note: I don't think simply catching the fabric error and resuming with the next node is an option, because skipping nodes programmatically is not a good idea...

thekorn avatar Nov 06 '12 14:11 thekorn

Mmm, I cannot think of a non-hacky way of implementing a chef-resume. An alternative would be to catch the fabric error, then ask whether it should continue or not...
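
For illustration only, a minimal sketch of that alternative in plain Python. It does not use LittleChef or Fabric API: `fix_nodes_interactively`, `configure` and `confirm` are hypothetical names, and it assumes a failure surfaces as an ordinary exception, which may not match how Fabric reports errors.

```python
def fix_nodes_interactively(nodes, configure, confirm):
    """Configure nodes in order; after a failure, ask whether to go on.

    configure(node) is a hypothetical callable that deploys one node and
    raises on failure. confirm(node, error, remaining) is a hypothetical
    callable that returns True to continue with the remaining nodes.
    Returns (done, failed, remaining).
    """
    nodes = list(nodes)
    done, failed = [], []
    for index, node in enumerate(nodes):
        try:
            configure(node)
            done.append(node)
        except Exception as error:
            failed.append(node)
            remaining = nodes[index + 1:]
            # Stop if nothing is left or the user declines to continue.
            if not remaining or not confirm(node, error, remaining):
                return done, failed, remaining
    return done, failed, []
```

In an interactive session `confirm` could wrap a yes/no prompt; passing it in as a callable keeps the decision with the user instead of skipping nodes silently.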

tobami avatar Nov 06 '12 19:11 tobami

Maybe a more generic solution is the way to go:

  • add a new command nodes_from_file:<filename> which reads hostnames from a given file and deploys them. The format of this file could be as simple as one host per line.
  • whenever a bulk fix command crashes, all nodes which have not been deployed yet are written to a generic file called chef-resume.
  • the user could then use nodes_from_file to resume the previous fix.
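
A rough sketch of how these three steps could fit together, in plain Python. Nothing here is existing LittleChef code: `bulk_fix`, `write_resume_file` and the `configure` callable are hypothetical, and the sketch assumes the failing node itself should be retried, so it is written to the resume file as well.

```python
import os

RESUME_FILE = "chef-resume"  # file name taken from the proposal


def nodes_from_file(path):
    """Read hostnames from a file, one host per line, skipping blank lines."""
    with open(path) as handle:
        return [line.strip() for line in handle if line.strip()]


def write_resume_file(nodes, path=RESUME_FILE):
    """Write the given hostnames to the resume file, one per line."""
    with open(path, "w") as handle:
        for node in nodes:
            handle.write(node + "\n")


def bulk_fix(nodes, configure, resume_path=RESUME_FILE):
    """Configure each node; on a crash, record what is left and re-raise.

    configure(node) is a hypothetical callable that deploys one node and
    raises on failure.
    """
    nodes = list(nodes)
    for index, node in enumerate(nodes):
        try:
            configure(node)
        except Exception:
            # Assumption: the failed node is included so it gets retried.
            write_resume_file(nodes[index:], resume_path)
            raise
    # The whole run succeeded, so a stale resume file is no longer needed.
    if os.path.exists(resume_path):
        os.remove(resume_path)
```

Resuming would then amount to `bulk_fix(nodes_from_file("chef-resume"), configure)`. The error is re-raised rather than swallowed, so no node is skipped without the user noticing.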

thekorn avatar Nov 06 '12 21:11 thekorn