improve resilience against transient errors

Open ryane opened this issue 9 years ago • 0 comments

converge is currently sensitive to transient errors, particularly when performing operations over the network such as downloading files, keys, or docker images. Often, running converge again immediately after a failure results in a successful application.

This issue is for tracking ideas on how we can improve the resilience of converge against ephemeral errors. Some ideas:

Build simple retry logic (perhaps with an exponential backoff) into the core engine
Resources can implement their own specialized retry logic if needed
Resources should be able to opt out if it is undesirable or unsafe to retry
Some control of the retry behavior can be exposed to users via HCL parameters
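
To make the discussion concrete, here is a rough sketch of how a generic retry helper with exponential backoff and a per-error opt-out might look. Everything in it is an assumption for illustration: `Policy`, `Retry`, `Permanent`, and `errFlaky` are hypothetical names, not existing converge APIs, and the real engine would likely need cancellation and jitter as well.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Policy holds the retry knobs. All names here are hypothetical,
// not existing converge APIs.
type Policy struct {
	MaxAttempts int           // total tries, including the first (minimum 1)
	BaseDelay   time.Duration // wait before the second try
	MaxDelay    time.Duration // cap on a single wait; 0 means no cap
}

// permanentError wraps an error that must not be retried.
type permanentError struct{ err error }

func (p permanentError) Error() string { return p.err.Error() }
func (p permanentError) Unwrap() error { return p.err }

// Permanent lets a resource opt out of retries for a specific failure.
func Permanent(err error) error { return permanentError{err: err} }

// Retry runs op until it succeeds, returns a Permanent error, or the
// attempts are used up. The wait doubles after every failed attempt.
// It returns the number of attempts made and the last error.
func Retry(p Policy, op func() error) (int, error) {
	if p.MaxAttempts < 1 {
		p.MaxAttempts = 1
	}
	delay := p.BaseDelay
	for attempt := 1; ; attempt++ {
		err := op()
		if err == nil {
			return attempt, nil
		}
		var perm permanentError
		if errors.As(err, &perm) || attempt == p.MaxAttempts {
			return attempt, err
		}
		time.Sleep(delay)
		delay *= 2
		if p.MaxDelay > 0 && delay > p.MaxDelay {
			delay = p.MaxDelay
		}
	}
}

// errFlaky stands in for a transient network error in the demo below.
var errFlaky = errors.New("connection reset")

func main() {
	calls := 0
	policy := Policy{MaxAttempts: 5, BaseDelay: 10 * time.Millisecond, MaxDelay: time.Second}
	attempts, err := Retry(policy, func() error {
		calls++
		if calls < 3 {
			return errFlaky // fail twice, then succeed
		}
		return nil
	})
	fmt.Println(attempts, err)
}
```

A resource that knows a failure is unsafe to repeat would return `Permanent(err)` from its operation, and the fields of `Policy` are the kind of values that could be surfaced as HCL parameters.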

Additional thoughts / ideas?

Nov 21 '16 17:11 ryane