converge
improve resilience against transient errors
converge is currently sensitive to transient errors, particularly when performing operations over the network such as downloading files, keys, or Docker images. Often, running converge again immediately after a failure results in a successful application.
This issue is for tracking ideas on how we can improve the resilience of converge against ephemeral errors. Some ideas:
- Build simple retry logic (perhaps with an exponential backoff) into the core engine
- Resources can implement their own specialized retry logic if needed
- Resources should be able to opt-out if it is undesirable or unsafe to retry
- Some control of the retry behavior can be exposed to users via HCL parameters
Additional thoughts / ideas?