consul-alerts
Enhance PagerDuty notifier with retry configuration
Current behavior: when there is an error making requests to the PagerDuty API, the client by default does not retry the request; consul-alerts logs the error.
Preferred behavior: The underlying library providing PagerDuty support has logic built-in to retry requests when there is an error. By default, it does not perform retries, but through configuration values it can be set to do so. We would like to provide a way to optionally configure the client with these settings.
Primary benefit for consul-alerts users: PagerDuty returns an error when they rate-limit your service key. Their documentation says that the correct thing to do is to retry the request after a period of time.
The client fields in question are:
```go
client.MaxRetry = 5          // set max retries to 5 before failing; defaults to 0
client.RetryBaseInterval = 5 // set first retry to 5s; defaults to 10s
```
We suggest two optional fields in the PagerDuty notifier configs that would look like this in Consul:
consul-alerts/config/notifiers/pagerduty/max-retry
consul-alerts/config/notifiers/pagerduty/retry-base-interval
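A minimal sketch of how those keys could be read and applied, assuming a hypothetical KV-lookup helper `getConsulValue` (the real consul-alerts KV plumbing will differ; only the two gopherduty field names come from the snippet above):

```go
package main

import (
	"fmt"
	"strconv"
)

// getConsulValue is a hypothetical stand-in for consul-alerts' Consul KV
// lookup; it returns the key's value and whether the key exists.
func getConsulValue(key string) (string, bool) {
	return "", false // keys absent in this sketch
}

// pagerDutyClient mirrors the two gopherduty retry fields shown above.
type pagerDutyClient struct {
	MaxRetry          int
	RetryBaseInterval int
}

// applyRetryConfig sets a retry field only when its key is present, so
// absent keys preserve the existing no-retry default (MaxRetry == 0).
func applyRetryConfig(c *pagerDutyClient) {
	if v, ok := getConsulValue("consul-alerts/config/notifiers/pagerduty/max-retry"); ok {
		if n, err := strconv.Atoi(v); err == nil {
			c.MaxRetry = n
		}
	}
	if v, ok := getConsulValue("consul-alerts/config/notifiers/pagerduty/retry-base-interval"); ok {
		if n, err := strconv.Atoi(v); err == nil {
			c.RetryBaseInterval = n
		}
	}
}

func main() {
	c := &pagerDutyClient{}
	applyRetryConfig(c)
	fmt.Printf("MaxRetry=%d RetryBaseInterval=%d\n", c.MaxRetry, c.RetryBaseInterval)
}
```

With neither key present, the client keeps `MaxRetry = 0` and therefore never retries, matching the existing behavior.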
Since existing customer configurations don't have those keys, the code's default should retain the existing behavior: if those keys are not present, the PagerDuty client should not retry on errors.
The basic components of our suggested change are:
- adding the stated keys as optional fields of the `PagerDutyNotifier` struct and its unmarshaler
- setting either of those values on the client if its key is present
- adding a test for PagerDuty configuration with and without those values present.
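A table-driven sketch of the last two bullets, with and without the optional values set (struct and field names are illustrative, not the actual consul-alerts source):

```go
package main

import "fmt"

// PagerDutyNotifier sketches the config struct with the two new optional
// fields; a zero value means "key not present in Consul".
type PagerDutyNotifier struct {
	MaxRetry          int
	RetryBaseInterval int
}

// pdClient stands in for the gopherduty client's retry fields.
type pdClient struct {
	MaxRetry          int
	RetryBaseInterval int
}

// apply copies retry settings onto the client only when they were
// explicitly set, preserving the existing no-retry default otherwise.
func (n PagerDutyNotifier) apply(c *pdClient) {
	if n.MaxRetry > 0 {
		c.MaxRetry = n.MaxRetry
	}
	if n.RetryBaseInterval > 0 {
		c.RetryBaseInterval = n.RetryBaseInterval
	}
}

func main() {
	cases := []struct {
		name     string
		notifier PagerDutyNotifier
		want     pdClient
	}{
		{"keys absent: no retries", PagerDutyNotifier{}, pdClient{}},
		{"keys present: retries on", PagerDutyNotifier{MaxRetry: 5, RetryBaseInterval: 5}, pdClient{5, 5}},
	}
	for _, tc := range cases {
		var c pdClient
		tc.notifier.apply(&c)
		fmt.Printf("%s: got %+v, want %+v\n", tc.name, c, tc.want)
	}
}
```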
I have some example code that I could also paste here to illustrate these steps, or could supply as a PR.
Sounds like a good improvement. I would appreciate the PR and seems like it should be straight forward to review and approve.
@fusiondog comprehensive PR attached -- in addition to unit tests, I've also used a local instance to try out variations on set/unset values. I'm up for any adjustments to better match local source or config conventions.
Is it possible to make it retry infinitely?
@nh2 the way the underlying gopherduty client is written, it cannot retry indefinitely. However, you can set a high number for `max-retry` and a reasonable `retry-base-interval` that will exponentially back off.
For instance, `max-retry=15` and `retry-base-interval=10` will start with a 10-second delay for the first retry, and will end up with 3.8 days between the 14th and 15th retry.
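To make the back-off arithmetic concrete, here is a small sketch tabulating the delays for `max-retry=15` with a 10-second base. The scheme `delay = base * 2^(n-1)` is an assumption; depending on the exact exponent gopherduty uses, the final gap lands between roughly 1.9 and 3.8 days:

```go
package main

import "fmt"

// backoffDelay returns the delay in seconds before the nth retry, under
// the assumed scheme delay = base * 2^(n-1). gopherduty's actual exponent
// may differ by a factor of two.
func backoffDelay(baseSeconds, n int) int {
	return baseSeconds << (n - 1)
}

func main() {
	base := 10 // retry-base-interval=10
	for n := 1; n <= 15; n++ {
		d := backoffDelay(base, n)
		fmt.Printf("retry %2d after %6ds (%.2f days)\n", n, d, float64(d)/86400)
	}
}
```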
My own experiments show that so long as a retry happens after 60 seconds, it's likely to succeed. It's hard to determine an appropriate number of retries, though the current rate limits are about 150 requests per minute; so as long as you retry past (number of requests / 150) minutes, you'll likely be successful.
@fusiondog @darkcrux there is also a PR against the gopherduty library that will allow 403 (Rate limit response) to be treated as a retry-able error: https://github.com/darkcrux/gopherduty/pull/3
@StephenWeber Thanks for the detail, that is good to know; infinite retry would still be best for simplicity so that we don't have to rely on specific timings or logic.