
notify/webhook: allow configuring HTTP status codes to retry on

Open jvrplmlmn opened this issue 2 years ago • 8 comments

Introduces a new webhook config option, retry_codes: a list of additional HTTP status codes that the webhook notifier should retry on.

A specific use case: this will allow users to honor HTTP 429 (rate limiting), which the webhook notifier currently does not retry on.
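
As a rough sketch of the intended semantics (the function and parameter names below are illustrative, not the final notifier API):

```go
package main

import "fmt"

// shouldRetry sketches how the webhook notifier could decide whether to
// retry a failed notification, given the user-configured retry_codes.
// Names here are illustrative, not the actual Alertmanager API.
func shouldRetry(statusCode int, retryCodes []int) bool {
	// 5xx responses are already retried by the existing behaviour.
	if statusCode >= 500 && statusCode < 600 {
		return true
	}
	// Additionally retry on any explicitly configured codes, e.g. 429.
	for _, c := range retryCodes {
		if statusCode == c {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(shouldRetry(429, []int{429})) // true
	fmt.Println(shouldRetry(404, []int{429})) // false
}
```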

jvrplmlmn avatar May 25 '22 11:05 jvrplmlmn

Hey folks, seems like CI is failing on an acceptance test: TestResolvedFilter

I am not entirely familiar with the test process for Alertmanager. For completeness, this is what I have been running on both main and this branch, and in both cases the test suite fails for me:

  • make build-all
  • make common-test

Could I get a hint on how to troubleshoot whether that is related to these changes? Much appreciated!

jvrplmlmn avatar May 30 '22 10:05 jvrplmlmn

Just to say quickly: 429 responses can come with a Retry-After header telling you how long to wait before retrying. Perhaps we could extend the retry logic to look for this? I see it's using an exponential ticker now, though, so it might be slightly hairy to implement - if so, it could legitimately be deferred to a follow-up.

The rationale is that if you're being rate limited, it would be nice to back off for the amount of time the server asks for, which gives you a better chance of not being limited on the next attempt.
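
To make the idea concrete, here's a minimal sketch of parsing the header, which per RFC 7231 can carry either a delay in seconds or an HTTP-date. This is illustrative only and would still need to be reconciled with the existing backoff logic:

```go
package main

import (
	"fmt"
	"net/http"
	"strconv"
	"time"
)

// retryAfter parses a Retry-After header, which may be either a delay
// in seconds or an HTTP-date (RFC 7231). It returns the duration to
// wait and whether a usable value was found.
func retryAfter(resp *http.Response) (time.Duration, bool) {
	v := resp.Header.Get("Retry-After")
	if v == "" {
		return 0, false
	}
	// Form 1: delay-seconds, e.g. "120".
	if secs, err := strconv.Atoi(v); err == nil {
		return time.Duration(secs) * time.Second, true
	}
	// Form 2: HTTP-date, e.g. "Wed, 21 Oct 2015 07:28:00 GMT".
	if t, err := http.ParseTime(v); err == nil {
		return time.Until(t), true
	}
	return 0, false
}

func main() {
	resp := &http.Response{Header: http.Header{"Retry-After": []string{"120"}}}
	if d, ok := retryAfter(resp); ok {
		fmt.Println("server asked us to wait", d)
	}
}
```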

iainlane avatar Jun 01 '22 11:06 iainlane

Due to the clustered nature of Alertmanager, it would not be possible to honor Retry-After, which is why I am not sure we should accept this pull request.

roidelapluie avatar Jun 01 '22 13:06 roidelapluie

If I understand correctly, the PagerDuty and Opsgenie notifiers already check for 429s in the Retrier, and there we don't look at the Retry-After header. For those integrations, the retry strategy is exponential backoff.
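
For context, that backoff has roughly this shape (the constants here are illustrative, not the values the Retrier actually uses):

```go
package main

import (
	"fmt"
	"math"
	"time"
)

// backoff sketches the exponential delay applied between retry
// attempts: the wait doubles each attempt, up to a cap.
func backoff(attempt int) time.Duration {
	const (
		base    = 1 * time.Second
		maxWait = 30 * time.Second
	)
	d := time.Duration(float64(base) * math.Pow(2, float64(attempt)))
	if d > maxWait {
		d = maxWait
	}
	return d
}

func main() {
	for i := 0; i < 5; i++ {
		fmt.Printf("attempt %d: wait %v\n", i, backoff(i))
	}
}
```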

jvrplmlmn avatar Jun 09 '22 08:06 jvrplmlmn

Yes. However, those are cloud services.

For Alertmanager webhooks, we expect them to be written specifically for Alertmanager and therefore to never return 429s.

roidelapluie avatar Jun 09 '22 09:06 roidelapluie

Due to the clustered nature of Alertmanager,

Out of curiosity, what am I missing from this rationale? I can't picture how the clustering logic creates a problem for retrying. However, I can see how this conflicts in non-intuitive ways with things like repeat_interval.

With clustering, under partitions, the price to pay will always be duplicate notifications. If the downstream system is under pressure, that's a matter for the operator's configuration, and they already have all the knobs they need to tweak and adjust as necessary.

I want to hear @roidelapluie's opinion first to make sure I understand his angle, but in my eyes this kind of change makes sense (mostly because there's precedent with other integrations) - my understanding is that the Alertmanager philosophy is to allow custom integrations via webhooks.

By including this functionality, we're allowing webhook integrations to have similar features to the other receivers we support. I see this as a net positive, as there'd be less need to include proprietary receivers in Alertmanager and thus less maintenance overhead.

gotjosh avatar Jun 10 '22 08:06 gotjosh

After taking a closer look, I do see a small problem with this - namely that we'll retry indefinitely (see here) - so on the implementation side, I think this might need to be paired (for webhooks at least) with a maximum elapsed time that we're willing to retry the notification for.
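
Something along these lines is what I have in mind; send, the constants, and the loop structure are placeholders rather than a concrete proposal for the implementation:

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// notifyWithDeadline sketches bounding the retry loop with a maximum
// elapsed time instead of retrying indefinitely.
func notifyWithDeadline(send func() error, maxElapsed time.Duration) error {
	start := time.Now()
	for attempt := 0; ; attempt++ {
		err := send()
		if err == nil {
			return nil
		}
		// Give up once we've spent more than maxElapsed retrying.
		if time.Since(start) > maxElapsed {
			return fmt.Errorf("giving up after %v: %w", time.Since(start), err)
		}
		// Exponential backoff between attempts, capped at 30s.
		wait := 30 * time.Second
		if attempt < 5 {
			wait = time.Duration(1<<uint(attempt)) * time.Second
		}
		time.Sleep(wait)
	}
}

func main() {
	err := notifyWithDeadline(func() error { return errors.New("HTTP 429") }, 5*time.Second)
	fmt.Println(err)
}
```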

But as I said, I'll wait to hear @roidelapluie's opinion before we move there.

gotjosh avatar Jun 10 '22 09:06 gotjosh

@jvrplmlmn I've discussed this with @roidelapluie offline and we've agreed that we can move forward.

Can you please extend this Pull Request with a maximum retry time for webhooks? I don't think we need to make it configurable from the get-go; for now it can just be a sensible default.

gotjosh avatar Jun 27 '22 11:06 gotjosh