alertmanager
notify/webhook: allow configuring HTTP status codes to retry on
Introduces a new webhook config option, retry_codes, which is a list of additional HTTP status codes that the webhook notifier should retry on.
A specific use case: This will allow users to honor HTTP 429 (rate limiting), which is currently not supported.
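To illustrate, a configuration using the proposed option might look like the following. This is a hypothetical sketch: the PR names the field retry_codes, but the exact placement and value syntax shown here are assumptions.

```yaml
receivers:
  - name: example
    webhook_configs:
      - url: https://example.com/hook
        # Hypothetical usage of the retry_codes field proposed in this PR:
        # retry on 429 (rate limiting) in addition to the default
        # retryable status codes.
        retry_codes: [429]
```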
Hey folks, it seems like CI is failing on an acceptance test: TestResolvedFilter.
I am not entirely familiar with the test process for Alertmanager. For completeness, this is what I have been running on both main and this branch, and in both cases the test suite fails for me:
- make build-all
- make common-test
Could I get a hint on how to troubleshoot whether that is related to these changes? Much appreciated!
Just to say quickly: 429 responses can come with a Retry-After header telling you how long to wait before retrying. Perhaps we could extend the retry logic to look for this? I see it's using an exponential ticker now, so it might be slightly hairy to implement; if so, it could legitimately be deferred to a follow-up.
The rationale is that if you're being rate limited, it would be nice to back off for the amount of time the server asks, giving a better chance of not being limited on the next attempt.
Due to the clustering nature of Alertmanager, it would not be possible to honor Retry-After, which is why I am not sure we should accept this pull request.
If I understand correctly, the PagerDuty and Opsgenie notifiers already check for 429s in the Retrier, and there we don't check the Retry-After header. For those integrations, the retry strategy is exponential backoff.
Yes. However, those are cloud services. Alertmanager webhooks, on the other hand, we expect to be written specifically for Alertmanager, and therefore to never return 429s.
> Due to the clustering nature of alertmanager,

Out of curiosity, what am I missing from this rationale? I can't picture how the clustering logic creates a problem for retrying. However, I can see how this conflicts in non-intuitive ways with things like repeat_interval.
With clustering, under partitions, the price to pay will always be duplicate notifications. If the downstream system is under pressure that's up to the operator's configuration and you have all the knobs you need to tweak and adjust as necessary.
I want to hear @roidelapluie's opinion first to make sure I understand his angle, but in my eyes this kind of change makes sense (mostly because there's precedent with other integrations). My understanding is that the Alertmanager philosophy is to allow custom integrations via webhooks.
By including this functionality, we're allowing webhook integrations to have similar features to other receivers we support. I see this as a net positive, as there'd be less need to include proprietary receivers in Alertmanager and thus less maintenance overhead.
After taking a closer look, I do see a small problem with this: we'll retry indefinitely (see here). So on the implementation side, I think this needs to be paired (for webhooks at least) with a maximum elapsed time that we're willing to retry the notification for.
But as I said, I'll wait to hear @roidelapluie's opinion before we move there.
@jvrplmlmn I've discussed this with @roidelapluie offline and have agreed that we can move forward.
Can you please extend this pull request with a maximum retry time for webhooks? I don't think we need to make it configurable from the get-go; for now it can just be a sensible default.