
Service rate limits and other temporary sending failures are not handled

Open JonasPf opened this issue 3 years ago • 4 comments

I'm running into a problem where I send too many alerts to Slack and hit its rate limit: https://api.slack.com/docs/rate-limits#rate-limits__responding-to-rate-limiting-conditions

I wonder what the best way is to deal with this problem. I can see two options:

  1. In my own code, retry calling shoutrrr.Send() if it returns an error due to rate limiting (see the sketch after this list). The problem here is that the only way to determine whether the error is due to rate limiting is by parsing the error text.
  2. Shoutrrr itself could handle the rate limiting and retry the send. This doesn't sound desirable either, because it would mean a simple shoutrrr.Send() could block for a long time (the Slack docs mention 30 seconds until retry).
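
Here is roughly what option 1 looks like from my side today. The Slack URL is just a placeholder, and the strings I match on are guesses about the error text, which is exactly the fragile part:

```go
package main

import (
	"log"
	"strings"
	"time"

	"github.com/containrrr/shoutrrr"
)

// sendWithRetry retries shoutrrr.Send when the returned error looks like a
// rate-limit response. Matching on the error text is the only option right now.
func sendWithRetry(url, message string, attempts int, wait time.Duration) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = shoutrrr.Send(url, message); err == nil {
			return nil
		}
		msg := strings.ToLower(err.Error())
		if !strings.Contains(msg, "429") && !strings.Contains(msg, "rate limit") {
			return err // not a rate-limit error, don't retry
		}
		time.Sleep(wait) // Slack asks clients to honour Retry-After; a fixed wait is a simplification
	}
	return err
}

func main() {
	// placeholder Slack service URL
	if err := sendWithRetry("slack://token-a/token-b/token-c", "hello", 3, 30*time.Second); err != nil {
		log.Fatal(err)
	}
}
```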

Any thoughts or ideas?

JonasPf avatar Mar 05 '21 16:03 JonasPf

Related, I got this error once from the OpsGenie service (it happened exactly once and I've been using my own shoutrrr branch with OpsGenie in production since Christmas):

OpsGenie notification returned 503 HTTP status code: upstream connect error or disconnect/reset before headers. reset reason: connection failure

According to the OpsGenie docs we should retry immediately when getting a 503: https://docs.opsgenie.com/docs/response (scroll to the bottom).
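
At the HTTP level that "retry immediately on 503" behaviour would look something like this. This is just a generic sketch, not how the actual OpsGenie service in shoutrrr is structured:

```go
package opsgenieretry

import (
	"bytes"
	"net/http"
)

// postWithRetry POSTs the payload and, if the API answers 503, retries once
// immediately as the OpsGenie docs suggest. The request is rebuilt per attempt
// so the body can be re-sent.
func postWithRetry(client *http.Client, url string, payload []byte) (*http.Response, error) {
	for attempt := 0; ; attempt++ {
		resp, err := client.Post(url, "application/json", bytes.NewReader(payload))
		if err != nil {
			return nil, err
		}
		if resp.StatusCode == http.StatusServiceUnavailable && attempt == 0 {
			resp.Body.Close()
			continue // immediate retry on 503, once
		}
		return resp, nil
	}
}
```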

JonasPf avatar Mar 05 '21 18:03 JonasPf

The sending is done in a goroutine, so it shouldn't block. That said, I still think it makes more sense if this is handled outside of shoutrrr. I don't have any good arguments either way, other than that I'm a bit hesitant to bring that additional complexity into the project. @piksel and @arnested, any thoughts?

simskij avatar Apr 23 '21 07:04 simskij

The sending is done in a goroutine

Not for shoutrrr.Send() itself, since that is the super-basic API.

I think the router could manage it if we just add a common error like temporaryFailure. The problem would of course be that different services might need different back-off times, but perhaps OpsGenie wouldn't mind waiting 30 seconds if it means that the notification arrives eventually...
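
Something like this is what I have in mind. All the names here are hypothetical, nothing like this exists in shoutrrr today:

```go
package types

import (
	"errors"
	"fmt"
)

// ErrTemporaryFailure would be a shared sentinel that every service wraps
// its rate-limit/5xx errors with.
var ErrTemporaryFailure = errors.New("temporary sending failure")

// A service could report a rate-limited response like this:
func rateLimited(retryAfterSeconds int) error {
	return fmt.Errorf("slack: rate limited, retry after %ds: %w", retryAfterSeconds, ErrTemporaryFailure)
}

// ...and the router (or the caller) could decide whether retrying makes sense
// without parsing the message text:
func shouldRetry(err error) bool {
	return errors.Is(err, ErrTemporaryFailure)
}
```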

piksel avatar Apr 23 '21 10:04 piksel

Sorry for the delay.

I also think retries should be kept out of the library. I'm not certain that there is one right way to do retries across all services, and implementing retry strategies per service doesn't seem like the right way either.

There are several libraries out there that do retries, e.g. https://github.com/matryer/try. But I'm not very experienced in implementing retries myself, so I can't really recommend any of them.
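
For illustration, caller-side retrying with matryer/try could look roughly like this (the service URL, the retry count and the fixed back-off are all placeholders):

```go
package main

import (
	"log"
	"time"

	"github.com/containrrr/shoutrrr"
	"github.com/matryer/try"
)

func main() {
	err := try.Do(func(attempt int) (bool, error) {
		err := shoutrrr.Send("slack://token-a/token-b/token-c", "hello") // placeholder URL
		if err != nil {
			time.Sleep(30 * time.Second) // crude fixed back-off before the next attempt
		}
		return attempt < 3, err // retry up to 3 attempts in total
	})
	if err != nil {
		log.Fatal(err)
	}
}
```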

Regarding the problem with parsing the error text: we could consider adding better error values (I'm thinking Go 1.13 error wrapping with errors.Is()/errors.As()), but that would probably be a breaking change.
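
Something like this, for example (a hypothetical type, nothing in shoutrrr today):

```go
package types

import (
	"errors"
	"fmt"
	"time"
)

// RateLimitError is a hypothetical wrapped error a service could return,
// carrying the back-off the remote API asked for.
type RateLimitError struct {
	RetryAfter time.Duration
}

func (e *RateLimitError) Error() string {
	return fmt.Sprintf("rate limited, retry after %s", e.RetryAfter)
}

// Callers could then pick the right back-off without string matching:
func backoffFor(err error) (time.Duration, bool) {
	var rle *RateLimitError
	if errors.As(err, &rle) {
		return rle.RetryAfter, true
	}
	return 0, false
}
```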

arnested avatar May 30 '21 13:05 arnested