Branding: improve IO resilience

Open kwzrd opened this issue 1 year ago • 1 comments

trafficstars

Closes #2808

This PR aims to improve the resiliency of the Branding manager cog. It is a rework of #2869, which I've closed and re-implemented in this branch in adherence with the reviewer's request to use the tenacity lib.

Context

The Branding manager automatically discovers events once per day or when instructed to. During each such discovery, it makes multiple HTTP requests to the GitHub API. First, it fetches a list of all events in the Branding repository, and then it fetches those events one-by-one.

Currently, if the bot fails to fetch an event, it simply skips it. This seemed useful initially, as we didn't want the bot to stop the discovery iteration if just one event is badly configured. In practice, however, this causes undesired behaviour: if the currently active event is skipped due to a 5xx error, the bot may reset the guild's branding to the fallback event, despite the event still being in progress.

This PR makes adjustments to the error handling methodology. Now, if any of the requests fail, the whole discovery iteration is aborted & retried the following day (or manually via the sync command). The last valid state remains in the Redis cache, so the cog will continue to work just fine.

Additionally, 5xx errors are now retried up to 5 times, with exponential backoff. This prevents discovery iterations from being cancelled due to stray server errors.

Implementation

The solution is fairly simple. The repository class no longer swallows IO/deserialization errors, and instead lets them propagate to the cog. The cog mostly already handles exceptions coming from the repo, I just added error handling where it was missing.

The calendar refresh can now fail (well, it could always fail, but the error was hidden). I added an embed response to inform the user of the result:

Because of this, it no longer automatically invokes the calendar view after the refresh - it seemed noisy. But feel free to propose other solutions.

I've used the Apache 2.0 licensed tenacity library for the retry with backoff.

Oct 09 '24 17:10 kwzrd

For testing purposes, you can do this. :trollface:

async with self.bot.http_session.get(...) as response:
    response.status = 500
    _raise_for_status(response)

The cog is overall not very testable. I have some ideas on how to improve it, but I'd rather leave it for a separate PR.

Oct 09 '24 18:10 kwzrd

bot bot copied to clipboard

Branding: improve IO resilience

Context

Implementation

bot
bot copied to clipboard