wellcomecollection.org

Restart tasks automatically

Open gestchild opened this issue 1 year ago • 1 comments

We sometimes get updown alerts that follow a particular pattern: a down alert followed almost immediately by a recovery alert, repeating frequently. They are normally accompanied by CloudFront 504 (Gateway Timeout) errors for various URLs. Only Prismic-fed pages seem to be affected, but the Prismic APIs are fine and there are no related Prismic errors in the logs. The affected pages generally seem fine when visited, but can sometimes be slow to load.

Cause: Something appears to be going wrong on only one of the servers. The problem looks intermittent because it depends on whether a given request is routed to the problematic server.

Restarting the tasks has always fixed this issue, so we could use a healthcheck to restart tasks automatically.
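As a rough illustration of what such a healthcheck could look like, here is a minimal sketch. It assumes the healthcheck runs a small probe (for example, a lightweight Prismic-backed request) under a time limit and reports 503 when the probe fails or hangs, so that whatever supervises the tasks can replace the unhealthy one. The names `probeWithTimeout` and `healthcheck`, the 2000 ms default, and the response shape are all hypothetical and not taken from the codebase.

```typescript
// Hypothetical sketch: none of these names or values come from the issue.
type HealthStatus = { statusCode: 200 | 503; body: string };

// Race the probe against a timer so that a hung dependency counts as a
// failure instead of leaving the healthcheck itself hanging.
async function probeWithTimeout(
  probe: () => Promise<void>,
  timeoutMs: number
): Promise<boolean> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<boolean>((resolve) => {
    timer = setTimeout(() => resolve(false), timeoutMs);
  });
  try {
    return await Promise.race([
      // Promise.resolve().then(probe) also catches synchronous throws.
      Promise.resolve()
        .then(probe)
        .then(
          () => true,
          () => false
        ),
      timeout,
    ]);
  } finally {
    // Clear the timer so a fast probe does not leave it pending.
    if (timer !== undefined) clearTimeout(timer);
  }
}

// Map the probe outcome to an HTTP-style status for the supervisor to read.
async function healthcheck(
  probe: () => Promise<void>,
  timeoutMs = 2000
): Promise<HealthStatus> {
  const ok = await probeWithTimeout(probe, timeoutMs);
  return ok
    ? { statusCode: 200, body: "ok" }
    : { statusCode: 503, body: "unhealthy" };
}
```

The healthcheck only reports status; actually restarting a task would be left to the orchestrator's own health check configuration, which is not described in this issue.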

gestchild avatar Nov 15 '23 10:11 gestchild

@jamieparkinson could you fill in the detail of exactly what needs doing please?

gestchild avatar Nov 15 '23 10:11 gestchild

Closing as lost context.

kenoir avatar Oct 18 '24 07:10 kenoir