wellcomecollection.org
Restart tasks automatically
We sometimes get updown alerts that follow a particular pattern: a down alert followed almost immediately by a recovery alert, repeating frequently. They are normally accompanied by CloudFront 504 (Gateway Timeout) errors for various URLs. Only Prismic-fed pages seem to be affected, but the Prismic APIs are fine and there are no related Prismic errors in the logs. The affected pages generally seem fine when visited, but can sometimes be slow to load.
Cause: It appears that something is going wrong on only one of the servers. The problem looks intermittent because it depends on whether a given request is routed to the problematic server.
Restarting the tasks has always fixed this issue, so we could use a healthcheck to restart tasks automatically.
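As a rough illustration of what such a healthcheck could look like, here is a minimal sketch, not the actual implementation. It assumes the tasks run behind something that polls a health endpoint and replaces tasks that repeatedly return a non-200 status (for example a load balancer target group; this is an assumption, the issue does not say how the tasks are deployed). All names here (`LatencyHealthTracker`, `healthcheckStatusCode`, the thresholds) are hypothetical and not taken from the codebase. The idea is to record recent request durations on each server and report unhealthy when too many of them are slow, so that only the problematic server fails its check.

```typescript
// Hypothetical sketch: none of these names come from the wellcomecollection.org codebase.

type HealthStatus = {
  healthy: boolean;
  slowCount: number;
  sampleCount: number;
};

class LatencyHealthTracker {
  private durationsMs: number[] = [];
  private readonly windowSize: number; // how many recent requests to remember
  private readonly slowThresholdMs: number; // a request slower than this counts as slow
  private readonly maxSlowFraction: number; // unhealthy when more than this fraction is slow

  constructor(windowSize: number, slowThresholdMs: number, maxSlowFraction: number) {
    this.windowSize = windowSize;
    this.slowThresholdMs = slowThresholdMs;
    this.maxSlowFraction = maxSlowFraction;
  }

  // Call once per completed request, e.g. from server middleware.
  record(durationMs: number): void {
    this.durationsMs.push(durationMs);
    if (this.durationsMs.length > this.windowSize) {
      this.durationsMs.shift(); // drop the oldest sample
    }
  }

  status(): HealthStatus {
    const sampleCount = this.durationsMs.length;
    const slowCount = this.durationsMs.filter(
      (d) => d > this.slowThresholdMs
    ).length;
    // With no samples yet, report healthy so a freshly started task is not replaced.
    const healthy =
      sampleCount === 0 || slowCount / sampleCount <= this.maxSlowFraction;
    return { healthy, slowCount, sampleCount };
  }
}

// Status code a healthcheck route would return: 200 when healthy, 503 otherwise.
function healthcheckStatusCode(tracker: LatencyHealthTracker): number {
  return tracker.status().healthy ? 200 : 503;
}
```

The window size and thresholds would need tuning against real traffic: too strict and healthy tasks get restarted during a genuine Prismic slowdown, too lenient and the bad server keeps serving 504s.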
@jamieparkinson could you fill in the detail of exactly what needs doing please?
Closing as lost context.