# Increased 5xx alarms across our services
This week we have seen a number of 5xx alarms:
- from the Frontend Article app; next steps are in the incident retro doc: https://docs.google.com/document/d/1lR6M05vZIRcabmTJoGYosCgmoj5eVCki8tvaPc3pNpo/edit
- failed health checks since Feb 11th: https://logs.gutools.co.uk/s/dotcom/goto/18a8d900-d266-11ee-9fbb-33f5a4d0cc15
- timeouts from the DCR facia rendering app, which have subsided since https://github.com/guardian/dotcom-rendering/pull/10656
### Tasks
- [ ] Investigate whether the circuit breaker config changes caused the 5xx article responses
- [ ] Improve logging in DCR article rendering
- [ ] Reduce noise in DCR article rendering logging (mainly amp errors which should probably be warnings)
Our current theory is that we are under-provisioning the article-rendering app, which is causing issues in the Article app. A t4g.small has a baseline performance of 20% CPU utilisation, but we're running at 40-45%.
We should try changing instance size/type or increasing the number of instances.
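For reference, here's the rough CPU-credit arithmetic behind that concern (the t4g.small figures are from the AWS burstable-instance docs; the 45% is just the upper end of what we've been observing):

```ts
// CPU-credit budget for one t4g.small at the load we're seeing.
// Per the AWS docs: t4g.small has 2 vCPUs, a 20% per-vCPU baseline,
// and earns 24 CPU credits/hour (1 credit = 1 vCPU at 100% for 1 minute).
const vcpus = 2;
const creditsEarnedPerHour = 24;
const observedUtilisation = 0.45; // upper end of the observed 40-45%

const creditsSpentPerHour = observedUtilisation * vcpus * 60; // 54
const netCreditsPerHour = creditsEarnedPerHour - creditsSpentPerHour; // -30

// A negative number means we drain any accrued credits and then, in unlimited
// mode, pay the flat surplus-credit rate for roughly half a vCPU-hour per
// instance every hour.
console.log({ creditsSpentPerHour, netCreditsPerHour });
```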
It's also worth noting that the amount of traffic handled by the article-rendering app is steadily increasing, perhaps in part due to the rollout of DCAR, but that doesn't fully explain the elevated traffic levels as we're also seeing an increase in requests for web articles.
A dashboard showing article-rendering app requests split by platform is here: https://metrics.gutools.co.uk/d/beb06834-f7c5-4e5c-9177-fc4ad6a71dcb/debugging-article-app-errors
We may want to consider using CPU utilisation as a scaling metric rather than latency, to keep our instances at an optimised level for performance.
I believe we have unlimited mode enabled on our account so we shouldn't be throttled to baseline.
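This is queryable per instance rather than something we need to take on trust. A quick sketch using the AWS SDK v3 EC2 client (instance IDs and region are placeholders, not our real values):

```ts
// Check whether the article-rendering instances really are in unlimited mode.
import {
  EC2Client,
  DescribeInstanceCreditSpecificationsCommand,
} from '@aws-sdk/client-ec2';

const ec2 = new EC2Client({ region: process.env.AWS_REGION });

const { InstanceCreditSpecifications } = await ec2.send(
  new DescribeInstanceCreditSpecificationsCommand({
    InstanceIds: ['i-0123456789abcdef0'], // placeholder: article-rendering instance IDs
  }),
);

// CpuCredits is either 'standard' or 'unlimited' for each instance.
console.log(InstanceCreditSpecifications);
```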
> We may want to consider using CPU utilisation as a scaling metric rather than latency, to keep our instances at an optimised level for performance.
I think CPU utilisation fits well with standard burstable instance types if you can stay under the baseline long enough for CPU credits to cover the time spent above it. For unlimited mode, targeting CPU utilisation would allow us to avoid incurring the additional 'unlimited' flat-rate costs. However, both modes incur the expense of increasing the number of instances during periods of load to meet CPU utilisation targets.
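For what it's worth, target-tracking on average CPU is a one-liner on the ASG. A minimal sketch below, written against plain aws-cdk-lib rather than the @guardian/cdk wrappers we actually deploy with (whose scaling props may differ); the capacities and the 50% target are illustrative placeholders:

```ts
import * as cdk from 'aws-cdk-lib';
import * as ec2 from 'aws-cdk-lib/aws-ec2';
import * as autoscaling from 'aws-cdk-lib/aws-autoscaling';

const app = new cdk.App();
const stack = new cdk.Stack(app, 'ArticleRenderingScalingSketch');
const vpc = new ec2.Vpc(stack, 'Vpc');

const asg = new autoscaling.AutoScalingGroup(stack, 'ArticleRenderingAsg', {
  vpc,
  // t4g.small, i.e. the burstable Graviton size discussed above
  instanceType: ec2.InstanceType.of(
    ec2.InstanceClass.BURSTABLE4_GRAVITON,
    ec2.InstanceSize.SMALL,
  ),
  machineImage: ec2.MachineImage.latestAmazonLinux2023(),
  minCapacity: 3, // placeholder values, not our real config
  maxCapacity: 12,
});

// Target-tracking on average ASG CPU: instances are added/removed to hold
// utilisation near the target, which for burstable types also keeps us
// closer to the credit baseline than a latency-based policy would.
asg.scaleOnCpuUtilization('KeepCpuModerate', {
  targetUtilizationPercent: 50,
});
```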
We recently reduced the cluster sizes while load has increased, so CPU utilisation has been climbing, pushing us above the baseline and incurring the additional flat-rate cost.
We could increase the cluster size or scale more quickly, but burstable instances (unlimited or not) effectively tie us to CPU utilisation, which in my opinion is a loose proxy (and hence less accurate) for latency, which we have hard targets for.
Burstable instances also prevent us from using the full capacity of the instances under sustained load, which we have been experiencing more of since adding new content types rendered by DCR.
> We should try changing instance size/type or increasing the number of instances.
This also makes me think that when we did the cost analysis of t4g vs c6.xlarge, we (myself included) did not consider the comparison between a cluster of unlimited-mode t4gs running with the additional flat-rate costs and a smaller cluster of c6.xlarge instances where we can utilise a much higher CPU% at no additional cost.
This tradeoff is discussed in the AWS documentation: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/burstable-performance-instances-unlimited-mode-concepts.html#when-to-use-unlimited-mode
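To make that comparison concrete, the missing analysis is roughly the sketch below. All prices and cluster sizes are hypothetical placeholders to be filled in from the AWS pricing pages for our region, and it assumes sustained load (i.e. the credit balance is already spent):

```ts
// Rough shape of the comparison: N unlimited-mode t4gs paying the
// surplus-credit rate under sustained load vs a smaller cluster of
// fixed-rate instances run at a much higher CPU%.
const t4gSmallHourly = 0.017; // placeholder on-demand $/hr
const surplusCreditRate = 0.04; // placeholder $ per surplus vCPU-hour
const fixedInstanceHourly = 0.14; // placeholder $/hr for a compute-optimised xlarge

// t4g cluster: each instance sustains `cpu` utilisation on 2 vCPUs with a
// 20% baseline, so everything above the baseline is billed as surplus.
function t4gClusterHourlyCost(instances: number, cpu: number): number {
  const surplusVcpuHoursPerInstance = Math.max(0, cpu - 0.2) * 2;
  return instances * (t4gSmallHourly + surplusVcpuHoursPerInstance * surplusCreditRate);
}

// Fixed-rate cluster: no extra charge however hard we run it.
function fixedClusterHourlyCost(instances: number): number {
  return instances * fixedInstanceHourly;
}

console.log('t4g x9 @ 45% CPU:', t4gClusterHourlyCost(9, 0.45).toFixed(3));
console.log('fixed x3:', fixedClusterHourlyCost(3).toFixed(3));
```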
The change log for article-rendering configuration changes since 23/02/2024 is here for cost analysis purposes.
After increasing the number of instances running and upping the size of the instances, CPU usage did decrease a bit, but not enough to accrue credits on a burstable instance type.
Changing the instance class to a non-burstable C7G instance type reduced the average response time (latency) and increased CPU usage. Since we don't care about higher CPU usage on a fixed instance type, this seems to be an improvement: our key metric is latency, and the lower we can get it, the better.
*Graph showing the change to CPU and latency after three deployments: increasing min instances, increasing instance size, changing instance class.*
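For completeness, the three deployments roughly correspond to the following ASG settings, shown as plain aws-cdk-lib shapes rather than our @guardian/cdk config, with illustrative values rather than the exact deployed ones (and assuming an aws-cdk-lib version that includes `InstanceClass.C7G`):

```ts
import * as ec2 from 'aws-cdk-lib/aws-ec2';

// 1. Increase the minimum number of instances (more headroom before scaling).
const increaseMinInstances = { minCapacity: 6 };

// 2. Increase the instance size within the burstable family.
const increaseInstanceSize = {
  instanceType: ec2.InstanceType.of(
    ec2.InstanceClass.BURSTABLE4_GRAVITON,
    ec2.InstanceSize.MEDIUM,
  ),
};

// 3. Change the instance class to fixed-performance Graviton: no credit
//    accounting, so sustained high CPU carries no extra cost and latency is
//    the only number we need to watch.
const changeInstanceClass = {
  instanceType: ec2.InstanceType.of(
    ec2.InstanceClass.C7G,
    ec2.InstanceSize.XLARGE,
  ),
};
```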
The next step would be to try reducing the number of C7G instances running to see the effect that has on the metrics. It would be worth performance testing this in the CODE environment before committing.