# Increased 5xx alarms across our services
This week we have seen a number of 5xx alarms:
- from the Frontend Article app; next steps are in the incident retro doc: https://docs.google.com/document/d/1lR6M05vZIRcabmTJoGYosCgmoj5eVCki8tvaPc3pNpo/edit
- failed health checks since Feb 11th: https://logs.gutools.co.uk/s/dotcom/goto/18a8d900-d266-11ee-9fbb-33f5a4d0cc15
- timeouts from the DCR facia rendering app, which have subsided since https://github.com/guardian/dotcom-rendering/pull/10656
### Tasks
- [ ] Investigate whether the circuit breaker config changes caused the 5xx article responses
- [ ] Improve logging in DCR article rendering
- [ ] Reduce noise in DCR article rendering logging (mainly amp errors which should probably be warnings)
Our current theory is that we are under-provisioning the article-rendering app, which is causing issues in the Article app. A t4g.small has a baseline performance of 20% CPU utilisation, but we're running at 40-45%.
We should try changing instance size/type or increasing the number of instances.
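For reference, here's the rough CPU-credit arithmetic behind that concern (the t4g.small figures are from the AWS burstable-instance docs; the 45% is just the upper end of what we've been observing):

```ts
// CPU-credit budget for one t4g.small at the load we're seeing.
// Per the AWS docs: t4g.small has 2 vCPUs, a 20% per-vCPU baseline,
// and earns 24 CPU credits/hour (1 credit = 1 vCPU at 100% for 1 minute).
const vcpus = 2;
const creditsEarnedPerHour = 24;
const observedUtilisation = 0.45; // upper end of the observed 40-45%

const creditsSpentPerHour = observedUtilisation * vcpus * 60; // 54
const netCreditsPerHour = creditsEarnedPerHour - creditsSpentPerHour; // -30

// A negative number means we drain any accrued credits and then, in unlimited
// mode, pay the flat surplus-credit rate for roughly half a vCPU-hour per
// instance every hour.
console.log({ creditsSpentPerHour, netCreditsPerHour });
```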
It's also worth noting that the amount of traffic handled by the article-rendering app is steadily increasing, perhaps in part due to the rollout of DCAR, but that doesn't fully explain the elevated traffic levels as we're also seeing an increase in requests for web articles.
A dashboard showing article-rendering app requests split by platform is here: https://metrics.gutools.co.uk/d/beb06834-f7c5-4e5c-9177-fc4ad6a71dcb/debugging-article-app-errors
We may want to consider using CPU utilisation as a scaling metric rather than latency, to keep our instances at an optimised level for performance.
I believe we have unlimited mode enabled on our account so we shouldn't be throttled to baseline.
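This is queryable per instance rather than something we need to take on trust. A quick sketch using the AWS SDK v3 EC2 client (instance IDs and region are placeholders, not our real values):

```ts
// Check whether the article-rendering instances really are in unlimited mode.
import {
  EC2Client,
  DescribeInstanceCreditSpecificationsCommand,
} from '@aws-sdk/client-ec2';

const ec2 = new EC2Client({ region: process.env.AWS_REGION });

const { InstanceCreditSpecifications } = await ec2.send(
  new DescribeInstanceCreditSpecificationsCommand({
    InstanceIds: ['i-0123456789abcdef0'], // placeholder: article-rendering instance IDs
  }),
);

// CpuCredits is either 'standard' or 'unlimited' for each instance.
console.log(InstanceCreditSpecifications);
```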
> We may want to consider using CPU utilisation as a scaling metric rather than latency, to keep our instances at an optimised level for performance.
I think CPU utilisation fits well with standard burstable instance types if you can stay under the baseline long enough for CPU credits to cover the time spent above it. For unlimited mode, targeting CPU utilisation would allow us to avoid incurring the additional 'unlimited' flat-rate costs. However, both modes incur the expense of increasing the number of instances during periods of load to meet CPU utilisation targets.
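For what it's worth, target-tracking on average CPU is a one-liner on the ASG. A minimal sketch below, written against plain aws-cdk-lib rather than the @guardian/cdk wrappers we actually deploy with (whose scaling props may differ); the capacities and the 50% target are illustrative placeholders:

```ts
import * as cdk from 'aws-cdk-lib';
import * as ec2 from 'aws-cdk-lib/aws-ec2';
import * as autoscaling from 'aws-cdk-lib/aws-autoscaling';

const app = new cdk.App();
const stack = new cdk.Stack(app, 'ArticleRenderingScalingSketch');
const vpc = new ec2.Vpc(stack, 'Vpc');

const asg = new autoscaling.AutoScalingGroup(stack, 'ArticleRenderingAsg', {
  vpc,
  // t4g.small, i.e. the burstable Graviton size discussed above
  instanceType: ec2.InstanceType.of(
    ec2.InstanceClass.BURSTABLE4_GRAVITON,
    ec2.InstanceSize.SMALL,
  ),
  machineImage: ec2.MachineImage.latestAmazonLinux2023(),
  minCapacity: 3, // placeholder values, not our real config
  maxCapacity: 12,
});

// Target-tracking on average ASG CPU: instances are added/removed to hold
// utilisation near the target, which for burstable types also keeps us
// closer to the credit baseline than a latency-based policy would.
asg.scaleOnCpuUtilization('KeepCpuModerate', {
  targetUtilizationPercent: 50,
});
```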
We recently reduced the cluster sizes while load has increased, so CPU utilisation has been climbing, pushing us above the baseline and incurring the additional flat-rate cost.
We could increase the cluster size or scale more quickly, but burstable instances (unlimited or not) effectively tie us to CPU utilisation, which in my opinion is a loose proxy (and hence less accurate) for latency, which we have hard targets for.
Burstable instances also prevent us from using the full capacity of the instances under sustained load, which we have been experiencing more of since adding new content types rendered by DCR.
> We should try changing instance size/type or increasing the number of instances.
This also makes me think that when we did the cost analysis of t4g vs c6.xlarge, we (myself included) did not consider the comparison between a cluster of unlimited-mode t4gs running with the additional flat-rate costs and a smaller cluster of c6.xlarge instances where we can utilise a much higher CPU% at no additional cost.
This tradeoff is discussed in the AWS documentation: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/burstable-performance-instances-unlimited-mode-concepts.html#when-to-use-unlimited-mode
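To make that comparison concrete, the missing analysis is roughly the sketch below. All prices and cluster sizes are hypothetical placeholders to be filled in from the AWS pricing pages for our region, and it assumes sustained load (i.e. the credit balance is already spent):

```ts
// Rough shape of the comparison: N unlimited-mode t4gs paying the
// surplus-credit rate under sustained load vs a smaller cluster of
// fixed-rate instances run at a much higher CPU%.
const t4gSmallHourly = 0.017; // placeholder on-demand $/hr
const surplusCreditRate = 0.04; // placeholder $ per surplus vCPU-hour
const fixedInstanceHourly = 0.14; // placeholder $/hr for a compute-optimised xlarge

// t4g cluster: each instance sustains `cpu` utilisation on 2 vCPUs with a
// 20% baseline, so everything above the baseline is billed as surplus.
function t4gClusterHourlyCost(instances: number, cpu: number): number {
  const surplusVcpuHoursPerInstance = Math.max(0, cpu - 0.2) * 2;
  return instances * (t4gSmallHourly + surplusVcpuHoursPerInstance * surplusCreditRate);
}

// Fixed-rate cluster: no extra charge however hard we run it.
function fixedClusterHourlyCost(instances: number): number {
  return instances * fixedInstanceHourly;
}

console.log('t4g x9 @ 45% CPU:', t4gClusterHourlyCost(9, 0.45).toFixed(3));
console.log('fixed x3:', fixedClusterHourlyCost(3).toFixed(3));
```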
The change log for article-rendering configuration changes since 23/02/2024 is here for cost analysis purposes.
After increasing the number of instances running and upping the size of the instances, CPU usage did decrease a bit, but not enough to accrue credits on a burstable instance type.
Changing the instance class to a non-burstable C7G instance type reduced the average response time (latency) and increased CPU usage. Since we don't care about higher CPU usage on a fixed instance type, this seems to be an improvement: our key metric is latency, and the lower we can get it, the better.
*Graph showing the change to CPU and latency after three deployments: increasing min instances, increasing instance size, changing instance class.*
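For completeness, the three deployments roughly correspond to the following ASG settings, shown as plain aws-cdk-lib shapes rather than our @guardian/cdk config, with illustrative values rather than the exact deployed ones (and assuming an aws-cdk-lib version that includes `InstanceClass.C7G`):

```ts
import * as ec2 from 'aws-cdk-lib/aws-ec2';

// 1. Increase the minimum number of instances (more headroom before scaling).
const increaseMinInstances = { minCapacity: 6 };

// 2. Increase the instance size within the burstable family.
const increaseInstanceSize = {
  instanceType: ec2.InstanceType.of(
    ec2.InstanceClass.BURSTABLE4_GRAVITON,
    ec2.InstanceSize.MEDIUM,
  ),
};

// 3. Change the instance class to fixed-performance Graviton: no credit
//    accounting, so sustained high CPU carries no extra cost and latency is
//    the only number we need to watch.
const changeInstanceClass = {
  instanceType: ec2.InstanceType.of(
    ec2.InstanceClass.C7G,
    ec2.InstanceSize.XLARGE,
  ),
};
```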
The next step would be to try reducing the number of C7G instances running to see the effect that has on the metrics. It would be worth performance testing this in the CODE environment before committing.