Investigate Why Preview Is Scaling Up To 10x Minimum
Since adding scaling alarms to Preview on 1st August to allow it to scale, we've seen spikes of up to ten times the minimum capacity. For example, the minimum is 3 boxes, and we've seen scaling take it up to ~29 boxes.
This is a problem because the max capacity is also only 10x the minimum, i.e. 30 boxes. If a deploy runs while we're scaled up this far, it fails: the deploy needs to double the number of boxes, and there isn't the headroom (at 29 boxes it would need room for 58, well over the max of 30).
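As a quick pre-deploy sanity check, something like this sketch (using boto3; the ASG name is a placeholder, not the real one) would tell us whether the group currently has the headroom a doubling deploy needs:

```python
# Sketch: does the ASG have room for a deploy that doubles desired capacity?
# "preview-PROD" is a placeholder ASG name.
import boto3

client = boto3.client("autoscaling")
group = client.describe_auto_scaling_groups(
    AutoScalingGroupNames=["preview-PROD"]
)["AutoScalingGroups"][0]

desired, maximum = group["DesiredCapacity"], group["MaxSize"]
ok = desired * 2 <= maximum
print(f"desired={desired} max={maximum} deploy headroom: {'OK' if ok else 'INSUFFICIENT'}")
```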
fyi @SiAdcock @marjisound
The PROD stack doesn't look like it has alarms attached to either the scale-up or the scale-down policy. This would suggest the scaling policies aren't actually active; however, we do see the desired capacity creeping up over time.
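To confirm, we can list the policies and the CloudWatch alarms attached to each (sketch, again with a placeholder ASG name):

```python
# Sketch: list scaling policies on the PROD ASG and their attached alarms,
# to confirm whether the policies can ever actually fire.
import boto3

client = boto3.client("autoscaling")
policies = client.describe_policies(
    AutoScalingGroupName="preview-PROD"
)["ScalingPolicies"]

for policy in policies:
    alarm_names = [a["AlarmName"] for a in policy.get("Alarms", [])]
    print(policy["PolicyName"], "->", alarm_names or "no alarms attached")
```

If the policies come back with no alarms, nothing ever invokes them, and the creeping desired capacity is presumably coming from somewhere else (deploys, or the AZ rebalancing noted below).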
We noticed Preview regularly hits 100% CPU. Either there is an issue with a periodic process on the box, or we need to bump the instance type to give it more compute.
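To quantify how often this happens, here's a sketch pulling the group-level CPU metric for the last 24 hours (boto3, placeholder ASG name, standard AWS/EC2 CPUUtilization metric):

```python
# Sketch: count how many 5-minute buckets in the last 24h peaked at ~100% CPU.
from datetime import datetime, timedelta, timezone
import boto3

cloudwatch = boto3.client("cloudwatch")
now = datetime.now(timezone.utc)

stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "AutoScalingGroupName", "Value": "preview-PROD"}],
    StartTime=now - timedelta(hours=24),
    EndTime=now,
    Period=300,            # 5-minute buckets
    Statistics=["Maximum"],
)

pegged = [p for p in stats["Datapoints"] if p["Maximum"] >= 99.0]
print(f"{len(pegged)} of {len(stats['Datapoints'])} buckets hit ~100% CPU")
```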
- Preview makes a lot of requests to CAPI to populate the most viewed / deeply read components. In some cases many of these requests fail, tripping the circuit breaker. This causes the instance's healthcheck to fail, and the instance is then terminated. At present this seems to happen at least once every 10 minutes (a sketch of the mechanism follows this list).
- The circuit breaker only seems to open on the preview app.
  - We noticed the errors detail timeouts to the iam-preview domain.
- Preview scales up during deployments and then slowly shrinks capacity. However, some instances have failing healthchecks while this is happening, and at that point the scaling down stops.
- We also noticed instances are launched to balance the instances across the AZs, at which point scaling down stops (see the second sketch after this list).
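For reference, a minimal sketch of the failure mechanism in the first bullet above. Preview itself isn't Python and the thresholds here are invented; the shape is what matters: failed CAPI calls open the breaker, and because the healthcheck reflects the breaker state, the instance is marked unhealthy and terminated.

```python
# Illustrative only: enough failed CAPI calls open the breaker, the
# healthcheck then reports unhealthy, and the instance is terminated.
import time

class CircuitBreaker:
    """Opens after N consecutive failures; closes again after a timeout.
    Thresholds are invented for illustration."""

    def __init__(self, failure_threshold=5, reset_timeout_secs=60):
        self.failure_threshold = failure_threshold
        self.reset_timeout_secs = reset_timeout_secs
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def is_open(self):
        if self.opened_at is None:
            return False
        if time.monotonic() - self.opened_at > self.reset_timeout_secs:
            # Timeout elapsed: close the breaker and allow traffic again.
            self.opened_at = None
            self.failures = 0
            return False
        return True

capi_breaker = CircuitBreaker()

def healthcheck():
    # Because the healthcheck reflects the breaker state, a burst of CAPI
    # timeouts marks the whole instance unhealthy even though the box is fine.
    return ("unhealthy", 503) if capi_breaker.is_open() else ("ok", 200)
```

If this is the right picture, decoupling the healthcheck from the CAPI breaker (e.g. rendering without the most viewed / deeply read components while it's open) should stop the healthcheck-driven terminations.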
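And if the AZ-rebalancing launches are what keeps resetting scale-in, one option while we investigate would be to suspend that process on the group (sketch; placeholder ASG name, and note this leaves the group unbalanced until resumed):

```python
# Sketch: suspend AZRebalance so rebalancing launches stop interrupting
# scale-in while we investigate.
import boto3

client = boto3.client("autoscaling")
client.suspend_processes(
    AutoScalingGroupName="preview-PROD",
    ScalingProcesses=["AZRebalance"],
)
# Re-enable later with:
# client.resume_processes(AutoScalingGroupName="preview-PROD",
#                         ScalingProcesses=["AZRebalance"])
```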
Updated Preview to t4g.small: https://github.com/guardian/platform/pull/1873
Doesn't seem to have made much difference: instances are still cycling, but let's confirm over a longer time period.
EDIT: reverted
Raised issue with CAPI / CoPip (?) here: https://chat.google.com/room/AAAAgU85_rM/8Uv0clujiwA/8Uv0clujiwA?cls=10