
Investigate Why Preview Is Scaling Up To 10x Minimum

Open JamieB-gu opened this issue 1 year ago • 6 comments

Since adding scaling alarms to Preview on the 1st of August, to allow it to scale, we've seen spikes in scaling up to ten times the minimum capacity. For example, the minimum is 3 boxes, and we've seen scaling take it up to ~29 boxes.

This is a problem, because the max capacity is also only 10x the minimum, i.e. 30 boxes. If a deploy runs while we're scaled up this much, it fails, because the deploy needs to double the boxes but it doesn't have the headroom.
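As a rough sketch of the headroom arithmetic above: the snippet assumes a deploy needs twice the current desired capacity to fit under the maximum, which is how the description reads. The function names are hypothetical and not taken from any Guardian tooling.

```python
MIN_CAPACITY = 3
MAX_CAPACITY = 10 * MIN_CAPACITY  # 30 boxes, per the issue description


def deploy_headroom(desired: int, max_capacity: int) -> int:
    """Boxes left over if a deploy doubles `desired`; negative means it cannot fit."""
    return max_capacity - 2 * desired


def can_deploy(desired: int, max_capacity: int) -> bool:
    """True when doubling the current desired capacity stays within the maximum."""
    return deploy_headroom(desired, max_capacity) >= 0


if __name__ == "__main__":
    for desired in (3, 15, 16, 29):
        print(desired, can_deploy(desired, MAX_CAPACITY))
```

Under that assumption, a deploy only fits while the desired capacity is 15 boxes or fewer; at ~29 boxes it would need 58.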

JamieB-gu avatar Aug 09 '24 15:08 JamieB-gu

fyi @SiAdcock @marjisound

arelra avatar Feb 28 '25 15:02 arelra

The PROD stack doesn't appear to have alarms attached to either the scale-up or the scale-down policy. This would suggest the scaling policies aren't actually active; however, we do see the desired capacity creeping up over time.
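One way to double-check this, sketched under assumptions: the EC2 Auto Scaling `DescribePolicies` response lists each policy under `ScalingPolicies` with an `Alarms` array, so policies with an empty array have nothing to trigger them. The helper name and the sample data below are made up for illustration; the real response would come from something like `boto3.client("autoscaling").describe_policies(AutoScalingGroupName=...)`.

```python
from typing import Any


def policies_without_alarms(response: dict[str, Any]) -> list[str]:
    """Return names of scaling policies that have no CloudWatch alarms attached."""
    return [
        policy["PolicyName"]
        for policy in response.get("ScalingPolicies", [])
        if not policy.get("Alarms")
    ]


if __name__ == "__main__":
    # Hypothetical response shape, not real data from the PROD stack.
    sample = {
        "ScalingPolicies": [
            {"PolicyName": "scale-up", "Alarms": []},
            {"PolicyName": "scale-down", "Alarms": [{"AlarmName": "low-cpu"}]},
        ]
    }
    print(policies_without_alarms(sample))
```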

arelra avatar May 29 '25 13:05 arelra

We noticed Preview regularly hits 100% CPU. Either there is an issue with a periodic process on the box, or we need to bump the instance type to give it more compute.

arelra avatar May 29 '25 13:05 arelra

  • Preview makes a lot of requests to CAPI to populate the most viewed / deeply read components. In some cases, many of these requests fail, tripping the circuit breaker. This leads to the healthcheck failing for the instance, causing it to be terminated. At present, this seems to happen at least once every 10 minutes.

  • The circuit breaker only seems to open on the Preview app.

  • Preview scales up during deployments, and then slowly shrinks capacity. However, some instances fail their healthchecks while this is happening, at which point the scaling down stops.

  • We also noticed instances are launched to balance the instances in the AZs, at which point scaling down stops:

Image
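To make the failure chain in the first bullet concrete, here is a deliberately simplified sketch. How Preview's circuit breaker and healthcheck are actually wired is not shown in this thread, so the class, the threshold and the status codes are all assumptions, and the breaker omits the timed half-open state a real one would have.

```python
class CircuitBreaker:
    """Toy breaker: opens after N consecutive failures, closes on the next success."""

    def __init__(self, failure_threshold: int) -> None:
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0

    @property
    def is_open(self) -> bool:
        return self.consecutive_failures >= self.failure_threshold

    def record(self, success: bool) -> None:
        # A success resets the count; a failure moves the breaker towards open.
        if success:
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1


def healthcheck_status(breaker: CircuitBreaker) -> int:
    """Assumed behaviour: the healthcheck reports unhealthy while the breaker is open."""
    return 503 if breaker.is_open else 200


if __name__ == "__main__":
    breaker = CircuitBreaker(failure_threshold=3)  # threshold is illustrative
    for _ in range(3):
        breaker.record(success=False)  # e.g. failed CAPI requests
    print(healthcheck_status(breaker))
```

If the healthcheck is coupled to the breaker like this, a burst of failed upstream requests is enough to get an otherwise working instance terminated and replaced.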

SiAdcock avatar Jun 11 '25 10:06 SiAdcock

Updated Preview to t4g.small: https://github.com/guardian/platform/pull/1873

Doesn't seem to have made much difference: instances are still cycling, but let's confirm over a longer time period.

EDIT: reverted

arelra avatar Jun 12 '25 16:06 arelra

Raised issue with CAPI / CoPip (?) here: https://chat.google.com/room/AAAAgU85_rM/8Uv0clujiwA/8Uv0clujiwA?cls=10

arelra avatar Jun 12 '25 16:06 arelra