Investigate Why Preview Is Scaling Up To 10x Minimum
Since adding scaling alarms to Preview on 1st August to allow it to scale, we've seen spikes of up to ten times the minimum capacity. For example, the minimum is 3 boxes, and we've seen scaling take it up to ~29 boxes.
This is a problem because the max capacity is also only 10x the minimum, i.e. 30 boxes. If a deploy runs while we're scaled up this far, it fails: the deploy needs to double the number of boxes, and there isn't the headroom (at 29 boxes it would need room for 58, well over the max of 30).
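As a quick pre-deploy sanity check, something like this sketch (using boto3; the ASG name is a placeholder, not the real one) would tell us whether the group currently has the headroom a doubling deploy needs:

```python
# Sketch: does the ASG have room for a deploy that doubles desired capacity?
# "preview-PROD" is a placeholder ASG name.
import boto3

client = boto3.client("autoscaling")
group = client.describe_auto_scaling_groups(
    AutoScalingGroupNames=["preview-PROD"]
)["AutoScalingGroups"][0]

desired, maximum = group["DesiredCapacity"], group["MaxSize"]
ok = desired * 2 <= maximum
print(f"desired={desired} max={maximum} deploy headroom: {'OK' if ok else 'INSUFFICIENT'}")
```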
fyi @SiAdcock @marjisound
The PROD stack doesn't look like it has alarms attached to either the scale-up or the scale-down policy. This would suggest the scaling policies aren't actually active; however, we do see the desired capacity creeping up over time.
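To confirm, we can list the policies and the CloudWatch alarms attached to each (sketch, again with a placeholder ASG name):

```python
# Sketch: list scaling policies on the PROD ASG and their attached alarms,
# to confirm whether the policies can ever actually fire.
import boto3

client = boto3.client("autoscaling")
policies = client.describe_policies(
    AutoScalingGroupName="preview-PROD"
)["ScalingPolicies"]

for policy in policies:
    alarm_names = [a["AlarmName"] for a in policy.get("Alarms", [])]
    print(policy["PolicyName"], "->", alarm_names or "no alarms attached")
```

If the policies come back with no alarms, nothing ever invokes them, and the creeping desired capacity is presumably coming from somewhere else (deploys, or the AZ rebalancing noted below).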
We noticed Preview regularly hits 100% CPU. Either there is an issue with a periodic process on the box, or we need to bump the instance type to give it more compute.
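To quantify how often this happens, here's a sketch pulling the group-level CPU metric for the last 24 hours (boto3, placeholder ASG name, standard AWS/EC2 CPUUtilization metric):

```python
# Sketch: count how many 5-minute buckets in the last 24h peaked at ~100% CPU.
from datetime import datetime, timedelta, timezone
import boto3

cloudwatch = boto3.client("cloudwatch")
now = datetime.now(timezone.utc)

stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "AutoScalingGroupName", "Value": "preview-PROD"}],
    StartTime=now - timedelta(hours=24),
    EndTime=now,
    Period=300,            # 5-minute buckets
    Statistics=["Maximum"],
)

pegged = [p for p in stats["Datapoints"] if p["Maximum"] >= 99.0]
print(f"{len(pegged)} of {len(stats['Datapoints'])} buckets hit ~100% CPU")
```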
- Preview makes a lot of requests to CAPI to populate the most viewed / deeply read components. In some cases many of these requests fail, tripping the circuit breaker. This causes the instance's healthcheck to fail, and the instance is then terminated. At present this seems to happen at least once every 10 minutes (a sketch of the mechanism follows this list).
- The circuit breaker only seems to open on the preview app.
  - We noticed the errors detail timeouts to the iam-preview domain.
- Preview scales up during deployments and then slowly shrinks capacity. However, some instances have failing healthchecks while this is happening, and at that point the scaling down stops.
- We also noticed instances are launched to balance the instances across the AZs, at which point scaling down stops (see the second sketch after this list).
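For reference, a minimal sketch of the failure mechanism in the first bullet above. Preview itself isn't Python and the thresholds here are invented; the shape is what matters: failed CAPI calls open the breaker, and because the healthcheck reflects the breaker state, the instance is marked unhealthy and terminated.

```python
# Illustrative only: enough failed CAPI calls open the breaker, the
# healthcheck then reports unhealthy, and the instance is terminated.
import time

class CircuitBreaker:
    """Opens after N consecutive failures; closes again after a timeout.
    Thresholds are invented for illustration."""

    def __init__(self, failure_threshold=5, reset_timeout_secs=60):
        self.failure_threshold = failure_threshold
        self.reset_timeout_secs = reset_timeout_secs
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def is_open(self):
        if self.opened_at is None:
            return False
        if time.monotonic() - self.opened_at > self.reset_timeout_secs:
            # Timeout elapsed: close the breaker and allow traffic again.
            self.opened_at = None
            self.failures = 0
            return False
        return True

capi_breaker = CircuitBreaker()

def healthcheck():
    # Because the healthcheck reflects the breaker state, a burst of CAPI
    # timeouts marks the whole instance unhealthy even though the box is fine.
    return ("unhealthy", 503) if capi_breaker.is_open() else ("ok", 200)
```

If this is the right picture, decoupling the healthcheck from the CAPI breaker (e.g. rendering without the most viewed / deeply read components while it's open) should stop the healthcheck-driven terminations.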
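And if the AZ-rebalancing launches are what keeps resetting scale-in, one option while we investigate would be to suspend that process on the group (sketch; placeholder ASG name, and note this leaves the group unbalanced until resumed):

```python
# Sketch: suspend AZRebalance so rebalancing launches stop interrupting
# scale-in while we investigate.
import boto3

client = boto3.client("autoscaling")
client.suspend_processes(
    AutoScalingGroupName="preview-PROD",
    ScalingProcesses=["AZRebalance"],
)
# Re-enable later with:
# client.resume_processes(AutoScalingGroupName="preview-PROD",
#                         ScalingProcesses=["AZRebalance"])
```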
Updated Preview to t4g.small: https://github.com/guardian/platform/pull/1873
Doesn't seem to have made much difference: instances are still cycling, but let's confirm over a longer time period.
EDIT: reverted
Raised issue with CAPI / CoPip (?) here: https://chat.google.com/room/AAAAgU85_rM/8Uv0clujiwA/8Uv0clujiwA?cls=10