ol-infrastructure
Automatic instance refresh is needed for XPro #4180
Description/Context
The intent was that deploys would happen often enough that they would outrun instance credential expiration. This isn't happening with XPro and its long release cycle.
We need to refresh instances automatically on a regular schedule to prevent this.
Plan/Design
See https://github.com/mitodl/ol-infrastructure/issues/2164 for an example of a pipeline that automatically refreshes instances on a schedule using Concourse's schedule task.
AWS docs:
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/Using_Filtering.html#Filtering_Resources_CLI
Current (not working :) AWS CLI query:
aws autoscaling describe-auto-scaling-groups --color on --filters "Name=tag:Name,Values=edxapp-worker-xpro-ci" "Name=tag:Environment,Values=xpro-ci" --query "AutoScalingGroups[*]"
(Note to future self: Check that my AWS cli creds are actually MIT :)
So, XPro lacks some tags that would make this job easier.
As it is, I can't find a native way to disambiguate between web and worker ASGs. @blarghmatey points out that we don't need to care for this use case, but the way everything is currently coded does, so before I rewrite the world I'm going to talk to @Ardiea, who wrote it, and get his take :)
# feoh @ fulcrum in ~/src/mit/ol-infrastructure/src/ol_concourse/pipelines/infrastructure/xpro on git:cpatti_xpro_instance_refresh x ol-infrastructure-ha13_Yh9-py3.11 [17:38:05]
$ aws autoscaling describe-auto-scaling-groups --color on --filters "Name=tag:Application,Values=edxapp" "Name=tag:Environment,Values=xpro-ci" --query "AutoScalingGroups[*].AutoScalingGroupName"
[
"edxapp-web-autoscaling-group-1442034",
"edxapp-worker-autoscaling-group-9127991"
]
I can always pass the blob we get through jq - it's ugly but might do the needful.
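For comparison, here is a rough Python sketch of the same post-processing idea. It assumes the web/worker distinction can be read from the `-web-` / `-worker-` substring in the ASG names shown above, which is only a naming convention and not something AWS guarantees. The function name is made up for illustration.

```python
import json


def split_by_node_type(cli_output: str) -> dict[str, list[str]]:
    """Group ASG names into 'web' and 'worker' buckets by name substring.

    Assumes cli_output is the JSON list of names printed by the
    describe-auto-scaling-groups query above.
    """
    names = json.loads(cli_output)
    groups: dict[str, list[str]] = {"web": [], "worker": []}
    for name in names:
        for node_type in groups:
            # Relies on the "-web-" / "-worker-" naming convention.
            if f"-{node_type}-" in name:
                groups[node_type].append(name)
    return groups


sample = (
    '["edxapp-web-autoscaling-group-1442034", '
    '"edxapp-worker-autoscaling-group-9127991"]'
)
print(split_by_node_type(sample))
```

Matching on names is brittle, which is the main argument for a dedicated tag instead.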
Added edxapp_node_type tag so we can easily discern which are web and which are worker.
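With the tag in place the lookup can be done entirely with filters. A minimal sketch of building that command, assuming the tag values are `web` and `worker` (the actual values aren't recorded in this issue) and using a hypothetical helper name:

```python
import shlex


def describe_asg_command(environment: str, node_type: str) -> list[str]:
    """Build the AWS CLI argv that lists ASG names for one node type.

    node_type is assumed to be "web" or "worker"; check the real tag
    values before relying on this.
    """
    return [
        "aws", "autoscaling", "describe-auto-scaling-groups",
        "--filters",
        "Name=tag:Application,Values=edxapp",
        f"Name=tag:Environment,Values={environment}",
        f"Name=tag:edxapp_node_type,Values={node_type}",
        "--query", "AutoScalingGroups[*].AutoScalingGroupName",
        "--output", "json",
    ]


# Print a copy-pasteable shell command for the CI worker group.
print(shlex.join(describe_asg_command("xpro-ci", "worker")))
```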
The refresh is starting successfully:
https://cicd.odl.mit.edu/teams/infrastructure/pipelines/instance-refresh-xpro/jobs/ci-web-instance-refresh/builds/3
Both instance types (web, worker) refresh properly but the pipeline tries to start both jobs at once and fails saying "a refresh is already in progress".
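One possible way around the collision is to run the refreshes one after another and wait for each to reach a terminal status before starting the next. The sketch below is not what the pipeline does today; it assumes a boto3 `autoscaling` client is passed in, and the function name, polling interval, and stop-on-failure behavior are all illustrative choices.

```python
import time

# Statuses after which a refresh will not change any further.
TERMINAL_STATUSES = {
    "Successful", "Failed", "Cancelled",
    "RollbackFailed", "RollbackSuccessful",
}


def refresh_in_sequence(client, asg_names, poll_seconds=30, sleep=time.sleep):
    """Start an instance refresh on each ASG in turn, waiting for each to end.

    client is expected to behave like boto3.client("autoscaling").
    Returns a dict of ASG name -> final status; stops at the first
    refresh that does not end as "Successful".
    """
    results = {}
    for name in asg_names:
        started = client.start_instance_refresh(AutoScalingGroupName=name)
        refresh_id = started["InstanceRefreshId"]
        while True:
            response = client.describe_instance_refreshes(
                AutoScalingGroupName=name,
                InstanceRefreshIds=[refresh_id],
            )
            status = response["InstanceRefreshes"][0]["Status"]
            if status in TERMINAL_STATUSES:
                break
            sleep(poll_seconds)
        results[name] = status
        if status != "Successful":
            break  # Assumption: don't touch the next group after a failure.
    return results
```

Usage would look something like `refresh_in_sequence(boto3.client("autoscaling"), [web_asg, worker_asg])`. In Concourse the same ordering could presumably be expressed at the job level instead of in code.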
Works perfectly in CI, QA & Production.
One unfortunate side effect of adding the new edxapp_node_type tag is that it won't officially roll out to XPro production via Pulumi until December.
So I added the tag manually to the two ASGs in xpro production. This is very low risk.
However, on the off chance our ASGs are destroyed, we'll need to redo this by hand or the auto instance rotation pipeline will fail.
Given that the degraded state is just the state we've been living in for years, this feels like a reasonable trade-off to me.
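If we ever want an early warning for that case, a small pre-flight check could flag ASGs that have lost the hand-applied tag. This is only a sketch: it assumes the input is the parsed JSON from `describe-auto-scaling-groups` filtered by Application and Environment only (not by the new tag), and the function name and sample group names are hypothetical.

```python
REQUIRED_TAG = "edxapp_node_type"


def groups_missing_tag(describe_output: dict, tag_key: str = REQUIRED_TAG) -> list[str]:
    """Return names of ASGs in the describe output that lack tag_key."""
    missing = []
    for group in describe_output.get("AutoScalingGroups", []):
        keys = {tag["Key"] for tag in group.get("Tags", [])}
        if tag_key not in keys:
            missing.append(group["AutoScalingGroupName"])
    return missing


# Hypothetical sample: the worker group was recreated without the tag.
sample_output = {
    "AutoScalingGroups": [
        {
            "AutoScalingGroupName": "example-web-asg",
            "Tags": [{"Key": "edxapp_node_type", "Value": "web"}],
        },
        {
            "AutoScalingGroupName": "example-worker-asg",
            "Tags": [{"Key": "Application", "Value": "edxapp"}],
        },
    ]
}
print(groups_missing_tag(sample_output))
```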