terraform-aws-gitlab-runner
'remove-gitlab-registration' script can invalidate in-use runner authentication token
Describe the bug
We're in the process of upgrading to `7.2.2` + the new runner registration workflow, and discovered an unfortunate race condition with `remove-gitlab-registration`.
Sometimes when cycling the agent instance, the AWS ASG will spin up the new instance before tearing down the old one. In that case, the following sequence of events can occur:
Old Instance | New Instance |
---|---|
starts | n/a |
obtains Runner Token 1 from Gitlab | n/a |
adds Runner Token 1 to SSM | n/a |
running | n/a |
running | starts |
running | fetches and validates Runner Token 1 from SSM |
running | finishes startup |
begins shutdown | running |
remove_gitlab_registration.sh runs and invalidates Runner Token 1 | running |
finishes shutdown | receives 403 from Gitlab because Runner Token 1 is now invalid |
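
One way to avoid this sequence might be for the shutdown script to skip unregistering while another instance that may share the token is still in service. The following is a minimal sketch of that decision only, not the module's implementation. The function name `should_remove_registration` and the way instance IDs are passed in are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical guard (not part of the module): decide whether the shutdown
# hook should invalidate the shared runner token.
#
# Usage: should_remove_registration <own-instance-id> [<in-service-instance-id> ...]
# Returns 0 (remove) only when no instance other than this one is in service.
should_remove_registration() {
  local self_id="$1"
  shift
  local id
  for id in "$@"; do
    if [ "$id" != "$self_id" ]; then
      # Another instance may already have fetched the same token from SSM,
      # so invalidating it would break that instance.
      return 1
    fi
  done
  return 0
}
```

The in-service instance IDs could come from, for example, `aws autoscaling describe-auto-scaling-groups`, which reports each instance in the group with its lifecycle state. Keeping the decision in a pure function makes it easy to test without AWS access.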
To Reproduce
The exact steps to reproduce seem to be tricky because the ASG seems inconsistent in how it's cycling the instances, but I believe it's:
- Upgrade module to `7.2.2`
- Use the new runner registration workflow in the `gitlab-runner.tftpl` script
  - To enable this, we're passing in:
    - `runner_gitlab.access_token_secure_parameter_store_name`
    - `runner_gitlab_registration_config.type`
    - `runner_gitlab_registration_config.project_id`
- Run `terraform apply`
- Wait for the ASG to stabilize
- Make a change to the agent's user-data (e.g. by modifying `runner_install.pre_install_script`), and re-run `terraform apply`
- Wait for the ASG to stabilize
- Verify that the new agent instance cannot contact Gitlab because of `Forbidden` errors
Expected behavior
The new agent instance can continue to run jobs from Gitlab.
Additional context
As a concrete example, the new instance fetched and validated the token at `Jan 05 18:49:04`, but the original instance then started shutdown at `Jan 05 18:49:17` and invalidated the token, after which the new instance started logging the following:
Jan 05 18:50:51 gitlab-runner[2957]: {"level":"error","msg":"Checking for jobs... forbidden","runner":"<redacted>","status":"POST https://gitlab.com/api/v4/jobs/request: 403 Forbidden","time":"2024-01-05T18:50:51Z"}
Jan 05 18:50:54 gitlab-runner[2957]: {"level":"error","msg":"Checking for jobs... forbidden","runner":"<redacted>","status":"POST https://gitlab.com/api/v4/jobs/request: 403 Forbidden","time":"2024-01-05T18:50:54Z"}
Jan 05 18:50:57 gitlab-runner[2957]: {"level":"error","msg":"Checking for jobs... forbidden","runner":"<redacted>","status":"POST https://gitlab.com/api/v4/jobs/request: 403 Forbidden","time":"2024-01-05T18:50:57Z"}
Jan 05 18:50:57 gitlab-runner[2957]: {"level":"error","msg":"Runner \"<redacted>\" is unhealthy and will be disabled for 1h0m0s seconds!","time":"2024-01-05T18:50:57Z","unhealthy_requests":3,"unhealthy_requests_limit":3}
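
To check quickly whether an instance has hit this state, it can help to count the `403 Forbidden` responses in the runner's log output. This is a minimal sketch that assumes log lines in the format shown above arrive on stdin. The function name `count_forbidden` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count runner log lines that report a 403 from GitLab.
# Reads log lines on stdin and prints the number of matching lines.
count_forbidden() {
  # grep -c prints 0 but exits non-zero when nothing matches,
  # so "|| true" keeps the function's exit status at 0 in that case.
  grep -c '403 Forbidden' || true
}
```

On the instance, something like `journalctl -t gitlab-runner | count_forbidden` could feed it, assuming the runner logs to the journal under the `gitlab-runner` identifier as in the excerpt above.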
As a temporary workaround, we're passing the following into `runner_install.start_script`:
# This can be removed once https://github.com/cattle-ops/terraform-aws-gitlab-runner/issues/1062 is resolved.
echo 'Disabling "Remove the GitLab Runner from GitLab at shutdown" job...'
chmod a-x /opt/remove_gitlab_registration.sh || true
systemctl disable remove-gitlab-registration.service || true
systemctl stop remove-gitlab-registration.service || true
echo 'Done!'