terraform-aws-gitlab-runner

'remove-gitlab-registration' script can invalidate in-use runner authentication token

Open schmidt-galen-heb opened this issue 1 year ago • 5 comments

Describe the bug

We're in the process of upgrading to 7.2.2 + the new runner registration workflow, and discovered an unfortunate race condition with remove-gitlab-registration.

Sometimes when cycling the agent instance, the AWS ASG will spin up the new instance before tearing down the old one. In that case, the following sequence of events can occur:

| Old Instance | New Instance |
| --- | --- |
| starts | n/a |
| obtains Runner Token 1 from Gitlab | n/a |
| adds Runner Token 1 to SSM | n/a |
| running | n/a |
| running | starts |
| running | fetches and validates Runner Token 1 from SSM |
| running | finishes startup |
| begins shutdown | running |
| remove_gitlab_registration.sh runs and invalidates Runner Token 1 | running |
| finishes shutdown | receives 403 from Gitlab because Runner Token 1 is now invalid |
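
The sequence above can be reproduced in miniature without touching AWS or GitLab. The following is only a toy sketch: plain files stand in for the SSM parameter and for GitLab's list of valid tokens, and every function and file name in it is hypothetical rather than taken from the module.

```shell
#!/usr/bin/env bash
# Toy simulation of the race. Files stand in for SSM and GitLab;
# nothing here calls a real service, and all names are hypothetical.
set -u

state_dir="$(mktemp -d)"
ssm_param="$state_dir/ssm_runner_token"        # stand-in for the SSM parameter
gitlab_valid="$state_dir/gitlab_valid_tokens"  # stand-in for GitLab's valid tokens
: > "$gitlab_valid"

# Old instance: obtain a token from "GitLab" and store it in "SSM".
register_runner() {
  local token="$1"
  echo "$token" >> "$gitlab_valid"
  echo "$token" > "$ssm_param"
}

# New instance: reuse whatever token is already stored in "SSM".
fetch_token() {
  cat "$ssm_param"
}

# Shutdown hook on the old instance: invalidate the token in "GitLab".
remove_registration() {
  local token="$1"
  # grep exits non-zero when no lines remain, hence the "|| true".
  grep -vxF -- "$token" "$gitlab_valid" > "$gitlab_valid.tmp" || true
  mv "$gitlab_valid.tmp" "$gitlab_valid"
}

# Job polling: print 200 if the token is still valid, otherwise 403.
request_job() {
  local token="$1"
  if grep -qxF -- "$token" "$gitlab_valid"; then echo 200; else echo 403; fi
}

register_runner "runner-token-1"        # old instance starts and registers
new_token="$(fetch_token)"              # new instance starts and reuses the token
before="$(request_job "$new_token")"    # new instance can still poll for jobs
remove_registration "runner-token-1"    # old instance runs its shutdown hook
after="$(request_job "$new_token")"     # new instance is now rejected

echo "before shutdown: $before, after shutdown: $after"
```

The point of the sketch is that both instances share one token, so the old instance's shutdown hook also revokes the credential the new instance is using.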

To Reproduce

The exact steps to reproduce are tricky to pin down because the ASG is inconsistent in how it cycles the instances, but I believe they are:

  1. Upgrade module to 7.2.2
  2. Use the new runner registration workflow in the gitlab-runner.tftpl script
    • To enable this, we're passing in:
      • runner_gitlab.access_token_secure_parameter_store_name
      • runner_gitlab_registration_config.type
      • runner_gitlab_registration_config.project_id
  3. Run terraform apply
  4. Wait for the ASG to stabilize
  5. Make a change to the agent's user-data (e.g. by modifying runner_install.pre_install_script), and re-run terraform apply
  6. Wait for the ASG to stabilize
  7. Verify that the new agent instance cannot contact Gitlab because of Forbidden errors

Expected behavior

The new agent instance can continue to run jobs from Gitlab.

Additional context

As a concrete example, the new instance fetched and validated the token at Jan 05 18:49:04, but the original instance then started shutdown at Jan 05 18:49:17 and invalidated the token, after which the new instance started logging the following:

Jan 05 18:50:51 gitlab-runner[2957]: {"level":"error","msg":"Checking for jobs... forbidden","runner":"<redacted>","status":"POST https://gitlab.com/api/v4/jobs/request: 403 Forbidden","time":"2024-01-05T18:50:51Z"}
Jan 05 18:50:54 gitlab-runner[2957]: {"level":"error","msg":"Checking for jobs... forbidden","runner":"<redacted>","status":"POST https://gitlab.com/api/v4/jobs/request: 403 Forbidden","time":"2024-01-05T18:50:54Z"}
Jan 05 18:50:57 gitlab-runner[2957]: {"level":"error","msg":"Checking for jobs... forbidden","runner":"<redacted>","status":"POST https://gitlab.com/api/v4/jobs/request: 403 Forbidden","time":"2024-01-05T18:50:57Z"}
Jan 05 18:50:57 gitlab-runner[2957]: {"level":"error","msg":"Runner \"<redacted>\" is unhealthy and will be disabled for 1h0m0s seconds!","time":"2024-01-05T18:50:57Z","unhealthy_requests":3,"unhealthy_requests_limit":3}
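
For anyone checking whether they are hitting the same failure, a small filter over the runner's log output may help. This is a rough sketch: the function name count_forbidden is made up for illustration, and the sample lines are the ones quoted above. Piping in real logs, for example from journalctl -u gitlab-runner, assumes the systemd unit is named gitlab-runner on your instance.

```shell
#!/usr/bin/env bash
# Count "403 Forbidden" job-request errors in gitlab-runner log lines on stdin.
# count_forbidden is a hypothetical helper, not part of the module.
count_forbidden() {
  # grep -c prints 0 but exits non-zero when nothing matches, hence "|| true".
  grep -c 'jobs/request: 403 Forbidden' || true
}

# Sample input: the log lines quoted in this issue.
sample_log="$(cat <<'EOF'
Jan 05 18:50:51 gitlab-runner[2957]: {"level":"error","msg":"Checking for jobs... forbidden","runner":"<redacted>","status":"POST https://gitlab.com/api/v4/jobs/request: 403 Forbidden","time":"2024-01-05T18:50:51Z"}
Jan 05 18:50:54 gitlab-runner[2957]: {"level":"error","msg":"Checking for jobs... forbidden","runner":"<redacted>","status":"POST https://gitlab.com/api/v4/jobs/request: 403 Forbidden","time":"2024-01-05T18:50:54Z"}
Jan 05 18:50:57 gitlab-runner[2957]: {"level":"error","msg":"Checking for jobs... forbidden","runner":"<redacted>","status":"POST https://gitlab.com/api/v4/jobs/request: 403 Forbidden","time":"2024-01-05T18:50:57Z"}
Jan 05 18:50:57 gitlab-runner[2957]: {"level":"error","msg":"Runner \"<redacted>\" is unhealthy and will be disabled for 1h0m0s seconds!","time":"2024-01-05T18:50:57Z","unhealthy_requests":3,"unhealthy_requests_limit":3}
EOF
)"

forbidden_count="$(printf '%s\n' "$sample_log" | count_forbidden)"
echo "403 responses seen: $forbidden_count"

# On a live instance (assumes the unit is called gitlab-runner):
#   journalctl -u gitlab-runner --no-pager | count_forbidden
```

Three such responses in a row match the unhealthy_requests_limit of 3 shown in the last log line, after which the runner disables itself.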

As a temporary workaround, we're passing the following into runner_install.start_script:

# This can be removed once https://github.com/cattle-ops/terraform-aws-gitlab-runner/issues/1062 is resolved.
echo 'Disabling "Remove the GitLab Runner from GitLab at shutdown" job...'
chmod a-x /opt/remove_gitlab_registration.sh || true
systemctl disable remove-gitlab-registration.service || true
systemctl stop remove-gitlab-registration.service || true
echo 'Done!'

schmidt-galen-heb, Jan 05 '24 19:01