
[feat] Multiple runners (unfinished)

Open krzysztof-miemiec opened this issue 5 years ago • 42 comments

Description

This is my first attempt to answer my own question: if it's OK to create many runners within a single EC2 "manager" instance, then let's do it. In theory, this allows managing something like a "spot fleet" of docker-machine instances without necessarily spinning up multiple runner instances. Instead of defining aws-gitlab-runner multiple times, we only have to pass:

  module "gitlab_runner" {
    # ...
    runners = [
      { 
        instance_type = "c4.large"
        subnet_id = "subnet-a"
      },
      {
        instance_type = "c5.large"
        subnet_id = "subnet-a"
      },
      {
        instance_type = "c5d.large"
        subnet_id = "subnet-a"
      },
      {
        instance_type = "c5a.large"
        subnet_id = "subnet-a"
      },
      { 
        instance_type = "c4.large"
        subnet_id = "subnet-b"
      },
      {
        instance_type = "c5.large"
        subnet_id = "subnet-b"
      },
      {
        instance_type = "c5d.large"
        subnet_id = "subnet-b"
      },
      {
        instance_type = "c5a.large"
        subnet_id = "subnet-b"
      },
      { 
        instance_type = "c4.large"
        subnet_id = "subnet-c"
      },
      {
        instance_type = "c5.large"
        subnet_id = "subnet-c"
      },
      {
        instance_type = "c5d.large"
        subnet_id = "subnet-c"
      },
      {
        instance_type = "c5a.large"
        subnet_id = "subnet-c"
      },
    ]
  }

If I get this right, we may pay only $5-10/mo for a single GitLab runner instance instead of $60-120/mo for 12 instances, and it may improve on the existing multi-EC2/multi-module solution for:

  • https://github.com/npalm/terraform-aws-gitlab-runner/issues/76
  • https://github.com/npalm/terraform-aws-gitlab-runner/issues/77

Migrations required

NO - it's supposed to work in a backwards-compatible manner (no outputs or resources are changed, though spacing may differ in the resulting config file).

Verification

TODO: I'll have to test the whole module first and get back to you with results.

Please mention the examples you have verified.

Documentation

TODO: It's a new feature, so I'd like to hear the authors'/maintainers' opinion before I proceed further 🙇

Please ensure you update the README in _docs/README.md. The README.md in the root can be updated by running the script ci/bin/autodocs.sh, which requires terraform-docs version 0.8+.

krzysztof-miemiec avatar Aug 20 '20 20:08 krzysztof-miemiec

I'll have to make bigger changes, as I just hit the 16 kB limit on user-data that can be passed to EC2. I'll go with an S3 config bucket to store the configuration in. Potential benefit: we'll be able to see the runner config file as a separate resource.
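For illustration, a minimal sketch of the bucket-based approach (resource and local names here are assumptions, not the PR's actual code): the rendered config is stored as an S3 object, so user-data only needs a short download step instead of carrying the whole file.

resource "aws_s3_bucket" "runner_config" {
  bucket_prefix = "gitlab-runner-config-"
  acl           = "private"
}

resource "aws_s3_bucket_object" "config_toml" {
  bucket  = aws_s3_bucket.runner_config.id
  key     = "config.toml"
  content = local.rendered_runner_config # assumed to hold the templated config.toml
  etag    = md5(local.rendered_runner_config)
}

# user-data then shrinks to roughly:
#   aws s3 cp "s3://${aws_s3_bucket.runner_config.id}/config.toml" /etc/gitlab-runner/config.toml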

krzysztof-miemiec avatar Aug 21 '20 15:08 krzysztof-miemiec

I'll have to make bigger changes, as I just hit the 16 kB limit on user-data that can be passed to EC2. I'll go with an S3 config bucket to store the configuration in. Potential benefit: we'll be able to see the runner config file as a separate resource.

Maybe a base64 encoding can do the job as well

npalm avatar Aug 22 '20 10:08 npalm

@krzysztof-miemiec Sounds really good and saves some money for the runner instances.

kayman-mk avatar Aug 22 '20 11:08 kayman-mk

@krzysztof-miemiec is the PR ready for review?

npalm avatar Aug 22 '20 14:08 npalm

I marked it in the title as WIP for now (or we can close it so the notifications don't disturb you). I'm currently testing it out after the bucket modifications (even gzip-compressed base64 user-data has the same limitation, so we can't scale it to more than a few runners, unfortunately 😞).

krzysztof-miemiec avatar Aug 22 '20 14:08 krzysztof-miemiec

I marked it in the title as WIP for now (or we can close it so the notifications don't disturb you). I'm currently testing it out after the bucket modifications (even gzip-compressed base64 user-data has the same limitation, so we can't scale it to more than a few runners, unfortunately 😞).

I really like that you directly created a PR; it keeps everything transparent and open, which I love in open source. GitHub today also has an option to set your PR to draft mode, which blocks a merge. But WIP is also very (and even more) clear.

npalm avatar Aug 23 '20 09:08 npalm

I marked the PR as draft; I somehow missed that feature earlier... 🙈

Now, it looks like saving the config to an S3 bucket makes a lot of sense. Today I made a step towards shortening downtime during runner changes. First, I noticed that a change to the runner configuration does not trigger an EC2 reboot, so I made use of aws_ssm_document and created two shell scripts which update the config file on the existing EC2 instance and restart gitlab-runner there. Everything is controlled via Terraform (which waits until the provisioner does its job), and runner changes are live after less than 10 seconds. 🎉 I also misconfigured the runner a few times (subnet/AZ mismatch, probably a too-small NVMe volume), so that's a good opportunity to test whether the new config handling works correctly.
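Roughly, such a document could look like the sketch below (document name, bucket and commands are assumptions, not the PR's actual code); Terraform would then trigger it via SSM Run Command after uploading the new config.

resource "aws_ssm_document" "reload_runner_config" {
  name          = "ReloadGitlabRunnerConfig"
  document_type = "Command"

  content = jsonencode({
    schemaVersion = "2.2"
    description   = "Fetch the latest runner config from S3 and restart gitlab-runner"
    mainSteps = [{
      action = "aws:runShellScript"
      name   = "reloadConfig"
      inputs = {
        runCommand = [
          "aws s3 cp s3://my-config-bucket/config.toml /etc/gitlab-runner/config.toml",
          "gitlab-runner restart",
        ]
      }
    }]
  })
}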

krzysztof-miemiec avatar Aug 23 '20 18:08 krzysztof-miemiec

After a few days of testing, I think this PR is ready for an initial review. I was able to apply configuration changes around 50 times and it worked every time. I also tested the case of a fresh instance being started (no jq/gitlab-runner installed) and handled it by skipping the config modifications, as the config is processed anyway after gitlab-runner is installed. The only downside I see now is that a configuration update requires Terraform to run a local script. Potential solution: watch for S3 events indicating config changes and trigger the proper SSM action using a Lambda function.

I'm not sure how to deal with breaking changes; I see two options:

  • make a breaking change (migration is as simple as terragrunt apply; the config bucket, IAM permissions and SSM document are applied automatically, and default options are retained)
  • support the "old" way of dealing with config in parallel with the new solution (as a result it may be really hard to maintain, and I'm not sure what the benefits of that would be 🤔)

Remaining tasks

  • Docs/example
  • I'm still trying to configure a reverse proxy that would cache Docker images (and potentially other frequently accessed things, e.g. npm packages) as per the GitLab docs on distributed container registry mirroring; without it, you can't really use GitLab runners in a private VPC network (that's my current setup - I need to keep all runners under a single IP/network). I can imagine the VPC NAT Gateway data transfer costs being higher than the spot instance costs, given how many Docker images & npm packages we pull every day. I'll probably create a separate issue to cover that case. Benefits of having a solution for that problem: lower costs 💰, ecology 🍃, faster local data transfer 🚀

krzysztof-miemiec avatar Sep 01 '20 20:09 krzysztof-miemiec

@krzysztof-miemiec We do not use this Docker image cache from GitLab. We have a small EC2 instance running Sonatype Nexus for all Maven, Docker, ... artifacts. We need it for the artifacts our CI pipeline produces, so we use it as a cache for all other artifacts too.

kayman-mk avatar Sep 01 '20 20:09 kayman-mk

@kayman-mk this sounds great, thanks a lot! I'll have a look at the OSS version, as it looks like it may replace our existing ChartMuseum and cache multiple Docker and npm repos. 🎉 As with every new tool I learn about - I'm hyped already! 😅

krzysztof-miemiec avatar Sep 01 '20 20:09 krzysztof-miemiec

The only downside I see now is that a configuration update requires Terraform to run a local script

@krzysztof-miemiec if this solution requires a local script call, it might break setups like mine, where we use a multi-account strategy and the machine I run the Terraform changes from is not in the same account as the runners. That means I need to assume a role in order to run local scripts, so the change could make the module unusable in a multi-account scenario.

Keep in mind that I have not reviewed the PR and have very little context.

DavidGamba avatar Sep 03 '20 17:09 DavidGamba

With the new Docker pull policies they announced last week, a proxy/cache for Docker images becomes a real need, to avoid hitting the pull limits.

npalm avatar Sep 03 '20 19:09 npalm

Just tried to run the PR in its current state; some findings below.

  1. Got the following error while running the default example:
Error: Error running command '../../bin/reload-runner-config.sh default-auto123-asg ReloadGitlabConfiguration': exit status 254. Output:
An error occurred (ValidationException) when calling the GetCommandInvocation operation: 2 validation errors detected: Value 'null' at 'instanceId' failed to satisfy constraint: Member must have length greater than or equal to 10.; Value 'null' at 'instanceId' failed to satisfy constraint: Member must satisfy regular expression pattern: (^i-(\w{8}|\w{17})$)|(^mi-\w{17}$).

After running the apply once again, all resources seem to be created successfully.

  2. Next I tried a runners block with several entries, but it seems only my m5.large was used. I didn't have the time to check further.

  3. Checked the assume role, which I also use in some cases. This caused several issues with the script:

  • the AWS configuration does not seem to be picked up, as the script depends on the user-configured region; I would suggest passing the Terraform region into the script, e.g. as AWS_REGION (see the sketch below).
  • the profile is not picked up. There are a few options here: passing the profile name, which requires that a local profile exists (this is not required by Terraform). A maybe better alternative would be to run an assume-role and pass in the role set in the provider (if set).
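For illustration, one way the region could be passed through to the local helper script instead of relying on the operator's AWS CLI configuration; the resource and local names below are assumptions, not the PR's actual wiring.

data "aws_region" "current" {}

resource "null_resource" "reload_runner_config" {
  triggers = {
    config_hash = md5(local.rendered_runner_config) # re-run whenever the rendered config changes
  }

  provisioner "local-exec" {
    command = "${path.module}/bin/reload-runner-config.sh ${var.environment}-asg ReloadGitlabConfiguration"

    environment = {
      # The AWS CLI picks these up, so the script no longer depends on a locally configured region.
      AWS_REGION         = data.aws_region.current.name
      AWS_DEFAULT_REGION = data.aws_region.current.name
    }
  }
}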

npalm avatar Sep 06 '20 14:09 npalm

@npalm thanks for taking the time to check it! After a bit of tinkering, I came up with a solution which does not require a local script. It requires a lot of AWS resources though, so I extracted the whole "technical part" of the config into a separate module. As a result (roughly sketched at the end of this comment):

  • the config is created in an S3 bucket
  • CloudTrail is notified of that change
  • CloudWatch watches for events with a specific filter for the S3 object
  • CloudWatch triggers a "Run Command" on the SSM document
  • the command is run automatically on the "manager" EC2 instance after a configuration change in the S3 bucket

Customizable things:

  • Config S3 bucket
  • Config file S3 key
  • CloudTrail S3 bucket
  • CloudTrail S3 prefix

The above only addresses points 1 and 3. I'll dig deeper to validate point 2 and update the PR with a proper example.
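A sketch of the CloudWatch Events piece of that chain (bucket name, tag value and resource names are assumptions; it also assumes a CloudTrail trail records data events for the config bucket, and it reuses the SSM document name from the earlier sketch):

resource "aws_cloudwatch_event_rule" "config_changed" {
  name = "gitlab-runner-config-changed"

  # Matches the CloudTrail data event written when the config object is replaced.
  event_pattern = jsonencode({
    source        = ["aws.s3"]
    "detail-type" = ["AWS API Call via CloudTrail"]
    detail = {
      eventSource = ["s3.amazonaws.com"]
      eventName   = ["PutObject"]
      requestParameters = {
        bucketName = ["my-config-bucket"]
        key        = ["config.toml"]
      }
    }
  })
}

resource "aws_cloudwatch_event_target" "run_reload_command" {
  rule     = aws_cloudwatch_event_rule.config_changed.name
  arn      = aws_ssm_document.reload_runner_config.arn
  role_arn = aws_iam_role.events_to_ssm.arn # role allowed to call ssm:SendCommand, defined elsewhere

  # Target the "manager" instance by tag so the command survives instance replacement.
  run_command_targets {
    key    = "tag:Name"
    values = ["gitlab-runner-agent"]
  }
}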

krzysztof-miemiec avatar Sep 13 '20 12:09 krzysztof-miemiec

@npalm

I wasn't able to reliably start instances even with multiple runners defined, as the "fake spot fleet" doesn't do its job: the GitLab runner doesn't retry provisioning a docker-machine if a single-instance-type EC2 pool is depleted. Once in a while, a build failed because of that. To vastly improve this, I decided to look at alternative docker-machine drivers. I forked docker-machine-driver-terraform, published a slightly upgraded version of it (it now supports Terraform 0.12+) and tested it with my PR. You can see my runner configuration in examples/runner-custom-terraform-spot-fleet.

To sum up, if this PR gets merged, you'll be able to create a single runner agent which can spin up different runners (spot/non-spot) and even pass enough custom scripting to provision docker-machines with Terraform. This allows creating instances with spot fleets, which is a feature that couldn't land in docker-machine. 🤩

This solution also fixes point 2 of your remarks above. When I was testing the previous version of my PR, it looked like the GitLab runner randomly selected one of the configurations, as I saw different instance types in my AWS Console. What's bad here is what I mentioned above - GitLab does not retry or re-check whether a configuration is correct and boots successfully (the fact that GitLab does not choose runners in any "smart" way is actually mentioned somewhere in the GitLab docs 🤔). A job would fail if the spot instance pool is depleted.

The Terraform driver seems to make the whole thing nearly bulletproof (as long as your TF config is correct). This PR obviously became XXL in size; I hope you'll help me split it up if necessary.

krzysztof-miemiec avatar Sep 20 '20 16:09 krzysztof-miemiec

Looks good, will check later this week. Had a week holiday last week :)

npalm avatar Sep 29 '20 19:09 npalm

Sorry it took a while; I just ran a quick check. The agent is started and docker machines are created, but they are not connected to the agent, so jobs are not started. In the example I missed the providers, so I used the ones from the default sample. I also had an issue with the VPC: the line azs = data.aws_availability_zones.available.names was not correct - the brackets should be removed, since the value is already a list. Also, the Ubuntu AMI could not be selected since there was no unique selection; I added most_recent = true to force the selection of one AMI (both fixes roughly sketched below). Still not working, the docker machines were not started:

NAME                                         ACTIVE   DRIVER      STATE     URL   SWARM   DOCKER   ERRORS
runner-8zfd4nrd-runner-1602623114-c3d5d0a4            terraform   Timeout

Updated the VPC to use private subnets instead of the public ones, same as in the default example. Now docker machines are starting and connecting.
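The two example fixes mentioned above, roughly (AMI filter and VPC module arguments abbreviated; names and versions are illustrative, not the example's exact code):

data "aws_availability_zones" "available" {}

data "aws_ami" "ubuntu" {
  owners      = ["099720109477"] # Canonical
  most_recent = true             # forces a unique choice when the name filter matches several AMIs

  filter {
    name   = "name"
    values = ["ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-*"]
  }
}

module "vpc" {
  source = "terraform-aws-modules/vpc/aws"

  name = "gitlab-runner"
  cidr = "10.0.0.0/16"
  # already a list, so no extra brackets needed
  azs             = data.aws_availability_zones.available.names
  private_subnets = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]
}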

Question: Is the CloudTrail config via SSM still used? It seems not. Or am I missing something?

Question: can you share a config showing how to spin up multiple agents? Should this work? I was not able to spin up multiple agents:

  runners = [
    # Create a new runner for each combination of subnet id and instance type
    for element in setproduct(local.subnet_ids, local.instance_types) : {
      subnet_id     = element[0]
      instance_type = element[1]
    }
  ]

@krzysztof-miemiec I really like the work you did. As long as we can get the PR into a shape that does not conflict with any other user, I don't care about the XXL PR. Once we get it back we can see how to structure it more and more.

npalm avatar Oct 13 '20 21:10 npalm

@npalm quick (organizational) update from my side - I've had quite a busy time recently and didn't have time to go through the example I provided and make sure it runs correctly. I'll take a deep dive in a couple of weeks - I haven't forgotten about this, and I'm glad you liked my work 😊

Some general thoughts (I probably wrote that already 🙈):

👉 Spot instances don't work well when they're defined as a runners array, but this feature can be used to set up various runner configurations with different tags (e.g. we can define it as [ { huge spot instance pool done with Terraform inside }, { small but stable instance pool for deployments }, { an instance pool with GPUs } ]).
👉 docker-machine-driver-terraform, after a few patches, does its job quite well in spinning up multiple instances - I included a sample Terraform config already, however I have to check it one more time to ensure it matches what I used internally at my company. I tested that with the default (just one element) runners definition.

... And some answers:

Question: Is the CloudTrail config via SSM still used? It seems not. Or am I missing something?

I left the whole S3-to-CloudTrail-to-CloudWatch-to-SSM-to-EC2 config untouched (besides fixes), but let me check that one more time. CloudTrail is indirectly used by CloudWatch Events (we have to set up CloudTrail to track S3 bucket actions as CloudWatch events).

Question: can you share a config showing how to spin up multiple agents? Should this work? I was not able to spin up multiple agents

I had some issues with private-only IPs - quite a big data transfer bill when the EC2s download stuff (hundreds of GB/day in various artifacts/Docker containers) through the VPC NAT Gateway, so the EC2s in my fleet usually had public IPs (with no incoming traffic allowed in security groups). However, private IPs are the right way for docker-machine communication 🤔 Are you using the Terraform driver or the regular one?

TODOs for me:

  • [ ] Resolve conflicts
  • [ ] Make sure that sample config corresponds to config I tested internally
  • [ ] Ensure that there are no leftovers in new config submodule
  • [ ] Apply fixes to AZs and AMI selection in examples
  • [ ] Check what's wrong with private IPs & why there are timeouts

krzysztof-miemiec avatar Oct 22 '20 05:10 krzysztof-miemiec

@krzysztof-miemiec I'd just like to check the status: do you plan to work on updating this PR and getting it merged in? Have you decided to not pursue this?

Fuuzetsu avatar Dec 01 '20 06:12 Fuuzetsu

@Fuuzetsu I definitely would like to get this merged, however I've had very little spare time for it recently. I'll probably come up with new commit(s) around the end of December. In the meantime, I can confirm that my fork works well when combined with the adjusted Terraform driver, creating spot instances using various machine types and AZs (as in the example).

krzysztof-miemiec avatar Dec 06 '20 17:12 krzysztof-miemiec

I've been trying this out today. It seems quite nice so far though I've only just started. I should say that I never used terraform-aws-gitlab-runner before and am starting from this PR.

There is a feature request that I would like to add to this PR as it seems to fit the general agenda: allow some of the things from gitlab_runner_registration_config to be part of the runners block. This can mean that each of the registered runners will have its own token.

My main motivation for this is to be able to assign different tags per runner. A simple use-case is creating runners with instance types like t3.micro for small jobs but m5.large for something larger. We can create and register such runners, but there's no way to pick one in .gitlab-ci.yml. You are allowed to specify a tags key to filter, but we only have one GitLab runner with multiple executors. Instead, it'd be cool if each executor was registered by itself (a hypothetical shape for this is sketched below).
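Purely as a hypothetical shape for that idea (this is not an existing module input), the per-runner registration settings could live next to the instance settings, e.g.:

runners = [
  {
    instance_type = "t3.micro"
    subnet_id     = "subnet-a"
    registration = {
      tag_list    = "small"
      description = "runner for small jobs"
    }
  },
  {
    instance_type = "m5.large"
    subnet_id     = "subnet-a"
    registration = {
      tag_list    = "large"
      description = "runner for heavy builds"
    }
  },
]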

Fuuzetsu avatar Dec 09 '20 06:12 Fuuzetsu

A bug report: currently we have

data "aws_subnet" "runners" {
  id = var.subnet_id_runners
}

data "aws_availability_zone" "runners" {
  name = data.aws_subnet.runners.availability_zone
}
…
    aws_zone                  = data.aws_availability_zone.runners.name_suffix
…
      "amazonec2-zone=${aws_zone}",

This makes it impossible to use multiple AZs!

A fix is to simply allow aws_zone to be set per runner:

data "aws_availability_zone" "subnets" {
  for_each = aws_subnet.foo
  name     = each.value.availability_zone
}
…
  runners = [
    for k, v in aws_subnet.foo : {
      name          = "${var.name}-${data.aws_availability_zone.subnets[k].name}-${var.env}"
      instance_type = "m5.large"
      subnet_id     = v.id
      aws_zone      = data.aws_availability_zone.subnets[k].name_suffix
    }
  ]

The data bit could live inside code to completely hide it from the user and it should Just Work™.

EDIT: I see that this is the fault of subnet_id_runners which isn't coming from this PR. This PR has the opportunity to easily work around it though, just like we're allowed to specify different subnet_ids.

Fuuzetsu avatar Dec 09 '20 06:12 Fuuzetsu

Another report: applying today I got Error: Error creating CloudTrail: InsufficientS3BucketPolicyException: Incorrect S3 bucket policy is detected for bucket: <bucket-name>, so there seems to be some race condition, as re-running the apply gave:

  + resource "aws_cloudtrail" "cloudtrail" {

as the only resource left and it applied well.
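A common way to avoid that particular race is an explicit dependency from the trail to the bucket policy, so CloudTrail only validates the policy after it has been applied (resource names assumed, bucket and policy defined elsewhere in the module):

resource "aws_cloudtrail" "cloudtrail" {
  name           = "gitlab-runner-config-trail"
  s3_bucket_name = aws_s3_bucket.cloudtrail.id

  # Make sure the bucket policy is in place before CloudTrail validates it.
  depends_on = [aws_s3_bucket_policy.cloudtrail]
}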

Fuuzetsu avatar Dec 10 '20 02:12 Fuuzetsu

w.r.t https://github.com/npalm/terraform-aws-gitlab-runner/pull/249#issuecomment-741572105

A fix is to simply allow aws_zone to be set per runner:

Please ignore this. You can do exactly that (set aws_zone for the runner) and it works properly. I think I messed something else up at the time.

Fuuzetsu avatar Dec 10 '20 03:12 Fuuzetsu

Another thing: the whole CloudWatch event machinery to pull in a new runner configuration does not work unless SSM is enabled (enable_runner_ssm_access = true). Without this, the instance is never considered a "managed instance" in SSM terms, so the Run Command is not able to target it. This should probably be set to true by default, and/or there should be some guard that doesn't create the config-update resources when it's not enabled (see the sketch below).
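A guard along these lines would keep the reload machinery from being created when SSM access is disabled (only the enable_runner_ssm_access name is taken from the module; the rest is a sketch):

variable "enable_runner_ssm_access" {
  type    = bool
  default = true # proposed new default, so the config-reload machinery works out of the box
}

# The same count guard would apply to the CloudWatch rule/target and CloudTrail pieces.
resource "aws_ssm_document" "reload_runner_config" {
  count         = var.enable_runner_ssm_access ? 1 : 0
  name          = "ReloadGitlabRunnerConfig"
  document_type = "Command"

  content = jsonencode({
    schemaVersion = "2.2"
    description   = "Reload the GitLab Runner configuration"
    mainSteps = [{
      action = "aws:runShellScript"
      name   = "reloadConfig"
      inputs = { runCommand = ["gitlab-runner restart"] }
    }]
  })
}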

Fuuzetsu avatar Dec 10 '20 04:12 Fuuzetsu

This can mean that each of the registered runners will have its own token.

Not only does this allow tagging each executor differently, there is an additional advantage: we're also able to register multiple specific runners. Right now the choice is either a single project, group, or shared runner. If we want to have a specific runner for each project, this currently means we need a GitLab runner bastion for each.

So it seems that parametrising over the block per executor would be a huge improvement on multiple fronts while still only using a single bastion.

Fuuzetsu avatar Dec 11 '20 03:12 Fuuzetsu

Another report: changing gitlab_runner_registration_token after we have deployed once has no effect. This is because when we initially registered with this value, the user_data put the resulting token in SSM:

token=$(aws ssm get-parameters --names "${secure_parameter_store_runner_token_key}" --with-decryption --region "${secure_parameter_store_region}" | jq -r ".Parameters | .[0] | .Value")
if [[ `echo ${runners_token}` == "__REPLACED_BY_USER_DATA__" && `echo $token` == "null" ]]
then
  token=$(curl --request POST -L "${runners_gitlab_url}/api/v4/runners" \
    --form "token=${gitlab_runner_registration_token}" \
    --form "tag_list=${gitlab_runner_tag_list}" \
    --form "description=${giltab_runner_description}" \
    --form "locked=${gitlab_runner_locked_to_project}" \
    --form "run_untagged=${gitlab_runner_run_untagged}" \
    --form "maximum_timeout=${gitlab_runner_maximum_timeout}" \
    --form "access_level=${gitlab_runner_access_level}" \
    | jq -r .token)
  aws ssm put-parameter --overwrite --type SecureString  --name "${secure_parameter_store_runner_token_key}" --value="$token" --region "${secure_parameter_store_region}"
fi

What this means is that changing any of the parameters inside the if block has no useful effect (the user_data changes, so it will remake the launch configuration and therefore the ASG, and we'll get a new instance; but that's not going to do anything for us due to the persisted global token state in SSM).

This can probably be addressed by storing, at token generation time, which parameters were used in the if block; if these do not match when we retrieve the token (with its metadata), we should register again.

There does seem to be an attempt not to re-use the token when the GitLab runner instance gets cycled, as is evident here:

# Remove all running docker-machine instances
# Overwrite token in SSM with null and remove runner from Gitlab
stop() {
    logger "Removing Docker Machine Instances"
    for runner in $(docker-machine ls | awk '{print $1}' | grep runner); do docker-machine rm -y "$runner"; done
    logger "Removing Gitlab Runner Token"
    aws ssm put-parameter --overwrite --type SecureString  --name "${secure_parameter_store_runner_token_key}" --region "${secure_parameter_store_region}" --value="null" 2>&1 | logger &
    curl -sS --request DELETE "${runners_gitlab_url}/api/v4/runners" --form "token=$token" 2>&1 | logger &
    retval=\$?
    rm -f \$lockfile
    return \$retval
}

but this doesn't run when the instance is recreated; at least I don't see anything in CloudWatch. Perhaps it should be made into a systemd service that stops after the gitlab-runner service does? Still, we can't rely on a clean shutdown, so we should also have a check at startup that the data is all up to date.

EDIT: Much better would be if changing these parameters didn't recreate the instance but instead worked the same way as updating config.toml: put these values in SSM/S3 somewhere and register events that unregister the old info and register the new one. This way we don't have to re-make the whole ASG just to add an extra tag to the runner. I suspect this becomes much more necessary once we can have multiple separately registered executors, as I discussed earlier in the PR.

EDIT 2: I do see the instance unregistering sometimes (if I destroy the module rather than just change some parameters), so I'm not sure what it depends on. But it's not reliable.

Fuuzetsu avatar Dec 11 '20 03:12 Fuuzetsu

Another issue:

  # module.gitlab_runner.aws_cloudwatch_log_group.environment[0] will be created
  + resource "aws_cloudwatch_log_group" "environment" {
      + arn               = (known after apply)
      + id                = (known after apply)
      + name              = "…"
      + retention_in_days = 0
      + tags              = {
          + "Environment" = "…"
          + "Name"        = "…"
        }
    }
Error: Creating CloudWatch Log Group failed: ResourceAlreadyExistsException: The specified log group already exists:  The CloudWatch Log Group '…' already exists.

So if the instance starts first, it starts logging and creates the log group itself. Then Terraform tries to do the same and fails. The group should be made before the ASG, or something along those lines.

EDIT: I think this may happen a different way. When we destroy the module, we destroy the log group and anything that directly mentions it. But the instance is still not fully terminated, so it keeps logging and ends up re-creating the group. Now, the next time you try to apply the module, the group already exists and we're in trouble. One way around this is to disallow the instance from creating the group; that way Terraform has full control (see the policy sketch below).

EDIT 2: Yeah, after trying a few cycles I think it's because the group gets recreated by the instance that is still terminating after TF initially destroys it. You can see the group get re-made in CloudWatch.
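A sketch of what "the instance can log but cannot create the group" could look like in the instance role policy (only the aws_cloudwatch_log_group.environment name comes from the module; the rest is illustrative):

data "aws_iam_policy_document" "instance_logging" {
  statement {
    effect = "Allow"
    actions = [
      "logs:CreateLogStream",
      "logs:PutLogEvents",
      "logs:DescribeLogStreams",
    ]
    resources = ["${aws_cloudwatch_log_group.environment[0].arn}:*"]
  }
  # Deliberately no logs:CreateLogGroup, so only Terraform ever creates the group.
}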

Fuuzetsu avatar Dec 11 '20 04:12 Fuuzetsu

Today I realised there is a very easy way to simplify the whole thing and have multiple executors. The main problem is synchronising config against some source of truth. Really, it's only the parameters used for registering executors; everything else is just part of the config in S3 that we send verbatim. We could put everything in SSM like we do with the token, but that means some complicated handling of changing all the parameters, making sure we properly register everything, deregister old tokens when changing params, etc.: generally a mess.

There's a much simpler way though: we already have a central source of truth - it's Terraform itself (and its state file).

So we can simply register (and deregister) every runner with curl and insert the tokens into the config. This means the config we send is always complete, and all the tokens are tracked properly and remade on parameter changes (and only when necessary: no unregistering every token because one parameter changed).

A local-exec based resource to get the tokens should just work (one possible shape is sketched below). I may post a simple wrapper provider for it on Monday.
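One way to approximate this with plain Terraform, using the external data source so the returned token can flow back into the rendered config (the wrapper script and all names below are assumptions):

data "external" "runner_registration" {
  for_each = { for r in var.runners : r.name => r } # assumes each runner entry carries a name, tag_list and description

  program = ["bash", "${path.module}/register-runner.sh"]

  query = {
    gitlab_url         = var.runners_gitlab_url
    registration_token = var.gitlab_runner_registration_token
    tag_list           = each.value.tag_list
    description        = each.value.description
  }
}

# The hypothetical register-runner.sh reads the query JSON on stdin, calls
# POST ${gitlab_url}/api/v4/runners (the same call as in the user_data snippet above)
# and prints {"token":"..."} on stdout; the result is then available as
# data.external.runner_registration["<name>"].result.token for templating into config.toml.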

Fuuzetsu avatar Dec 11 '20 22:12 Fuuzetsu

I've made a start here https://github.com/tsurucapital/terraform-provider-gitlab/tree/gitlab-runner-register

This allows registering each executor/runner separately via Terraform and getting a token for each, which was probably the biggest hurdle. With that, we don't have to register on the actual machine and mess with SSM etc. I hope to polish it up tomorrow and ask for inclusion in the official GitLab provider. Still WIP.

Fuuzetsu avatar Dec 14 '20 07:12 Fuuzetsu