terraform-aws-gitlab-runner

Spot Fleet doesn't work as expected

Open Leonidimus opened this issue 1 year ago • 1 comment

Describe the bug

I list multiple EC2 instance types for Spot Fleet workers, but only one instance type is used to generate spot requests. The spot request type is "instance", but I believe it should be of the "fleet" type so that multiple instance types can be requested. I've tried docker_machine_version = "0.16.2-gitlab.19-cki.2" and 0.16.2-gitlab.19-cki.4, with the same result.

To Reproduce

Configure the module similarly to the "Scenario: Use of Spot Fleet" section of the documentation and specify several instance types. Observe that the same instance type is launched for all jobs.
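
For reference, one way to confirm which instance types were actually launched for the spot workers is to list the running spot instances with the AWS CLI (a sketch, not part of the original report; adjust the region and add tag filters matching your runner environment):

    # Count running spot instances per instance type (sketch; narrow the filters to your runner's tags if needed)
    aws ec2 describe-instances \
      --filters "Name=instance-lifecycle,Values=spot" "Name=instance-state-name,Values=running" \
      --query "Reservations[].Instances[].InstanceType" \
      --output text | tr '\t' '\n' | sort | uniq -c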

Expected behavior

The AWS spot requests created should be of the "fleet" type, so that multiple EC2 instance types can be used.

Configuration used

My Terraform config:
module "gitlab-runner" {
  source = "npalm/gitlab-runner/aws"
  version = "v7.3.1"
  environment = var.gitlab_environment
  vpc_id                   = var.aws_vpc_id
  subnet_id                = element(var.aws_private_subnets, 0)

  runner_cloudwatch = {
    enable = true
    retention_days = 60
  }

  runner_gitlab = {
    url = var.gitlab_url
  }

  runner_gitlab_registration_config = {
    registration_token = var.gitlab_registration_token
    tag_list           = var.gitlab_tags
    description        = var.runners_description
    locked_to_project  = "true"
    run_untagged       = "false"
    maximum_timeout    = "7200"
  }
  
  runner_instance = {
    name = var.runners_name
    type = "t3a.large"
    ssm_access = true
    root_device_config = {
      volume_size = 50 # GiB
    }
  }

  runner_install = {
    amazon_ecr_credential_helper = true
    docker_machine_version = "0.16.2-gitlab.19-cki.2"
  }

  runner_worker = {
    type = "docker+machine"
    ssm_access = true
  }

  runner_worker_docker_machine_fleet = {
    enable = true
  }

  runner_worker_docker_machine_instance = {
    types = ["t3a.large", "t3.large", "m5a.large", "m5.large", "m6a.large"]
    subnet_ids = var.aws_private_subnets
    start_script = file("${path.module}/worker_userdata.sh")
    volume_type = "gp3"
    root_size = 50
  }

  runner_worker_docker_options = {
    privileged = true
    volumes = [
      "/var/run/docker.sock:/var/run/docker.sock",
      "/gitlab-runner/docker:/root/.docker",
      "/gitlab-runner/ssh:/root/.ssh:ro",
      "/root/.pypirc:/root/.pypirc",
      "/root/.npmrc:/root/.npmrc"
    ]
  }
}

Leonidimus · Mar 07 '24 17:03

Hello @Leonidimus

I understand what you are asking, but this is not how this module works. Let me explain.

If you want to see what the code does, you can check CloudTrail > Event history and look for CreateFleet events.
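
The same lookup can also be done from the command line (a sketch using the AWS CLI; adjust the region and result count to your setup):

    # List recent CreateFleet calls recorded by CloudTrail (sketch)
    aws cloudtrail lookup-events \
      --lookup-attributes AttributeKey=EventName,AttributeValue=CreateFleet \
      --max-results 5 \
      --region eu-west-3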

You will see something like this:

    "eventTime": "2024-03-13T18:50:12Z",
    "eventSource": "ec2.amazonaws.com",
    "eventName": "CreateFleet",
    "awsRegion": "eu-west-3",
    "sourceIPAddress": ""***********:",
    "userAgent": "aws-sdk-go/1.44.153 (go1.12.9; linux; amd64)",
    "requestParameters": {
        "CreateFleetRequest": {
            "TargetCapacitySpecification": {
                "DefaultTargetCapacityType": "spot",
                "TotalTargetCapacity": 1
            },
            "Type": "instant",
            "SpotOptions": {
                "AllocationStrategy": "price-capacity-optimized",
                "MaxTotalPrice": "0.50"
            },

The important parts are "TotalTargetCapacity": 1 and "Type": "instant". They mean that one instance is requested, and that the Fleet is a one-time ("instant") request that does not persist after the instance is created. "AllocationStrategy": "price-capacity-optimized" means that AWS picks the instance pool with the best combination of price and available capacity.

In the LaunchTemplateConfigs section of the same request, you will see your choice of instance types:

"LaunchTemplateConfigs": {
                "LaunchTemplateSpecification": {
                    "LaunchTemplateName": "gitlab-runner-dev-shr-small-ai-worker-20230510162620868300000001",
                    "Version": "$Latest"
                },
                "Overrides": [
                    {
                        "tag": 1,
                        "SubnetId": "subnet-"***********:",
                        "InstanceType": "t3a.medium"
                    },
                    {
                        "tag": 2,
                        "SubnetId": "subnet-"***********:",
                        "InstanceType": "t3a.medium"
                    },
                    {
                        "tag": 3,
                        "SubnetId": "subnet-"***********:",
                        "InstanceType": "t3a.medium"
                    },
                    {
                        "tag": 4,
                        "SubnetId": "subnet-"***********:",
                        "InstanceType": "t3.medium"
                    },
                    {
                        "tag": 5,
                        "SubnetId": "subnet-"***********:",
                        "InstanceType": "t3.medium"
                    },
                    {
                        "tag": 6,
                        "SubnetId": "subnet-"***********:",
                        "InstanceType": "t3.medium"
                    },
                    {
                        "tag": 7,
                        "SubnetId": "subnet-"***********:",
                        "InstanceType": "c5a.large"
                    },
                    {
                        "tag": 8,
                        "SubnetId": "subnet-"***********:",
                        "InstanceType": "c5a.large"
                    },
                    {
                        "tag": 9,
                        "SubnetId": "subnet-"***********:",
                        "InstanceType": "c5a.large"
                    },
                    {
                        "tag": 10,
                        "SubnetId": "subnet-"***********:",
                        "InstanceType": "t3.large"
                    },
                    {
                        "tag": 11,
                        "SubnetId": "subnet-"***********:",
                        "InstanceType": "t3.large"
                    },
                    {
                        "tag": 12,
                        "SubnetId": "subnet-"***********:",
                        "InstanceType": "t3.large"
                    },
                    {
                        "tag": 13,
                        "SubnetId": "subnet-"***********:",
                        "InstanceType": "c6a.large"
                    },
                    {
                        "tag": 14,
                        "SubnetId": "subnet-"***********:",
                        "InstanceType": "c6a.large"
                    },
                    {
                        "tag": 15,
                        "SubnetId": "subnet-"***********:",
                        "InstanceType": "c6a.large"
                    },
                    {
                        "tag": 16,
                        "SubnetId": "subnet-"***********:",
                        "InstanceType": "c5d.large"
                    },
                    {
                        "tag": 17,
                        "SubnetId": "subnet-"***********:",
                        "InstanceType": "c5d.large"
                    },
                    {
                        "tag": 18,
                        "SubnetId": "subnet-"***********:",
                        "InstanceType": "c5d.large"
                    }
                ],

In my configuration, we use 6 different instance types across 3 AZs.
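
To make the mapping to the EC2 API concrete, the same kind of one-off request could be reproduced by hand with the AWS CLI (a sketch only; the launch template name and subnet IDs are placeholders, and --dry-run ensures nothing is actually launched):

    # Hypothetical "instant" fleet request similar to what docker+machine issues (sketch; --dry-run returns a DryRunOperation error by design)
    aws ec2 create-fleet --dry-run \
      --type instant \
      --target-capacity-specification "TotalTargetCapacity=1,DefaultTargetCapacityType=spot" \
      --spot-options "AllocationStrategy=price-capacity-optimized,MaxTotalPrice=0.50" \
      --launch-template-configs '[{
        "LaunchTemplateSpecification": {"LaunchTemplateName": "my-worker-template", "Version": "$Latest"},
        "Overrides": [
          {"InstanceType": "t3a.medium", "SubnetId": "subnet-aaaaaaaa"},
          {"InstanceType": "c5a.large", "SubnetId": "subnet-bbbbbbbb"}
        ]
      }]'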

The Fleet will keep launching the same kind of instance, because it considers that pool the best right now in terms of price and capacity. If its capacity becomes low, the instance type or AZ is switched automatically when the next docker+machine instance is requested.

This is NOT perfect: I am sure you would prefer to spread the instances across multiple types right from the start, to reduce the chance of several instances being reclaimed at the same time. Unfortunately, this is not how the software was developed, and we are limited by the VERY OLD and deprecated docker+machine code base. :-)

In any case, we use this feature for our production runner fleet, launching 10k+ jobs per day for 200+ developers, and it runs like a charm in eu-west-3, with very few availability incidents.

Do not hesitate to reach out if you have any additional questions. You may also want to improve the cki codebase if you have some ideas; I will be very happy to test any new release in our setup.

Best regards,

cpatry-poly · Mar 14 '24 08:03

@Tiduster Thanks for explaining that.

kayman-mk · Mar 22 '24 10:03