terraform-aws-gitlab-runner
Spot Fleet doesn't work as expected
Describe the bug
I list multiple EC2 instance types for Spot Fleet workers, but only one instance type is used to generate spot requests. The spot requests are of type "instance", but I believe they should be of type "fleet" so that multiple instance types can be requested. I've tried docker_machine_version = "0.16.2-gitlab.19-cki.2" and 0.16.2-gitlab.19-cki.4, with the same result.
To Reproduce
Configure the module similarly to the "Scenario: Use of Spot Fleet" example from the documentation, specifying several instance types. Observe that the same instance type is launched for all jobs.
Expected behavior
AWS spot requests created should be of "fleet" type with multiple EC2 instance types.
Configuration used
My terraform config
terraform:module "gitlab-runner" {
source = "npalm/gitlab-runner/aws"
version = "v7.3.1"
environment = var.gitlab_environment
vpc_id = var.aws_vpc_id
subnet_id = element(var.aws_private_subnets, 0)
runner_cloudwatch = {
enable = true
retention_days = 60
}
runner_gitlab = {
url = var.gitlab_url
}
runner_gitlab_registration_config = {
registration_token = var.gitlab_registration_token
tag_list = var.gitlab_tags
description = var.runners_description
locked_to_project = "true"
run_untagged = "false"
maximum_timeout = "7200"
}
runner_instance = {
name = var.runners_name
type = "t3a.large"
ssm_access = true
root_device_config = {
volume_size = 50 # GiB
}
}
runner_install = {
amazon_ecr_credential_helper = true
docker_machine_version = "0.16.2-gitlab.19-cki.2"
}
runner_worker = {
type = "docker+machine"
ssm_access = true
}
runner_worker_docker_machine_fleet = {
enable = true
}
runner_worker_docker_machine_instance = {
types = ["t3a.large", "t3.large", "m5a.large", "m5.large", "m6a.large"]
subnet_ids = var.aws_private_subnets
start_script = file("${path.module}/worker_userdata.sh")
volume_type = "gp3"
root_size = 50
}
runner_worker_docker_options = {
privileged = true
volumes = [
"/var/run/docker.sock:/var/run/docker.sock",
"/gitlab-runner/docker:/root/.docker",
"/gitlab-runner/ssh:/root/.ssh:ro",
"/root/.pypirc:/root/.pypirc",
"/root/.npmrc:/root/.npmrc"
]
}
}
Hello @Leonidimus
I understand what you are asking, but this is not how this module works. Let me explain.
If you want to see what the code does, you can check CloudTrail > Event history and look for CreateFleet events.
You will see something like this:
"eventTime": "2024-03-13T18:50:12Z",
"eventSource": "ec2.amazonaws.com",
"eventName": "CreateFleet",
"awsRegion": "eu-west-3",
"sourceIPAddress": ""***********:",
"userAgent": "aws-sdk-go/1.44.153 (go1.12.9; linux; amd64)",
"requestParameters": {
"CreateFleetRequest": {
"TargetCapacitySpecification": {
"DefaultTargetCapacityType": "spot",
"TotalTargetCapacity": 1
},
"Type": "instant",
"SpotOptions": {
"AllocationStrategy": "price-capacity-optimized",
"MaxTotalPrice": "0.50"
},
The important parts are "TotalTargetCapacity": 1 and "Type": "instant". They mean that exactly 1 instance is requested, and that the Fleet is deleted right after the instance is created. "AllocationStrategy": "price-capacity-optimized" means that you want the best combination of price and available capacity.
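If you prefer to pull these events programmatically rather than through the console, a minimal sketch along these lines should work (this is just boto3 against the CloudTrail lookup API, assuming CloudTrail is recording in your region; it is not part of the module):

import boto3

# Look up recent CreateFleet calls so you can inspect the request
# parameters shown in the excerpt above.
cloudtrail = boto3.client("cloudtrail", region_name="eu-west-3")

response = cloudtrail.lookup_events(
    LookupAttributes=[
        {"AttributeKey": "EventName", "AttributeValue": "CreateFleet"}
    ],
    MaxResults=10,
)

for event in response["Events"]:
    # CloudTrailEvent holds the raw JSON document, like the excerpt above.
    print(event["EventTime"], event["EventName"])
    print(event["CloudTrailEvent"])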
In the LaunchTemplateConfigs part of the same request, you will see your choice of instance types:
"LaunchTemplateConfigs": {
"LaunchTemplateSpecification": {
"LaunchTemplateName": "gitlab-runner-dev-shr-small-ai-worker-20230510162620868300000001",
"Version": "$Latest"
},
"Overrides": [
{
"tag": 1,
"SubnetId": "subnet-"***********:",
"InstanceType": "t3a.medium"
},
{
"tag": 2,
"SubnetId": "subnet-"***********:",
"InstanceType": "t3a.medium"
},
{
"tag": 3,
"SubnetId": "subnet-"***********:",
"InstanceType": "t3a.medium"
},
{
"tag": 4,
"SubnetId": "subnet-"***********:",
"InstanceType": "t3.medium"
},
{
"tag": 5,
"SubnetId": "subnet-"***********:",
"InstanceType": "t3.medium"
},
{
"tag": 6,
"SubnetId": "subnet-"***********:",
"InstanceType": "t3.medium"
},
{
"tag": 7,
"SubnetId": "subnet-"***********:",
"InstanceType": "c5a.large"
},
{
"tag": 8,
"SubnetId": "subnet-"***********:",
"InstanceType": "c5a.large"
},
{
"tag": 9,
"SubnetId": "subnet-"***********:",
"InstanceType": "c5a.large"
},
{
"tag": 10,
"SubnetId": "subnet-"***********:",
"InstanceType": "t3.large"
},
{
"tag": 11,
"SubnetId": "subnet-"***********:",
"InstanceType": "t3.large"
},
{
"tag": 12,
"SubnetId": "subnet-"***********:",
"InstanceType": "t3.large"
},
{
"tag": 13,
"SubnetId": "subnet-"***********:",
"InstanceType": "c6a.large"
},
{
"tag": 14,
"SubnetId": "subnet-"***********:",
"InstanceType": "c6a.large"
},
{
"tag": 15,
"SubnetId": "subnet-"***********:",
"InstanceType": "c6a.large"
},
{
"tag": 16,
"SubnetId": "subnet-"***********:",
"InstanceType": "c5d.large"
},
{
"tag": 17,
"SubnetId": "subnet-"***********:",
"InstanceType": "c5d.large"
},
{
"tag": 18,
"SubnetId": "subnet-"***********:",
"InstanceType": "c5d.large"
}
],
In my configuration we use 6 different instance types across 3 AZs, which is why there are 18 overrides above.
The Fleet will always launch the same kind of instance, because it considers it the best choice right now in terms of price and capacity. If that capacity becomes low, it will automatically switch the instance type or AZ when launching the next docker+machine instance requested.
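If you want to reproduce such a request outside of the runner, to see which type the allocation strategy picks for you, a minimal boto3 sketch of an equivalent one-instance "instant" Fleet could look like this (this is not the module's code; the launch template name and subnet IDs below are placeholders):

import boto3

ec2 = boto3.client("ec2", region_name="eu-west-3")

# Placeholder values: replace with your own launch template and subnets.
instance_types = ["t3a.large", "t3.large", "m5a.large"]
subnet_ids = ["subnet-11111111", "subnet-22222222"]

response = ec2.create_fleet(
    Type="instant",  # one-shot request, the Fleet does not persist
    TargetCapacitySpecification={
        "TotalTargetCapacity": 1,  # docker+machine asks for a single instance
        "DefaultTargetCapacityType": "spot",
    },
    SpotOptions={
        "AllocationStrategy": "price-capacity-optimized",
        "MaxTotalPrice": "0.50",
    },
    LaunchTemplateConfigs=[
        {
            "LaunchTemplateSpecification": {
                "LaunchTemplateName": "my-worker-launch-template",  # placeholder
                "Version": "$Latest",
            },
            # One override per (instance type, subnet) pair, like the
            # 18 overrides shown in the CloudTrail event above.
            "Overrides": [
                {"InstanceType": t, "SubnetId": s}
                for t in instance_types
                for s in subnet_ids
            ],
        }
    ],
)

# For an "instant" Fleet the chosen instance(s) are returned directly.
for instance in response["Instances"]:
    print(instance["InstanceType"], instance["InstanceIds"])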
This is NOT perfect: I am sure you may want to spread the instances over multiple types right from the start, to reduce the chance of several instances being reclaimed at the same time. Unfortunately, this is not how the software was developed, and we are limited by the VERY OLD and deprecated docker+machine code base :-) .
In any case, we use this feature for our production runner fleet, launching 10k+ jobs per day for 200+ developers, and it runs like a charm in eu-west-3, with very few availability incidents.
Do not hesitate to reach out if you have any additional questions. You may also want to improve the cki code base if you have some ideas; I would be very happy to test any new release in our setup.
Best regards,
@Tiduster Thanks for explaining that.