terraform-aws-gitlab-runner
Docker Autoscaler not working in AWS
Server: GitLab EE: v16.11.8-ee Client: v16.10.0 (Also tried v16.11.3)
Describe the bug
When using the Docker Autoscaler executor, the Runner Manager is unable to SSH into the Worker. The Worker's sshd log reports a "key not found" error.
To Reproduce
Steps to reproduce the behavior:
- Use the following basic main.tf:
# "runner" is a placeholder module label
module "runner" {
  source = "cattle-ops/gitlab-runner/aws"
  version = "7.12.1"

  environment = "gitlab-runners-fleet"
  vpc_id = data.aws_vpc.vpc.id
  subnet_id = data.aws_subnets.example_subnets.ids[0]
  iam_permissions_boundary = "POLICY-PERMISSION-BOUNDARY"

  runner_gitlab = {
    url = "https://example.mycompany.com/"
    runner_version = "16.10.0"
    preregistered_runner_token_ssm_parameter_name = "example-gitlab-runners-fleet-preregistered-token"
  }

  runner_manager = {
    maximum_concurrent_jobs = 10
  }

  runner_instance = {
    name = "gitlab-run"
    root_device_config = {
      volume_size = 100
      volume_type = "gp3"
    }
    ssm_access = true
  }

  runner_worker = {
    ssm_access = true
    max_jobs = 10
    request_concurrency = 10
    type = "docker-autoscaler"
    environment_variables = ["AWS_REGION=us-west-2", "AWS_SDK_LOAD_CONFIG=true", "DOCKER_AUTH_CONFIG={\"auths\":{\"https://index.docker.io/v1/\":{\"auth\":\"${var.docker_auth_token}\"}}, \"credHelpers\": {\"${data.aws_caller_identity.current.account_id}.dkr.ecr.us-west-2.amazonaws.com\": \"ecr-login\"}}"]
  }

  runner_worker_docker_autoscaler_instance = {
    root_size = 100
    volume_type = "gp3"
    monitoring = true
  }

  runner_worker_docker_autoscaler_role = {
    policy_arns = ["arn:aws:iam::1111111111:policy/somepolicy", "arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryFullAccess"]
  }

  runner_worker_docker_autoscaler_asg = {
    on_demand_percentage_above_base_capacity = 0
    spot_allocation_strategy = "on-demand-price"
    enable_mixed_instance_policy = true
    idle_time = 600
    subnet_ids = data.aws_subnets.example_subnets.ids
    types = ["c5.large", "c5.xlarge", "c5.2xlarge", "c5.4xlarge"]
    volume_type = "gp3"
    private_address_only = true
    ebs_optimized = true
    root_size = 100
    sg_ingresses = [
      {
        description = "Allow all traffic within VPC and across local (TEST PURPOSE)"
        from_port = 0
        to_port = 65535
        protocol = "tcp"
        cidr_blocks = ["10.0.0.0/8"]
      }
    ]
  }

  runner_worker_docker_options = {
    volumes = ["/cache", "/var/run/docker.sock:/var/run/docker.sock"]
  }

  runner_worker_docker_autoscaler = {
    connector_config_user = "ubuntu"
  }

  runner_ami_owners = ["2222222222"]
  runner_ami_filter = {
    "tag:Name" = ["example-amazon-linux-ami"]
  }

  runner_worker_docker_autoscaler_ami_owners = ["1111111111"]
  runner_worker_docker_autoscaler_ami_filter = {
    "tag:Name" = ["example-ubuntu-ami"]
  }

  runner_worker_docker_autoscaler_autoscaling_options = [
    {
      periods = ["* * * * *"]
      timezone = "UTC"
      idle_count = 1
      idle_time = "600s"
      scale_factor = 2
    }
  ]

  debug = {
    trace_runner_user_data = true
    write_runner_config_to_file = true
    write_runner_user_data_to_file = true
  }
}
- Run a pipeline that uses this Runner. The job fails with the following error:
ERROR: Failed to remove network for build
ERROR: Preparation failed: preparing environment: dial ssh: after retrying 30 times: ssh: handshake failed: ssh: unable to authenticate, attempted methods [none publickey], no supported methods remain
- Log in to both the Runner Manager and the Worker and check the logs. The Worker's sshd log shows a "key not found" error when the Manager tries to connect. Please note that the authorized_keys file on the Worker does have the runner-worker-key public key added, but I believe the matching key pair is missing on the Runner Manager, from which the SSH connection is made.
sshd[3781]: debug1: trying public key file /home/ubuntu/.ssh/authorized_keys
sshd[3781]: debug1: fd 9 clearing O_NONBLOCK
sshd[3781]: debug2: key not found
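One way to test the key-mismatch hypothesis is to compare fingerprints: the key the Manager offers must have the same SHA256 fingerprint as an entry in the Worker's authorized_keys. The following Python sketch is only an illustration and is not part of the module. The function names are hypothetical, and it assumes plain authorized_keys lines whose options (if any) contain no spaces.

```python
import base64
import hashlib

# Prefixes of common OpenSSH public key type names.
KEY_TYPE_PREFIXES = ("ssh-", "ecdsa-sha2-", "sk-")


def openssh_sha256_fingerprint(public_key_line):
    """Return the SHA256 fingerprint of one OpenSSH public key line.

    The fingerprint is the unpadded base64 of the SHA-256 digest of the
    decoded key blob, which is the form OpenSSH prints as SHA256:...
    """
    fields = public_key_line.split()
    for i, field in enumerate(fields[:-1]):
        if field.startswith(KEY_TYPE_PREFIXES):
            blob = base64.b64decode(fields[i + 1])
            digest = hashlib.sha256(blob).digest()
            encoded = base64.b64encode(digest).decode("ascii")
            return "SHA256:" + encoded.rstrip("=")
    raise ValueError("no OpenSSH public key found in line")


def authorized_fingerprints(authorized_keys_text):
    """Collect the fingerprints of all keys in an authorized_keys file."""
    fingerprints = set()
    for line in authorized_keys_text.splitlines():
        line = line.strip()
        # Skip blank lines and comments.
        if line and not line.startswith("#"):
            fingerprints.add(openssh_sha256_fingerprint(line))
    return fingerprints


def key_is_authorized(manager_public_key_line, authorized_keys_text):
    """True if the Manager's public key appears in authorized_keys."""
    wanted = openssh_sha256_fingerprint(manager_public_key_line)
    return wanted in authorized_fingerprints(authorized_keys_text)
```

To use it, the Manager's public key line can be derived from whatever private key the Manager actually uses with `ssh-keygen -y -f <private key file>`, then passed to `key_is_authorized` together with the contents of /home/ubuntu/.ssh/authorized_keys from the Worker. A result of False would be consistent with the "key not found" message above.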
Expected behavior
The Runner Manager should connect to the Worker without errors and run the job.