terraform-aws-gitlab-runner

Docker Autoscaler not working in AWS

Open chovdary123 opened this issue 1 year ago • 16 comments

Server: GitLab EE v16.11.8-ee
Client: v16.10.0 (also tried v16.11.3)

Describe the bug

When using the Docker Autoscaler executor, the Runner Manager is unable to SSH into the Worker; the Worker's sshd log reports a "key not found" error.

To Reproduce

Steps to reproduce the behavior:

  1. Use the following basic main.tf:
module "gitlab_runner" { # the module label was lost in the paste; "gitlab_runner" is a placeholder
  source  = "cattle-ops/gitlab-runner/aws"
  version = "7.12.1"

  environment = "gitlab-runners-fleet"

  vpc_id    = data.aws_vpc.vpc.id
  subnet_id = data.aws_subnets.example_subnets.ids[0]

  iam_permissions_boundary = "POLICY-PERMISSION-BOUNDARY"

  runner_gitlab = {
    url            = "https://example.mycompany.com/"
    runner_version = "16.10.0"
    preregistered_runner_token_ssm_parameter_name = "example-gitlab-runners-fleet-preregistered-token"
  }

  runner_manager = {
    maximum_concurrent_jobs = 10
  }

  runner_instance = {
    name = "gitlab-run"
    root_device_config = {
      volume_size = 100
      volume_type = "gp3"
    }
    ssm_access = true
  }
  runner_worker = {
    ssm_access            = true
    max_jobs              = 10
    request_concurrency   = 10
    type                  = "docker-autoscaler"
    environment_variables = [
      "AWS_REGION=us-west-2",
      "AWS_SDK_LOAD_CONFIG=true",
      "DOCKER_AUTH_CONFIG={\"auths\":{\"https://index.docker.io/v1/\":{\"auth\":\"${var.docker_auth_token}\"}}, \"credHelpers\": {\"${data.aws_caller_identity.current.account_id}.dkr.ecr.us-west-2.amazonaws.com\": \"ecr-login\"}}",
    ]
  }

  runner_worker_docker_autoscaler_instance = {
    root_size   = 100
    volume_type = "gp3"
    monitoring  = true
  }

  runner_worker_docker_autoscaler_role = {
    policy_arns = ["arn:aws:iam::1111111111:policy/somepolicy", "arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryFullAccess"]
  }
  runner_worker_docker_autoscaler_asg = {
    on_demand_percentage_above_base_capacity = 0
    spot_allocation_strategy                 = "on-demand-price"
    enable_mixed_instance_policy             = true
    idle_time                                = 600
    subnet_ids                               = data.aws_subnets.example_subnets.ids
    types                                    = ["c5.large", "c5.xlarge", "c5.2xlarge", "c5.4xlarge"]
    volume_type                              = "gp3"
    private_address_only                     = true
    ebs_optimized                            = true
    root_size                                = 100
    sg_ingresses = [
      {
        description = "Allow all traffic within VPC and across local (TEST PURPOSE)"
        from_port   = 0
        to_port     = 65535
        protocol    = "tcp"
        cidr_blocks = ["10.0.0.0/8"]
      }
    ]
  }

  runner_worker_docker_options = {
    volumes = ["/cache", "/var/run/docker.sock:/var/run/docker.sock"]
  }

  runner_worker_docker_autoscaler = {
    connector_config_user = "ubuntu"
  }

  runner_ami_owners = ["2222222222"]

  runner_ami_filter = {
    "tag:Name" = ["example-amazon-linux-ami"]
  }

  runner_worker_docker_autoscaler_ami_owners = ["1111111111"]
  runner_worker_docker_autoscaler_ami_filter = {
    "tag:Name" = ["example-ubuntu-ami"]
  }

  runner_worker_docker_autoscaler_autoscaling_options = [
    {
      periods      = ["* * * * *"]
      timezone     = "UTC"
      idle_count   = 1
      idle_time    = "600s"
      scale_factor = 2
    }
  ]
  debug = {
    trace_runner_user_data         = true
    write_runner_config_to_file    = true
    write_runner_user_data_to_file = true
  }

}
  2. Trigger a pipeline that uses this Runner. The job fails with the following error:
ERROR: Failed to remove network for build 
ERROR: Preparation failed: preparing environment: dial ssh: after retrying 30 times: ssh: handshake failed: ssh: unable to authenticate, attempted methods [none publickey], no supported methods remain
  3. Log in to both the Runner Manager and the Worker and check the logs. The Worker's sshd log shows a "key not found" error when the Manager tries to connect. (Note that the authorized_keys file does contain the runner-worker-key public key, but I believe the key pair on the Runner Manager, from which the SSH connection originates, is missing.)
sshd[3781]: debug1: trying public key file /home/ubuntu/.ssh/authorized_keys
sshd[3781]: debug1: fd 9 clearing O_NONBLOCK
sshd[3781]: debug2: key not found
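
One way to test the suspicion above is to compare key fingerprints: sshd logs "key not found" when the key offered by the client matches no entry in authorized_keys. The following is a minimal Python sketch, not part of the module or of GitLab Runner, that computes the SHA256 fingerprint of each authorized_keys entry in the same format `ssh-keygen -l` prints. The function names are hypothetical, and it assumes no entry carries an option string containing spaces.

```python
import base64
import hashlib

# Prefixes of OpenSSH public key type names (ssh-rsa, ssh-ed25519, ecdsa-sha2-*, sk-*).
KEY_TYPE_PREFIXES = ("ssh-", "ecdsa-", "sk-")


def fingerprint(public_key_line):
    """Return the SHA256 fingerprint of one public key line, as ssh-keygen -l shows it.

    Assumption: any leading options contain no spaces, so a plain split() is enough.
    """
    fields = public_key_line.split()
    # The base64 key blob is the field right after the key type field.
    for i, field in enumerate(fields[:-1]):
        if field.startswith(KEY_TYPE_PREFIXES):
            blob = base64.b64decode(fields[i + 1], validate=True)
            digest = hashlib.sha256(blob).digest()
            # OpenSSH prints the digest base64-encoded without '=' padding.
            return "SHA256:" + base64.b64encode(digest).decode().rstrip("=")
    raise ValueError("no key type field found")


def authorized_fingerprints(authorized_keys_text):
    """Return the set of fingerprints for all parseable entries in an authorized_keys file."""
    result = set()
    for line in authorized_keys_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        try:
            result.add(fingerprint(line))
        except ValueError:
            continue  # skip lines that are not valid key entries
    return result


def is_authorized(offered_fingerprint, authorized_keys_text):
    """True if the fingerprint of the offered key matches an authorized_keys entry."""
    return offered_fingerprint in authorized_fingerprints(authorized_keys_text)
```

To use it, pass the contents of /home/ubuntu/.ssh/authorized_keys from the Worker to `authorized_fingerprints` and compare the result with the fingerprint of the key the Runner Manager actually offers, if that fingerprint shows up in the sshd log or can be obtained on the Manager. A mismatch would be consistent with the Manager presenting a different key than runner-worker-key.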

Expected behavior

The Runner Manager should connect to the Worker without errors and run the job.

chovdary123, Aug 15 '24 18:08