terraform-aws-gitlab-runner

Job error

Open nomopo45 opened this issue 2 years ago • 15 comments

Hello,

I just tried this module recently; I managed to create my EC2 instance and have it automatically register on GitLab.

The problem is that each time I try to run a job (for now just a hello-world job), it fails with an error:

Running with gitlab-runner 14.8.3 (16ae0625)
  on uat-runner okQssjvP
Preparing the "docker+machine" executor
00:35
Using Docker executor with image docker:18.03.1-ce ...
Pulling docker image docker:18.03.1-ce ...
Using docker image sha256:7c1527e8e59b80ed43f6c425c03cd9cb46b873d7db02d0c47db01b5f58839c8d for docker:18.03.1-ce with digest docker@sha256:bdeaddc74da33d02b2e7064e9050cd1aaa43c472341688cb1402a027f3f5efa7 ...
Preparing environment
00:17
ERROR: Job failed: prepare environment: exit code 255. Check https://docs.gitlab.com/runner/shells/index.html#shell-profile-loading for more information

OR

Running with gitlab-runner 14.8.3 (16ae0625)
  on uat-runner okQssjvP
Preparing the "docker+machine" executor
01:29
Using Docker executor with image docker:18.03.1-ce ...
ERROR: Job failed: adding cache volume: set volume permissions: running permission container "40f2ea08d78568454607935d8a5a4cd9936fc0c7ce5a2704f54a1dda5de73513" for volume "runner-okqssjvp-project-2-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70": waiting for permission container to finish: exit code 255

OR

Running with gitlab-runner 14.8.3 (16ae0625)
  on uat-runner okQssjvP
Preparing the "docker+machine" executor
01:47
Using Docker executor with image docker:18.03.1-ce ...
Pulling docker image docker:18.03.1-ce ...
Using docker image sha256:7c1527e8e59b80ed43f6c425c03cd9cb46b873d7db02d0c47db01b5f58839c8d for docker:18.03.1-ce with digest docker@sha256:bdeaddc74da33d02b2e7064e9050cd1aaa43c472341688cb1402a027f3f5efa7 ...
Preparing environment
00:16
Getting source from Git repository
00:16
Executing "step_script" stage of the job script
00:01
Using docker image sha256:7c1527e8e59b80ed43f6c425c03cd9cb46b873d7db02d0c47db01b5f58839c8d for docker:18.03.1-ce with digest docker@sha256:bdeaddc74da33d02b2e7064e9050cd1aaa43c472341688cb1402a027f3f5efa7 ...
Cleaning up project directory and file based variables
00:16
ERROR: Job failed: exit code 139

Here is the config I use:

    aws_region  = local.common_vars.inputs.region
    environment = local.common_vars.inputs.environment
    name = "${local.common_vars.inputs.projet}-${replace(local.common_vars.inputs.region,"-","")}-${local.common_vars.inputs.environment}-runner"

    key_name = local.common_vars.inputs.key

    vpc_id                   = dependency.vpc.outputs.vpc_id
    subnet_ids_gitlab_runner = dependency.vpc.outputs.public_subnets
    subnet_id_runners        = element(dependency.vpc.outputs.public_subnets, 0)

    runners_name       = "${local.common_vars.inputs.projet}-${replace(local.common_vars.inputs.region,"-","")}-${local.common_vars.inputs.environment}-runner"
    runners_gitlab_url = "https://${dependency.gitlab.outputs.hostname}"

    gitlab_runner_registration_config = {
      registration_token = "toeken"
      tag_list           = "docker"
      description        = "runner ec2 default"
      locked_to_project  = "true"
      run_untagged       = "false"
      maximum_timeout    = "3600"
    }
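For context, a minimal hello-world job of the kind described (the actual `.gitlab-ci.yml` was not shared, so this is a hypothetical sketch; the tag matches the runner's `tag_list` above) might look like:

```yaml
# Hypothetical .gitlab-ci.yml; the "docker" tag routes the job to this runner.
hello-world:
  tags:
    - docker
  script:
    - echo "Hello, world!"
```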

Do you have any idea what could be causing these errors, or where I could look for logs?

nomopo45 avatar Jun 08 '22 15:06 nomopo45

Would be nice to see the Terraform plan and the definition of the hello world job.

gitlab.domain.com looks weird in the logs.

kayman-mk avatar Jun 08 '22 20:06 kayman-mk

This seems to be some incompatibility between the latest versions of Docker, Docker Machine, Ubuntu (maybe) and GitLab Runner.

Details here: https://gitlab.com/gitlab-org/gitlab-runner/-/issues/26564

JulianCBC avatar Jun 08 '22 23:06 JulianCBC

This fix seems to have fixed it for me: https://gitlab.com/gitlab-org/gitlab-runner/-/issues/26564#note_977593368

JulianCBC avatar Jun 09 '22 00:06 JulianCBC

Hey @JulianCBC

It seems to be a kernel bug. These are the logs from my docker-machine host, created by the agent:

[Thu Jun  9 01:01:05 2022] Initializing XFRM netlink socket
[Thu Jun  9 01:01:12 2022] docker0: port 1(veth8a9a81d) entered blocking state
[Thu Jun  9 01:01:12 2022] docker0: port 1(veth8a9a81d) entered disabled state
[Thu Jun  9 01:01:12 2022] device veth8a9a81d entered promiscuous mode
[Thu Jun  9 01:01:13 2022] eth0: renamed from veth5d0d453
[Thu Jun  9 01:01:13 2022] IPv6: ADDRCONF(NETDEV_CHANGE): veth8a9a81d: link becomes ready
[Thu Jun  9 01:01:13 2022] docker0: port 1(veth8a9a81d) entered blocking state
[Thu Jun  9 01:01:13 2022] docker0: port 1(veth8a9a81d) entered forwarding state
[Thu Jun  9 01:01:13 2022] IPv6: ADDRCONF(NETDEV_CHANGE): docker0: link becomes ready
[Thu Jun  9 01:01:13 2022] ------------[ cut here ]------------
[Thu Jun  9 01:01:13 2022] kernel BUG at include/linux/fs.h:3104!
[Thu Jun  9 01:01:13 2022] invalid opcode: 0000 [#1] SMP NOPTI
[Thu Jun  9 01:01:13 2022] CPU: 1 PID: 929 Comm: gitlab-runner-h Not tainted 5.13.0-1028-aws #31~20.04.1-Ubuntu
[Thu Jun  9 01:01:13 2022] Hardware name: Amazon EC2 t3a.medium/, BIOS 1.0 10/16/2017
[Thu Jun  9 01:01:13 2022] RIP: 0010:__fput+0x247/0x250
[Thu Jun  9 01:01:13 2022] Code: 00 48 85 ff 0f 84 8b fe ff ff f6 c7 40 0f 85 82 fe ff ff e8 ab 38 00 00 e9 78 fe ff ff 4c 89 f7 e8 2e 8802 00 e9 b5 fe ff ff <0f> 0b 0f 1f 80 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 53 31 db 48
[Thu Jun  9 01:01:13 2022] RSP: 0018:ffffb7f180bebe30 EFLAGS: 00010246
[Thu Jun  9 01:01:13 2022] RAX: 0000000000000000 RBX: 00000000000a801d RCX: ffffa0954341c000
[Thu Jun  9 01:01:13 2022] RDX: ffffa09545973280 RSI: 0000000000000001 RDI: 0000000000000000
[Thu Jun  9 01:01:13 2022] RBP: ffffb7f180bebe58 R08: ffffa09545b5eb40 R09: ffffa0954fbc0570
[Thu Jun  9 01:01:13 2022] R10: ffffb7f180bebe30 R11: ffffa0954fdea510 R12: ffffa0954fdea500
[Thu Jun  9 01:01:13 2022] R13: ffffa0954fbc0570 R14: ffffa095459732a0 R15: ffffa0955f481900
[Thu Jun  9 01:01:13 2022] FS:  0000000000000000(0000) GS:ffffa09578d00000(0000) knlGS:0000000000000000
[Thu Jun  9 01:01:13 2022] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[Thu Jun  9 01:01:13 2022] CR2: 00007ffdea280f79 CR3: 000000011fd7a000 CR4: 00000000003506e0
[Thu Jun  9 01:01:13 2022] Call Trace:
[Thu Jun  9 01:01:13 2022]  <TASK>
[Thu Jun  9 01:01:13 2022]  ____fput+0xe/0x10
[Thu Jun  9 01:01:13 2022]  task_work_run+0x70/0xb0
[Thu Jun  9 01:01:13 2022]  exit_to_user_mode_prepare+0x1b5/0x1c0
[Thu Jun  9 01:01:13 2022]  syscall_exit_to_user_mode+0x27/0x50
[Thu Jun  9 01:01:13 2022]  do_syscall_64+0x6e/0xb0
[Thu Jun  9 01:01:13 2022]  ? do_syscall_64+0x6e/0xb0
[Thu Jun  9 01:01:13 2022]  ? irqentry_exit_to_user_mode+0x9/0x20
[Thu Jun  9 01:01:13 2022]  ? irqentry_exit+0x19/0x30
[Thu Jun  9 01:01:13 2022]  ? sysvec_reschedule_ipi+0x7e/0xf0
[Thu Jun  9 01:01:13 2022]  ? asm_sysvec_reschedule_ipi+0xa/0x20
[Thu Jun  9 01:01:13 2022]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[Thu Jun  9 01:01:13 2022] RIP: 0033:0x466100
[Thu Jun  9 01:01:13 2022] Code: Unable to access opcode bytes at RIP 0x4660d6.
[Thu Jun  9 01:01:13 2022] RSP: 002b:00007ffdea280dd0 EFLAGS: 00000200 ORIG_RAX: 000000000000003b
[Thu Jun  9 01:01:13 2022] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[Thu Jun  9 01:01:13 2022] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[Thu Jun  9 01:01:13 2022] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
[Thu Jun  9 01:01:13 2022] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[Thu Jun  9 01:01:13 2022] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[Thu Jun  9 01:01:13 2022]  </TASK>
[Thu Jun  9 01:01:13 2022] Modules linked in: veth xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo xt_addrtype iptable_filter iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c bpfilter br_netfilter bridge stp llc aufs overlay nls_iso8859_1 dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua ppdev crct10dif_pclmul crc32_pclmul ghash_clmulni_intel psmouse input_leds aesni_intel crypto_simd cryptd serio_raw ena parport_pc parport sch_fq_codel ipmi_devintf ipmi_msghandler msr drm ip_tables x_tables autofs4
[Thu Jun  9 01:01:13 2022] ---[ end trace 87ca6f1d500d57c3 ]---
[Thu Jun  9 01:01:13 2022] RIP: 0010:__fput+0x247/0x250
[Thu Jun  9 01:01:13 2022] Code: 00 48 85 ff 0f 84 8b fe ff ff f6 c7 40 0f 85 82 fe ff ff e8 ab 38 00 00 e9 78 fe ff ff 4c 89 f7 e8 2e 8802 00 e9 b5 fe ff ff <0f> 0b 0f 1f 80 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 53 31 db 48
[Thu Jun  9 01:01:13 2022] RSP: 0018:ffffb7f180bebe30 EFLAGS: 00010246
[Thu Jun  9 01:01:13 2022] RAX: 0000000000000000 RBX: 00000000000a801d RCX: ffffa0954341c000
[Thu Jun  9 01:01:13 2022] RDX: ffffa09545973280 RSI: 0000000000000001 RDI: 0000000000000000
[Thu Jun  9 01:01:13 2022] RBP: ffffb7f180bebe58 R08: ffffa09545b5eb40 R09: ffffa0954fbc0570
[Thu Jun  9 01:01:13 2022] R10: ffffb7f180bebe30 R11: ffffa0954fdea510 R12: ffffa0954fdea500
[Thu Jun  9 01:01:13 2022] R13: ffffa0954fbc0570 R14: ffffa095459732a0 R15: ffffa0955f481900
[Thu Jun  9 01:01:13 2022] FS:  0000000000000000(0000) GS:ffffa09578d00000(0000) knlGS:0000000000000000
[Thu Jun  9 01:01:13 2022] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[Thu Jun  9 01:01:13 2022] CR2: 00000000004660d6 CR3: 000000011fd7a000 CR4: 00000000003506e0
[Thu Jun  9 01:01:13 2022] docker0: port 1(veth8a9a81d) entered disabled state
[Thu Jun  9 01:01:13 2022] veth5d0d453: renamed from eth0
[Thu Jun  9 01:01:13 2022] docker0: port 1(veth8a9a81d) entered disabled state
[Thu Jun  9 01:01:13 2022] device veth8a9a81d left promiscuous mode
[Thu Jun  9 01:01:13 2022] docker0: port 1(veth8a9a81d) entered disabled state
[Thu Jun  9 01:01:19 2022] loop5: detected capacity change from 0 to 8

Kernel and Docker versions:

$ uname -romi
5.13.0-1028-aws x86_64 x86_64 GNU/Linux
$ docker --version
Docker version 20.10.17, build 100c701

In my tests, the runner agent couldn't connect properly to the docker-machine host when a Job was running:

sh-4.2$ sudo docker-machine ls
NAME                                         ACTIVE   DRIVER      STATE     URL                        SWARM   DOCKER    ERRORS
runner-xqnnexbf-runner-1654735954-094efa94   -        amazonec2   Running   tcp://10.48.102.185:2376           Unknown   Unable to query docker version: Cannot connect to the docker engine endpoint

Once the job ended, it could connect without issues:

sh-4.2$ sudo docker-machine ls
NAME                                         ACTIVE   DRIVER      STATE     URL                        SWARM   DOCKER      ERRORS
runner-xqnnexbf-runner-1654735954-094efa94   -        amazonec2   Running   tcp://10.48.102.185:2376           v20.10.17

Before seeing your reply, I had already changed the Ubuntu version to 22.04:

   runner_ami_filter = {
-    name = ["ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-*"]
+    name = ["ubuntu/images/hvm-ssd/ubuntu-jammy-22.04-amd64-server-*"]
   }

It worked as well 👍🏼

hiago-miguel avatar Jun 09 '22 01:06 hiago-miguel

Thanks @hiago-miguel that's very illuminating!

In the interests of adding to my knowledge on how to manage this, where did you get those kernel logs?

JulianCBC avatar Jun 09 '22 01:06 JulianCBC

@kayman-mk @npalm should we just update the AMI used by the instances Docker Machine spins up to 22.04? Is there any specific reason why we're still using 20.04? (I do note that the default AMIs used by GitLab's Docker Machine fork are 20.04)

JulianCBC avatar Jun 09 '22 01:06 JulianCBC

@JulianCBC the kernel logs are from the docker-machine host created by the runner agent.

(screenshot)

I accessed the host using AWS SSM. This can be done from the AWS console:

  • Select the EC2 instance, click on Actions, and then Connect

(screenshot)
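The same can be done from the CLI; a sketch (the instance ID is a placeholder, and the Session Manager plugin for the AWS CLI must be installed):

```shell
# Start a shell on the docker-machine host via SSM (no SSH key needed).
# i-0123456789abcdef0 is a placeholder -- use the runner-spawned instance's ID.
aws ssm start-session --target i-0123456789abcdef0

# Inside the session, the kernel oops shows up in the ring buffer:
sudo dmesg -T | tail -n 100
```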

hiago-miguel avatar Jun 09 '22 01:06 hiago-miguel

Would be nice to see the Terraform plan and the definition of the hello world job.

gitlab.domain.com looks weird in the logs.

I just changed it to hide my real domain; the original logs show my real domain.

nomopo45 avatar Jun 09 '22 07:06 nomopo45

> It seems a kernel bug. That's the logs from my docker-machine host created by the agent: […] So before viewing your reply, I changed the ubuntu version to 22.04 […] It worked as well 👍🏼

I made the change from 20.04 to 22.04 and it's working!

Thanks a lot for your help!

nomopo45 avatar Jun 09 '22 09:06 nomopo45

I've been facing the same issue and solved it by updating to Ubuntu 22.04. But I don't think this issue should be closed yet, since new users will be using 20.04 (the default runner_ami_filter) and will hit the same bug.

ghost avatar Jun 09 '22 09:06 ghost

Interesting. Just checked my configuration. I am using the defaults and everything works. No problem. I am using version 5.0.2 of the module.

kayman-mk avatar Jun 19 '22 11:06 kayman-mk

@kayman-mk when was the last time you refreshed your configuration? The AMIs are selected when the configuration is applied, not dynamically as jobs run.

It's also possible that the broken AMIs have been replaced with newer ones that don't have this issue.

Which AMI are you using?

JulianCBC avatar Jun 19 '22 11:06 JulianCBC

Good point, Julian. The runner I checked was set up last Friday. The AMI is ami-0929b2e28d090f63f; it was created June 14th.

kayman-mk avatar Jun 19 '22 11:06 kayman-mk

Weird. That's an Amazon Linux 2 AMI, is that what the spot instances the runner spins up are using?
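If it helps, the name behind an AMI ID can be looked up with the AWS CLI (the region here is a guess; use the one the runner deploys into):

```shell
# Resolve the AMI ID mentioned above to its name, owner, and creation date.
aws ec2 describe-images \
  --region eu-west-1 \
  --image-ids ami-0929b2e28d090f63f \
  --query 'Images[0].{Name:Name,Owner:OwnerId,Created:CreationDate}'
```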

JulianCBC avatar Jun 19 '22 12:06 JulianCBC

I relaunched yesterday and started getting TLS certificate errors with 22.04. I haven't had a chance to investigate deeply, but pinning to an earlier AMI worked:

  runner_ami_filter = {
    image-id = ["ami-06bbbd4e89b66f400"] 
  }

internetstaff avatar Jun 23 '22 13:06 internetstaff

I just encountered the same problem with ami-06bbbd4e89b66f400, which has worked for 6+ months.

It seems to be resolved again by allowing this filter to pull a newer image:

  runner_ami_filter = {
    name = ["ubuntu/images/hvm-ssd/ubuntu-jammy-22.04-amd64-server-*"]
  }

internetstaff avatar Jan 27 '23 15:01 internetstaff

Why not use the default AMI? I've never had problems with that.

kayman-mk avatar Jan 28 '23 10:01 kayman-mk

I'm closing this issue due to missing feedback.

I do not remember where I read it, but you are strongly advised not to choose an AMI yourself; stick with the default, as different AMIs may cause problems.

kayman-mk avatar Feb 15 '23 20:02 kayman-mk