terraform-aws-gitlab-runner

ERROR: Preparation failed: exit status 2

jimmy-outschool opened this issue 3 years ago • 13 comments

The following error occurs rather regularly during large scale-up events. When something like 30-80 new jobs appear beyond the current number of runners, many of the jobs fail with this error. Retrying the jobs succeeds, but it seems like something is not waiting long enough for the new machines to start up.

I am unclear on whether the fault lies with the primary runner, docker-machine, or EC2. I have looked through the GitLab Runner code for a timeout to increase, but haven't come across a useful one.

If this is outside the project's scope that's fine, but any insight into how to avoid it would be appreciated.

Running with gitlab-runner 14.0.1 (c1edb478)
  on docker-default sgLT1ihz
section_start:1645715193:resolve_secrets
Resolving secrets
section_end:1645715193:resolve_secrets
section_start:1645715193:prepare_executor
Preparing the "docker+machine" executor
ERROR: Preparation failed: exit status 2
Will be retried in 3s ...
ERROR: Preparation failed: can't connect
Will be retried in 3s ...
ERROR: Preparation failed: exit status 2
Will be retried in 3s ...
section_end:1645715246:prepare_executor
ERROR: Job failed (system failure): exit status 2
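
For anyone searching for the relevant knobs: a minimal sketch of the module inputs that appear to govern scale-up behaviour, assuming the variable names from the 5.x/6.x releases of this module (verify against the version in use). None of these is a direct "Preparation" timeout; they only control how many warm workers exist and how many jobs run at once.

module "gitlab_runner" {
  source = "npalm/gitlab-runner/aws" # assumed source; pin a version in real use

  # Warm, idle docker-machine workers so bursts of jobs don't all wait on fresh EC2 boots.
  runners_idle_count = 4   # maps to IdleCount in the generated config.toml
  runners_idle_time  = 600 # maps to IdleTime (seconds)

  # Caps on how many jobs run at the same time.
  runners_concurrent = 10
  runners_limit      = 20
}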

jimmy-outschool avatar Feb 24 '22 15:02 jimmy-outschool

Tracked this down to the primary instance (a t3.micro) exhausting memory when scaling up rapidly. Upgrading to a t3.small resolved it. This might be worth documenting, along the lines of: depending on your job load, a larger primary instance may be needed if you see "Preparation failed ...".
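
For anyone hitting this, a minimal sketch of the two instance-type arguments involved, assuming the registry source npalm/gitlab-runner/aws and illustrative sizes; adjust the source, version, and sizes to your own setup.

module "gitlab_runner" {
  source = "npalm/gitlab-runner/aws" # assumed source; pin a version in real use

  # Primary (agent) instance that runs gitlab-runner and docker-machine.
  # A t3.micro can exhaust memory during rapid scale-ups; t3.small or larger helps.
  instance_type = "t3.small"

  # Instance type for the ephemeral docker-machine workers (a separate setting).
  docker_machine_instance_type = "m5.large" # illustrative value
}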

jimmy-outschool avatar Mar 03 '22 17:03 jimmy-outschool

We are seeing a similar intermittent error both before and after migrating from a t3.micro to a t3.small.

Running with gitlab-runner 14.10.0 (c6bb62f6)
  on default-auto DyJk7Eeo
Resolving secrets 00:00
Preparing the "docker+machine" executor 00:49
ERROR: Preparation failed: exit status 1
Will be retried in 3s ...
ERROR: Preparation failed: exit status 1
Will be retried in 3s ...
ERROR: Preparation failed: exit status 1
Will be retried in 3s ...
ERROR: Job failed (system failure): exit status 1

gaslitbytech avatar Apr 27 '22 04:04 gaslitbytech

To make sure I'm clear: I am talking about the primary instance's instance_type, not docker_machine_instance_type. Also, that is exit status 1, which if I recall correctly is a slightly different issue.

Could you pull up the resource status (such as memory) on the primary instance, along with its logs?

jimmy-outschool avatar Apr 27 '22 14:04 jimmy-outschool

We seem to have plenty of memory:

Screenshot from 2022-04-28 14-05-45

gaslitbytech avatar Apr 29 '22 00:04 gaslitbytech

I work with @tourdownunder; the problem ended up being due to AWS Spot Instance limits.

jamesmstone avatar May 24 '22 09:05 jamesmstone

@jamesmstone can we close this issue?

npalm avatar Jul 20 '22 21:07 npalm

@npalm Nope, I'm facing the same issue. My runners are working well with the m4.large instance type, but when I increase the AWS EC2 instance type to c7g.2xlarge, for example, I hit the same error:

ERROR: Preparation failed: exit status 1
Will be retried in 3s ...
ERROR: Preparation failed: exit status 1
Will be retried in 3s ...
ERROR: Preparation failed: exit status 1
Will be retried in 3s ...
ERROR: Job failed (system failure): exit status 1

BehbudSh avatar Oct 04 '22 16:10 BehbudSh

And I'm not using AWS Spot instances

BehbudSh avatar Oct 04 '22 16:10 BehbudSh

But what size is the primary instance, not the workers? That is where the issue originates.

jimmy-outschool avatar Oct 04 '22 16:10 jimmy-outschool

docker_machine_instance_type = "c7g.2xlarge"
instance_type                = "t3.medium"

BehbudSh avatar Oct 04 '22 16:10 BehbudSh

This is the config I'm using for the runner itself and for the on-demand workers.

BehbudSh avatar Oct 04 '22 16:10 BehbudSh

Is there anything that I'm missing?

BehbudSh avatar Oct 04 '22 16:10 BehbudSh

I missed that this is exit status 1, not exit status 2, which are different failures if I recall correctly.

jimmy-outschool avatar Oct 04 '22 17:10 jimmy-outschool

I work with @tourdownunder; the problem ended up being due to AWS Spot Instance limits.

Can you remember what the issue was and how you solved it? I'm having the same problem.

mark-webster-catalyst avatar Dec 14 '22 14:12 mark-webster-catalyst

We ended up reducing the rate at which we started new instances to align with the AWS default limits.



jamesmstone avatar Dec 14 '22 20:12 jamesmstone

We ended up reducing the rate at which we started new instances to align with the AWS default limits.

@jamesmstone This is very helpful, a couple of questions:

  1. Did this error start for you recently? It seems like there's more activity on this issue at a time when we are also experiencing this error out of nowhere. Coincidence?
     a. If "yes" to the above, do you have a sense of what underlying dependency changed to cause this problem?
  2. How did you limit the rate at which new instances are started? Can you give an example of your settings?

Thanks in advance!

tnightengale avatar Dec 15 '22 00:12 tnightengale

To clarify, I was also experiencing the exit status 1 error, not exit status 2.

For anyone who finds this: in my case it was invalid characters in the overrides block that caused it, after upgrading the module to 5.5.0. I had this block:

overrides = {
  name_docker_machine_runners = "gitlab_runner_spot_instance" # Underscores here caused the issue
}

And I found this in the CloudWatch Log Group after attempting to run a pipeline on the runner:

Dec 15 00:45:04 ip-172-31-5-167 gitlab-runner: 
{
    "driver": "amazonec2",
    "level": "error",
    "msg": "Error creating machine: Invalid hostname specified. Allowed hostname chars are: 0-9a-zA-Z . -",
    "name": "runner-u-ca1k6x-gitlab_runner_spot_instance-1671065104-4e0d8727",
    "operation": "create",
    "time": "2022-12-15T00:45:04Z"
}

Removing the invalid characters allowed the instances to be created without errors.
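
For reference, a corrected version of the block above; docker-machine hostnames only allow 0-9, a-z, A-Z, '.' and '-', so hyphens work where underscores do not:

overrides = {
  name_docker_machine_runners = "gitlab-runner-spot-instance" # hyphens instead of underscores
}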

tnightengale avatar Dec 15 '22 01:12 tnightengale

We ended up reducing the rate at which we started new instances to align with the AWS default limits.

@tnightengale How did you do that?

kayman-mk avatar Jan 03 '23 10:01 kayman-mk

@kayman-mk We reduced runners_concurrent (or was it runners_limit?) from 20 to 10. Though we could have (and may in the future) requested a Spot Instance limit increase, as per the AWS Spot Instance limits documentation, so we can use more.
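
A minimal sketch of that change, assuming the variable names mentioned above (check the exact names against the module version in use):

# Throttle how many jobs run at once so bursts of new docker-machine
# workers stay under the account's Spot Instance limits.
runners_concurrent = 10 # previously 20
# runners_limit = 10    # per-runner job limit; also worth checking in some versions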

gaslitbytech avatar Jan 05 '23 02:01 gaslitbytech

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 15 days.

github-actions[bot] avatar Mar 17 '23 02:03 github-actions[bot]

This issue was closed because it has been stalled for 15 days with no activity.

github-actions[bot] avatar Apr 02 '23 02:04 github-actions[bot]