ecs-run-task

Containers that fail with no reason cause a panic

Open · sherzberg opened this issue 5 years ago · 10 comments

See https://github.com/buildkite/ecs-run-task/issues/25#issuecomment-551333825 for more background.

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x8 pc=0x895b85]

goroutine 1 [running]:
github.com/buildkite/ecs-run-task/runner.writeContainerFinishedMessage(0xb226e0, 0xc00009c010, 0xc0002161c0, 0xc0001bd9e0, 0xc0001b82a0, 0x3a, 0x0)
	/Users/lachlan/go/src/github.com/buildkite/ecs-run-task/runner/runner.go:261 +0x155
github.com/buildkite/ecs-run-task/runner.(*Runner).Run(0xc0000ac0e0, 0xb226e0, 0xc00009c010, 0x0, 0x1)
	/Users/lachlan/go/src/github.com/buildkite/ecs-run-task/runner/runner.go:219 +0x12dd
main.main.func1(0xc0000e8580, 0x0, 0x0)
	/Users/lachlan/go/src/github.com/buildkite/ecs-run-task/main.go:115 +0x625
github.com/urfave/cli.HandleAction(0x93db80, 0xa17268, 0xc0000e8580, 0x0, 0x0)
	/Users/lachlan/go/pkg/mod/github.com/urfave/[email protected]/app.go:490 +0xc8
github.com/urfave/cli.(*App).Run(0xc00014cea0, 0xc0000ac000, 0xe, 0xe, 0x0, 0x0)
	/Users/lachlan/go/pkg/mod/github.com/urfave/[email protected]/app.go:264 +0x57c
main.main()
	/Users/lachlan/go/src/github.com/buildkite/ecs-run-task/main.go:125 +0x8c3

From the AWS console, this looks like a case where ECS doesn't even get to the point of launching a container, so we might be able to fall back to the ecs.Task.StoppedReason.
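
For illustration, a nil-guard along these lines (a sketch only, with a hypothetical helper name, not the actual runner.go code or any submitted PR) would avoid dereferencing a missing exit code and report the task-level stopped reason instead:

package runner

import (
	"fmt"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/ecs"
)

// containerFinishedMessage is a hypothetical helper: when ECS stops a task
// before the container ever runs, Container.ExitCode is nil, so fall back
// to the task-level StoppedReason instead of dereferencing a nil pointer.
func containerFinishedMessage(task *ecs.Task, container *ecs.Container) string {
	if container.ExitCode == nil {
		return fmt.Sprintf("Container %s stopped without an exit code: %s",
			aws.StringValue(container.Name), aws.StringValue(task.StoppedReason))
	}
	return fmt.Sprintf("Container %s exited with code %d",
		aws.StringValue(container.Name), *container.ExitCode)
}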

sherzberg avatar Nov 08 '19 16:11 sherzberg

Here is some more info around this particular panic.

Output from aws ecs describe-tasks --tasks REDACTED --cluster REDACTED

{
	"tasks": [{
		"taskArn": "arn:aws:ecs:us-east-1:REDACTED:task/REDACTED",
		"clusterArn": "arn:aws:ecs:us-east-1:REDACTED:cluster/REDACTED",
		"taskDefinitionArn": "arn:aws:ecs:us-east-1:REDACTED:task-definition/deREDACTED:REDACTED",
		"overrides": {
			"containerOverrides": [{
				"name": "app",
				"command": [
					"REDACTED"
				],
				"environment": [

				]
			}]
		},
		"lastStatus": "STOPPED",
		"desiredStatus": "STOPPED",
		"cpu": "256",
		"memory": "512",
		"containers": [{
			"containerArn": "arn:aws:ecs:us-east-1:REDACTED:container/REDACTED",
			"taskArn": "arn:aws:ecs:us-east-1:REDACTED:task/REDACTED",
			"name": "app",
			"lastStatus": "STOPPED",
			"networkInterfaces": [{
				"attachmentId": "REDACTED",
				"privateIpv4Address": "REDACTED"
			}],
			"healthStatus": "UNKNOWN",
			"cpu": "0"
		}],
		"version": 4,
		"stoppedReason": "Timeout waiting for network interface provisioning to complete.",
		"connectivity": "CONNECTED",
		"connectivityAt": 1573233270.396,
		"createdAt": 1573233066.948,
		"stoppingAt": 1573233252.398,
		"stoppedAt": 1573233282.065,
		"group": "family:REDACTED",
		"launchType": "FARGATE",
		"platformVersion": "1.3.0",
		"attachments": [{
			"id": "REDACTED",
			"type": "ElasticNetworkInterface",
			"status": "DELETED",
			"details": [{
					"name": "subnetId",
					"value": "REDACTED"
				},
				{
					"name": "networkInterfaceId",
					"value": "REDACTED"
				},
				{
					"name": "macAddress",
					"value": "REDACTED"
				},
				{
					"name": "privateIPv4Address",
					"value": "REDACTED"
				}
			]
		}],
		"healthStatus": "UNKNOWN",
		"tags": []
	}],
	"failures": []
}

sherzberg avatar Nov 08 '19 17:11 sherzberg

Not sure, but it's possible I have a fix - testing it now.

When waiting for the task to finish there is a built-in max-attempts mechanism (100 by default) and a built-in delay between attempts (6 seconds by default), so the waiter gives up after about 10 minutes. So it seems any job that takes more than 10 minutes will get a "ResourceNotReady: exceeded wait attempts".

You can try changing svc.WaitUntilTasksStopped to

err = svc.WaitUntilTasksStoppedWithContext(
	ctx,
	&ecs.DescribeTasksInput{
		Cluster: aws.String(r.Cluster),
		Tasks:   taskARNs,
	},

	// >>>>>>>>>> THESE <<<<<<<<<
	request.WithWaiterMaxAttempts(1),
	request.WithWaiterDelay(func(attempt int) time.Duration {
		return time.Second * 1
	}),
	// >>>>>>>>>>>> REPRODUCE THE ERROR <<<<<<<<<<<<
)

And it will instantly throw the error above.

So it seems the fix would be to expose "delay" and "max-attempts" as configurable parameters, as well as provide a longer default value; a sketch of that is below.
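
A minimal sketch of that idea, assuming the v1 SDK's waiter options and hypothetical delay/maxAttempts parameters (ecs-run-task does not currently expose these as flags):

package runner

import (
	"context"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/request"
	"github.com/aws/aws-sdk-go/service/ecs"
	"github.com/aws/aws-sdk-go/service/ecs/ecsiface"
)

// waitForTasksStopped makes the waiter configurable instead of relying on
// the SDK defaults (6s delay x 100 attempts = 10 minutes). The delay and
// maxAttempts arguments would come from new CLI flags.
func waitForTasksStopped(ctx context.Context, svc ecsiface.ECSAPI, cluster string, taskARNs []*string, delay time.Duration, maxAttempts int) error {
	return svc.WaitUntilTasksStoppedWithContext(
		ctx,
		&ecs.DescribeTasksInput{
			Cluster: aws.String(cluster),
			Tasks:   taskARNs,
		},
		request.WithWaiterDelay(request.ConstantWaiterDelay(delay)),
		request.WithWaiterMaxAttempts(maxAttempts),
	)
}

For example, a delay of 10 seconds with 360 attempts would let a task run for up to an hour before the waiter gives up.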

Eli-Goldberg avatar Feb 16 '20 11:02 Eli-Goldberg

@Eli-Goldberg have you tried out the PR I submitted? Does it fix your issue? https://github.com/buildkite/ecs-run-task/pull/27

sherzberg avatar Feb 18 '20 03:02 sherzberg

/cc @pda

lox avatar Feb 19 '20 03:02 lox

So it seems any job that takes more than 10 minutes will get a "ResourceNotReady: exceeded wait attempts".

I can confirm that -- seems like there's no way to run an ECS task that takes more than 10 mins :(

daroczig avatar Mar 11 '20 22:03 daroczig

So it seems any job that takes more than 10 minutes will get a "ResourceNotReady: exceeded wait attempts".

I can confirm that -- seems like there's no way to run an ECS task that takes more than 10 mins :(

I have a tested fix, will open a PR today

Eli-Goldberg avatar Mar 12 '20 05:03 Eli-Goldberg

Hey @Eli-Goldberg did you ever land the fix for this issue? I seem to be hitting this issue reasonably frequently.

dannymidnight avatar Jun 02 '20 05:06 dannymidnight

Yeah, sorry, I forgot to open a PR. I'll do that in a bit :)


Eli-Goldberg avatar Jun 02 '20 05:06 Eli-Goldberg

Awesome! Thanks :)

dannymidnight avatar Jun 02 '20 06:06 dannymidnight

I've opened a PR: https://github.com/buildkite/ecs-run-task/pull/35.

Eli-Goldberg avatar Jun 02 '20 09:06 Eli-Goldberg