ecs-run-task
Containers that fail with no reason cause a panic
See https://github.com/buildkite/ecs-run-task/issues/25#issuecomment-551333825 for more background.
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x8 pc=0x895b85]
goroutine 1 [running]:
github.com/buildkite/ecs-run-task/runner.writeContainerFinishedMessage(0xb226e0, 0xc00009c010, 0xc0002161c0, 0xc0001bd9e0, 0xc0001b82a0, 0x3a, 0x0)
/Users/lachlan/go/src/github.com/buildkite/ecs-run-task/runner/runner.go:261 +0x155
github.com/buildkite/ecs-run-task/runner.(*Runner).Run(0xc0000ac0e0, 0xb226e0, 0xc00009c010, 0x0, 0x1)
/Users/lachlan/go/src/github.com/buildkite/ecs-run-task/runner/runner.go:219 +0x12dd
main.main.func1(0xc0000e8580, 0x0, 0x0)
/Users/lachlan/go/src/github.com/buildkite/ecs-run-task/main.go:115 +0x625
github.com/urfave/cli.HandleAction(0x93db80, 0xa17268, 0xc0000e8580, 0x0, 0x0)
/Users/lachlan/go/pkg/mod/github.com/urfave/[email protected]/app.go:490 +0xc8
github.com/urfave/cli.(*App).Run(0xc00014cea0, 0xc0000ac000, 0xe, 0xe, 0x0, 0x0)
/Users/lachlan/go/pkg/mod/github.com/urfave/[email protected]/app.go:264 +0x57c
main.main()
/Users/lachlan/go/src/github.com/buildkite/ecs-run-task/main.go:125 +0x8c3
From the AWS console, this looks like a case where ECS doesn't even get to the point of launching a container, so we might be able to fall back to the ecs.Task.StoppedReason.
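As a rough illustration of that fallback (not the actual change in the linked PR, and the helper name is made up), the runner could guard against a nil container-level reason and use the task-level one instead, along these lines:

package runner

import (
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/ecs"
)

// stoppedReason is a hypothetical helper: it prefers the container-level
// Reason, but falls back to the task-level StoppedReason when ECS never
// got as far as starting the container (e.g. "Timeout waiting for network
// interface provisioning to complete."), so nothing dereferences a nil pointer.
func stoppedReason(task *ecs.Task, container *ecs.Container) string {
	if container != nil && container.Reason != nil {
		return aws.StringValue(container.Reason)
	}
	if task != nil && task.StoppedReason != nil {
		return aws.StringValue(task.StoppedReason)
	}
	return "unknown: ECS reported no stopped reason"
}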
Here is some more info about this particular panic.
Output from aws ecs describe-tasks --tasks REDACTED --cluster REDACTED:
{
"tasks": [{
"taskArn": "arn:aws:ecs:us-east-1:REDACTED:task/REDACTED",
"clusterArn": "arn:aws:ecs:us-east-1:REDACTED:cluster/REDACTED",
"taskDefinitionArn": "arn:aws:ecs:us-east-1:REDACTED:task-definition/deREDACTED:REDACTED",
"overrides": {
"containerOverrides": [{
"name": "app",
"command": [
"REDACTED"
],
"environment": [
]
}]
},
"lastStatus": "STOPPED",
"desiredStatus": "STOPPED",
"cpu": "256",
"memory": "512",
"containers": [{
"containerArn": "arn:aws:ecs:us-east-1:REDACTED:container/REDACTED",
"taskArn": "arn:aws:ecs:us-east-1:REDACTED:task/REDACTED",
"name": "app",
"lastStatus": "STOPPED",
"networkInterfaces": [{
"attachmentId": "REDACTED",
"privateIpv4Address": "REDACTED"
}],
"healthStatus": "UNKNOWN",
"cpu": "0"
}],
"version": 4,
"stoppedReason": "Timeout waiting for network interface provisioning to complete.",
"connectivity": "CONNECTED",
"connectivityAt": 1573233270.396,
"createdAt": 1573233066.948,
"stoppingAt": 1573233252.398,
"stoppedAt": 1573233282.065,
"group": "family:REDACTED",
"launchType": "FARGATE",
"platformVersion": "1.3.0",
"attachments": [{
"id": "REDACTED",
"type": "ElasticNetworkInterface",
"status": "DELETED",
"details": [{
"name": "subnetId",
"value": "REDACTED"
},
{
"name": "networkInterfaceId",
"value": "REDACTED"
},
{
"name": "macAddress",
"value": "REDACTED"
},
{
"name": "privateIPv4Address",
"value": "REDACTED"
}
]
}],
"healthStatus": "UNKNOWN",
"tags": []
}],
"failures": []
}
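Note that in this output the container entry has no reason or exitCode at all, while the task-level stoppedReason is populated. The same data the CLI shows is what the runner sees through the DescribeTasks API; a rough aws-sdk-go sketch (placeholder cluster/task ARNs, not the runner's actual code):

package main

import (
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ecs"
)

func main() {
	svc := ecs.New(session.Must(session.NewSession()))
	out, err := svc.DescribeTasks(&ecs.DescribeTasksInput{
		Cluster: aws.String("REDACTED"),
		Tasks:   []*string{aws.String("REDACTED")},
	})
	if err != nil {
		log.Fatal(err)
	}
	for _, task := range out.Tasks {
		// The task-level reason is set even though the container never started.
		fmt.Println("task stoppedReason:", aws.StringValue(task.StoppedReason))
		for _, c := range task.Containers {
			// The container-level Reason (and ExitCode) can be nil in this scenario.
			fmt.Printf("container %s: reason=%q\n",
				aws.StringValue(c.Name), aws.StringValue(c.Reason))
		}
	}
}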
Not sure, but it's possible I have a fix - testing it now.
When waiting for the task to finish there is a built-in max-attempts mechanism (100 attempts by default) and a built-in delay between polls (6 seconds by default), so the waiter gives up after roughly 100 x 6s = 10 minutes. It seems any job that takes more than 10 minutes will get a "ResourceNotReady: exceeded wait attempts" error.
You can try changing svc.WaitUntilTasksStopped to
err = svc.WaitUntilTasksStoppedWithContext(
	ctx,
	&ecs.DescribeTasksInput{
		Cluster: aws.String(r.Cluster),
		Tasks:   taskARNs,
	},
	// Override the waiter defaults: a single attempt with a 1-second delay
	// makes the waiter give up almost immediately, which reproduces the
	// "ResourceNotReady: exceeded wait attempts" error on any real task.
	request.WithWaiterMaxAttempts(1),
	request.WithWaiterDelay(func(attempt int) time.Duration {
		return time.Second * 1
	}),
)
And it will instantly throw the error above.
So it seems the fix would be to expose "delay" and "max-attempts" as user-configurable parameters, and to use a longer default value.
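A minimal sketch of what exposing those options could look like, assuming a hypothetical waiterOptions helper fed by user-supplied values (illustrative only, not the code from the linked PRs):

package runner

import (
	"time"

	"github.com/aws/aws-sdk-go/aws/request"
)

// waiterOptions is a hypothetical helper (not actual ecs-run-task code).
// It converts user-supplied poll delay and max-attempts values into
// aws-sdk-go waiter options; zero values fall back to the SDK defaults
// (a 6-second delay and 100 attempts for the TasksStopped waiter).
func waiterOptions(delay time.Duration, maxAttempts int) []request.WaiterOption {
	var opts []request.WaiterOption
	if maxAttempts > 0 {
		opts = append(opts, request.WithWaiterMaxAttempts(maxAttempts))
	}
	if delay > 0 {
		opts = append(opts, request.WithWaiterDelay(request.ConstantWaiterDelay(delay)))
	}
	return opts
}

The resulting options would then be appended to the WaitUntilTasksStoppedWithContext call shown above, e.g. svc.WaitUntilTasksStoppedWithContext(ctx, input, waiterOptions(delay, attempts)...).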
@Eli-Goldberg have you tried out the PR I submitted? Does it fix your issue? https://github.com/buildkite/ecs-run-task/pull/27
/cc @pda
So it seems any job that takes more than 10 minutes will get a "ResourceNotReady: exceeded wait attempts".
I can confirm that -- seems like there's no way to run an ECS task that takes more than 10 mins :(
I have a tested fix; I'll open a PR today.
Hey @Eli-Goldberg did you ever land the fix for this issue? I seem to be hitting this issue reasonably frequently.
Yeah, sorry, I forgot to open a PR. I'll do that in a bit :)
Awesome! Thanks :)
I've opened a PR: https://github.com/buildkite/ecs-run-task/pull/35.