pulumi-awsx
Fargate service update failing with timeout
I am using the following code to deploy a container (the certificate and Route53 pieces are dropped to simplify):
const nlb = new awsx.elasticloadbalancingv2.NetworkLoadBalancer("some-nlb", { external: true });
const tg = nlb.createTargetGroup("some-tg", { port: 3000 });
const listener = tg.createListener("some-http-listener", { port: 80 });
const listener2 = tg.createListener("https-listener", {
    certificateArn: certificate.arn,
    loadBalancerArn: nlb.arn,
    port: 443,
    protocol: "TLS",
    sslPolicy: "ELBSecurityPolicy-2016-08",
});
let service = new awsx.ecs.FargateService("some-app", {
    desiredCount: 1,
    taskDefinitionArgs: {
        containers: {
            myapp: {
                image: awsx.ecs.Image.fromPath("app", "../"),
                memory: 512,
                portMappings: [listener],
            },
        },
    },
});
It deploys correctly, but if I try to run it again, it gets stuck updating the task definition and throws a timeout error:
├─ awsx:x:ecs:FargateTaskDefinition
+- │ └─ aws:ecs:TaskDefinition replaced [diff: ~containerDefinitions]
└─ awsx:x:ecs:FargateService
~ └─ aws:ecs:Service **updating failed** [diff: ~taskDefinition]; 1 error
error: Plan apply failed: 1 error occurred:
* updating urn:pulumi:dev::drjones::awsx:x:ecs:FargateService$aws:ecs/service:Service::service: timeout while waiting for state to become 'true' (last state: 'false', timeout: 10m0s)
It always crashes while updating the task definition. After this I have one extra running task, but it does not replace the existing one. The changes I made to the app code are successfully applied, so I can see them in the new container, but the script fails and I have two tasks instead of one.
After some time this second task is stopped on AWS, but on the next pulumi up the script tries to delete the task from the previous deployment.
It's a weird situation that affects consistency and makes it impossible to update correctly. In any case, it looks like it processes for up to 10 minutes and then fails.
Also ran into this with the following similar error message:
pulumi:pulumi:Stack (blah-api-qa):
error: update failed
aws:ecs:Service (qa-blah):
error: Plan apply failed: 1 error occurred:
* updating urn:pulumi:qa::blah-api::awsx:x:ecs:FargateService$aws:ecs/service:Service::qa-blah: timeout while waiting for state to become 'true' (last state: 'false', timeout: 10m0s)
error: post-step event returned an error: failed to save snapshot: [403] The provided update token has expired.
The next pulumi up run tries to delete the task definition from the previous deployment, and fails with the following:
error: the current deployment has 1 resource(s) with pending operations:
* urn:pulumi:qa::blah-api::awsx:x:ecs:FargateService$aws:ecs/service:Service::qa-blah, interrupted while updating
These resources are in an unknown state because the Pulumi CLI was interrupted while
waiting for changes to these resources to complete. You should confirm whether or not the
operations listed completed successfully by checking the state of the appropriate provider.
For example, if you are using AWS, you can confirm using the AWS Console.
Once you have confirmed the status of the interrupted operations, you can repair your stack
using 'pulumi stack export' to export your stack to a file. For each operation that succeeded,
remove that operation from the "pending_operations" section of the file. Once this is complete,
use 'pulumi stack import' to import the repaired stack.
I think this particular timeout occurs when the Fargate service cannot reach a steady state because the new task fails to start up. At least that was what was happening here, since the container was erroring on boot.
Looks to be similar to https://github.com/pulumi/pulumi-awsx/issues/300.
We also constantly run into such issues (in https://github.com/ever-co/gauzy-pulumi). Is there any way to increase the timeout?
I've been having a similar issue lately:
Updating (ef):
Type Name Status Info
pulumi:pulumi:Stack infra-ef **failed** 1 error
├─ awsx:x:ecs:FargateTaskDefinition app-svc2 1 warning
└─ awsx:x:ecs:FargateService app-svc2
~ └─ aws:ecs:Service app-svc2 **updating failed** 1 error
Diagnostics:
awsx:x:ecs:FargateTaskDefinition (app-svc2):
aws:ecs:Service (app-svc2):
error: Plan apply failed: 1 error occurred:
* updating urn:pulumi:ef::infra::awsx:x:ecs:FargateService$aws:ecs/service:Service::app-svc2: timeout while waiting for state to become 'true' (last state: 'false', timeout: 10m0s)
pulumi:pulumi:Stack (infra-ef):
error: update failed
Resources:
37 unchanged
Duration: 10m45s
Is there a solution to this issue? Running it again doesn't always work.
Still having this issue!
We likely should increase the timeout defaults here. The default settings for other properties make it reasonably common that updates to the Fargate Service take more than 10 minutes due both to Fargate launch times and defaults for draining the old service during the update. In fact, in my experience it is very common for this process to take just slightly over 10 minutes in many cases - which makes the current 10 minute default timeout particularly frustrating.
This definitely needs to be addressed. I am getting this failure 4 out of 5 times. Every time, the cluster does actually reach the correct state; only Pulumi fails. I even tried adding customTimeouts: { update: '20m' } to the resource options, but it has no effect.
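For context, this is roughly how I attached it (resource names are placeholders, and since FargateService is a component resource the option may simply not reach the underlying aws:ecs:Service):
import * as awsx from "@pulumi/awsx";

const service = new awsx.ecs.FargateService("some-app", {
    desiredCount: 1,
    taskDefinitionArgs: {
        containers: {
            myapp: {
                image: awsx.ecs.Image.fromPath("app", "../"),
                memory: 512,
            },
        },
    },
}, {
    // Resource option passed as the third constructor argument.
    customTimeouts: { update: "20m" },
});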
The simplest workaround right now is to pass waitForSteadyState: false to the FargateService. This will tell Pulumi not to wait for ECS to mark the Service as "ready" in the newly deployed state. That does mean that other operations will continue even before the new service is serving traffic, but it will avoid hitting this timeout.
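Against the original snippet from the top of this issue, a minimal sketch of the workaround would look roughly like this (same resource names and listener as that example):
import * as awsx from "@pulumi/awsx";

// Same load balancer/listener shape as the original snippet in this issue.
const nlb = new awsx.elasticloadbalancingv2.NetworkLoadBalancer("some-nlb", { external: true });
const tg = nlb.createTargetGroup("some-tg", { port: 3000 });
const listener = tg.createListener("some-http-listener", { port: 80 });

const service = new awsx.ecs.FargateService("some-app", {
    desiredCount: 1,
    // Don't block the Pulumi update on ECS reporting a steady state.
    waitForSteadyState: false,
    taskDefinitionArgs: {
        containers: {
            myapp: {
                image: awsx.ecs.Image.fromPath("app", "../"),
                memory: 512,
                portMappings: [listener],
            },
        },
    },
});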
The 10m timeout here is currently hardcoded, and will need to be fixed upstream.
Unfortunately, it is very common for FargateServices backed by a load balancer to take just slightly over 10 minutes to update, due to the default health checks and draining for NLB + Fargate.
For example, I reproduced the issue with this timeline of events from ECS, showing that the update did actually complete in 10m47s even though the Pulumi deployment timed out at 10m. Note that there are two almost exactly 5-minute delays: one after starting the new task, while the NLB is waiting to mark the task as healthy, and another after the new task is ready, while the old task is allowed to drain.
2020-06-25 17:41:09 -0700 service some-app-fef01c4 has reached a steady state.
2020-06-25 17:40:51 -0700 service some-app-fef01c4 has stopped 1 running tasks: task 5d2e29df-80d7-4456-ac79-59d134cd9c90.
2020-06-25 17:35:46 -0700 service some-app-fef01c4 has begun draining connections on 1 tasks.
2020-06-25 17:35:46 -0700 service some-app-fef01c4 deregistered 1 targets in target-group some-tg-94819bd
2020-06-25 17:31:07 -0700 service some-app-fef01c4 registered 1 targets in target-group some-tg-94819bd
2020-06-25 17:30:22 -0700 service some-app-fef01c4 has started 1 tasks: task be8bcdec-e2d6-4fb4-a0d2-9b4d9ab80316.
Because of these, other solutions are to pass deregistrationDelay: 0 on the Target (or a number smaller than the default 300), and/or to set healthCheck on the Target and healthCheckGracePeriodSeconds on the FargateService so it can register as healthy more quickly (if possible for your container). These are more complex, but have the benefit of actually speeding up the update. A rough sketch follows.
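Here is an illustrative sketch of those options applied to the original example (the values are placeholders to tune for your container, and it assumes the awsx target-group healthCheck and the FargateService healthCheckGracePeriodSeconds pass through to aws.lb.TargetGroup and aws.ecs.Service respectively):
import * as awsx from "@pulumi/awsx";

const nlb = new awsx.elasticloadbalancingv2.NetworkLoadBalancer("some-nlb", { external: true });

// Shorter drain and a tighter health check on the target group. NLB health checks
// only allow certain values (e.g. an interval of 10 or 30, and equal thresholds).
const tg = nlb.createTargetGroup("some-tg", {
    port: 3000,
    deregistrationDelay: 0,   // default is 300 seconds of connection draining
    healthCheck: {
        interval: 10,
        healthyThreshold: 2,
        unhealthyThreshold: 2,
    },
});
const listener = tg.createListener("some-http-listener", { port: 80 });

const service = new awsx.ecs.FargateService("some-app", {
    desiredCount: 1,
    // Give the container a short startup grace period before health checks count.
    healthCheckGracePeriodSeconds: 30,
    taskDefinitionArgs: {
        containers: {
            myapp: {
                image: awsx.ecs.Image.fromPath("app", "../"),
                memory: 512,
                portMappings: [listener],
            },
        },
    },
});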
This should be fixed with https://github.com/pulumi/terraform-provider-aws/pull/59, which was released as part of @pulumi/aws 2.11.0. Upgrading to that version should avoid hitting any timeouts here under normal conditions.
@lukehoban just a heads up: I upgraded, and I am still getting the error. The only difference is that the timeout duration is longer:
timeout while waiting for state to become 'true' (last state: 'false', timeout: 15m0s)
Here is my resource def:
new awsx.ecs.FargateService(
    `backend`,
    {
        cluster: cluster,
        securityGroups: [serviceSG.id],
        taskDefinitionArgs: {
            taskRole: apiRole,
            containers: {
                api: {
                    image: image,
                    memory: 1024,
                    portMappings: [listener],
                    command: ['yarn', 'start:prod'],
                    environment: [
                        {
                            name: 'APP_ENV',
                            value: mapAppEnv(ENV),
                        },
                    ],
                },
            },
        },
        desiredCount: 1,
    },
    { dependsOn: [serviceSG, runMigrations] },
)
@duro I assume this is a case where the Fargate Service does ultimately succeed in its update (even though Pulumi times out)? Could you share a log from your ECS console of the events that happened during the update? Greater than 15 minutes for an ECS Service update is surprising under any conditions I'm aware of, so I would love to better understand it so we can identify the right default here.
@lukehoban I'm a little unclear on which logs would be most valuable to get the right picture.
which logs would be most valuable to get the right picture
From the ECS console, on the Service page, the "Events" tab. Just the section of logs from when the Service started updating to when it reached a new "steady state" after the update. This should be logs that look like those copied into https://github.com/pulumi/pulumi-awsx/issues/354#issuecomment-649981532.
@lukehoban so now I'm having a hard time replicating this. Deploys seem to be completing inside the 15 minutes. I will post back here once (if) I see the issue again.
@lukehoban Ok, it's returned as an issue. Here are the logs from the Fargate service events tab:
2020-06-30 12:54:52 -0700 service backend-4095c50 has reached a steady state.
2020-06-30 12:54:30 -0700 service backend-4095c50 has stopped 1 running tasks: task b482b992-b073-4575-a504-941a5fc95b73.
2020-06-30 12:52:27 -0700 service backend-4095c50 has begun draining connections on 1 tasks.
2020-06-30 12:52:27 -0700 service backend-4095c50 deregistered 1 targets in target-group slynx-api-50ed6e4
2020-06-30 12:51:44 -0700 service backend-4095c50 registered 1 targets in target-group slynx-api-50ed6e4
2020-06-30 12:49:55 -0700 service backend-4095c50 has started 1 tasks: task 7784ffbd-007a-40e7-92ea-36ef074c1ba8.
The odd thing I'm seeing is that the time from start to steady state does seem to be less than 10 minutes.
@lukehoban have you had a chance to look this over? Is there anything else I can send you to help investigate? I am still getting this error consistently.
@duro Sorry for the delay. The snippet you shared seems to show a case where the service fully completed the update in under 5 minutes. Are you sure that was an update that timed out after 15 minutes in Pulumi?
We can definitely consider bumping this higher than 15 minutes by default, but notably there is a user experience tradeoff where this may mean that certain failures for the service to become healthy take a longer time to cause the Service update to fail. So I'd love to better understand what leads to these cases where the update takes more than 15 minutes. If you have any more details you can share that would be helpful.
@lukehoban I am certain that in the case I sent you those were the logs from a run which failed due to timeout.
That said, 4 out of the last 5 runs have succeeded, but we still occasionally get these timeout related errors.
Is there anything in Pulumi that could be factoring dependencies of the service into this time?
For example, our Fargate service has a dependency on a custom resource we created that runs DB migrations via a Fargate "task". Could it be that the time that task takes is also being counted by Pulumi as part of the 15 minutes?
@lukehoban I believe I'm encountering this same issue; however, for me it was triggered by adding a taskRole to my taskDefinitionArgs.
this.fargateService = new awsx.ecs.FargateService(`${name}-app`, {
    cluster: this.cluster,
    assignPublicIp: false,
    subnets: args.appSubnetIds,
    securityGroups: appSecGroup,
    desiredCount: (args.appResources || {}).desiredCount,
    taskDefinitionArgs: {
        taskRole: args.taskRole, // <-- This for me is the culprit.
        containers: {
            [name]: {
                ...args.appResources, // cpu, memory, etc.
                image: args.appImage,
                portMappings: [target],
                environment: args.environment,
            },
        },
    },
}, { parent: this.cluster });
If this is access-related, is there a preferred method for extending the existing role that Pulumi assigns by default? In my particular instance I've attempted to assign a custom role to allow SES access.
I managed to resolve my issue. I had the wrong service associated with the taskRole policy: it was set to ecs.amazonaws.com instead of ecs-tasks.amazonaws.com.
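For anyone hitting the same thing, this is roughly what the corrected role looks like (the resource names and the SES policy are illustrative, not my exact config):
import * as aws from "@pulumi/aws";

// The trust policy must name ecs-tasks.amazonaws.com (the ECS tasks service),
// not ecs.amazonaws.com, or the task can never assume the role.
const taskRole = new aws.iam.Role("app-task-role", {
    assumeRolePolicy: JSON.stringify({
        Version: "2012-10-17",
        Statement: [{
            Effect: "Allow",
            Principal: { Service: "ecs-tasks.amazonaws.com" },
            Action: "sts:AssumeRole",
        }],
    }),
});

// Illustrative only: the extra permissions needed (SES sending).
new aws.iam.RolePolicy("app-task-role-ses", {
    role: taskRole.id,
    policy: JSON.stringify({
        Version: "2012-10-17",
        Statement: [{
            Effect: "Allow",
            Action: ["ses:SendEmail", "ses:SendRawEmail"],
            Resource: "*",
        }],
    }),
});

// The role is then passed as taskDefinitionArgs.taskRole, as in the snippet above.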
Hey. I'm seeing very similar behaviour in my Pulumi task runner from Azure DevOps: the ECS task takes more than 10 minutes to reach a steady state, but Pulumi times out after 10 minutes.
ECS Events (~11 mins)
2020-12-14 16:36:44 +0100 service app-service has reached a steady state.
2020-12-14 16:36:44 +0100 service app-service (deployment ecs-svc/) deployment completed.
2020-12-14 16:36:25 +0100 service app-service has stopped 1 running tasks: task 111
2020-12-14 16:31:22 +0100 service app-service has begun draining connections on 1 tasks.
2020-12-14 16:31:22 +0100 service app-service deregistered 1 targets in target-group engine80-286ce9e
2020-12-14 16:26:49 +0100 service app-service registered 1 targets in target-group engine80-286ce9e
2020-12-14 16:25:39 +0100 service app-service has started 1 tasks: task 222
Azure DevOps Pulumi version:
Starting: Run pulumi Service
==============================================================================
Task : Pulumi Azure Pipelines Task
Description : Azure Pipelines task extension for running Pulumi apps.
Version : 1.0.7
Author : Pulumi
Help : Join us on Slack at https://slack.pulumi.com.
==============================================================================
Downloading: https://get.pulumi.com/releases/sdk/pulumi-v2.15.6-linux-x64.tar.gz
Extracting archive
/bin/tar xzC /home/vsts/work/_temp/123 -f /home/vsts/work/_temp/123
Prepending PATH environment variable with directory: /home/vsts/work/_temp/123/pulumi
/home/vsts/work/_temp/123/pulumi/pulumi version
v2.15.6
Azure DevOps outcome:
aws:ecs:Service (app-service):
error: 1 error occurred:
* updating urn:pulumi:dev::stack-id::awsx:x:ecs:FargateService$aws:ecs/service:Service::app-service: 1 error occurred:
* error waiting for service (arn:arn/app-service) to reach a steady state: ResourceNotReady: exceeded wait attempts
pulumi:pulumi:Stack (stack-id):
error: update failed
Resources:
+ 2 created
~ 1 updated
+-2 replaced
5 changes. 59 unchanged
Duration: 10m15s
Pulumi Node.js package versions:
"@pulumi/aws": "3.19.2",
"@pulumi/awsx": "0.22.0",
"@pulumi/pulumi": "2.15.4",
Azure Devops pipelines using pulumi v2.15.4 were working fine last week.
Azure Devops pipelines using pulumi v2.15.4 were working fine last week.
Just to make sure - was there something that changed?
Duration: 10m15s
This definitely doesn't look to be expected - I will take another look into this.
Hey. Yes, the Pulumi version has changed: last week the pipelines were using v2.15.4, this week they are using v2.15.6. But I forced the Azure DevOps pipeline to use v2.15.4 and it is still timing out.
Seeing the same issue with this error:
error waiting for service to reach a steady state: ResourceNotReady: exceeded wait attempts
@lukehoban I'm on Pulumi v2.23.2 and getting this error just after 10m with a normal pulumi/aws (not awsx) ECS deploy (non-Fargate). Was the timeout change to 15m only for awsx?
* error waiting for service (...) to reach a steady state: ResourceNotReady: exceeded wait attempts
Duration: 10m4s
@austinbutler This has actually changed upstream. Previously, there was a custom waiter and now it relies on the AWS SDK, which hardcodes a 10 minute timeout (see https://github.com/hashicorp/terraform-provider-aws/pull/3485).
Thanks @leezen. Are there any plans on Pulumi's end to support a custom waiter?
I think ideally we'd prefer a way to have this work cleanly instead of reverting the upstream waiter logic. I've filed https://github.com/aws/aws-sdk-go/issues/3844 to help with that. Separately, I think we could consider updating some of the defaults in awsx itself to make this less likely to happen.
I totally missed that there are WithContext methods, so I think we may want to consider using those in order to update the timeout.
Hey, I came here to report that this behavior still exists.
We are just trying to adopt Pulumi and noticed that our ECS service constantly "breaks" our pipeline by timing out after 45 minutes.
This happens on the creation of a new ECS service:
+ aws:ecs:Service test-service-pulumi creating (2714s) error: 1 error occurred:
+ aws:ecs:Service test-service-pulumi **creating failed** error: 1 error occurred:
@ updating....
pulumi:pulumi:Stack pulumi-test-service-test-service-qa running error: update failed
pulumi:pulumi:Stack pulumi-test-service-test-service-qa **failed** 1 error; 1 warning
Diagnostics:
aws:ecs:Service (test-service-pulumi):
error: 1 error occurred:
* creating urn:pulumi:test-service-qa::pulumi-test-service::awsx:ecs:FargateService$aws:ecs/service:Service::test-service-pulumi: 1 error occurred:
* error waiting for ECS service (arn:aws:ecs:eu-west-1:111222333444:service/qa-ecs-cluster/test-service-pulumi-f5a0d92) to reach steady state after creation: ResourceNotReady: exceeded wait attempts
Pulumi about:
Dependencies:
NAME VERSION
@pulumi/awsx 1.0.1
@pulumi/pulumi 3.50.2
@types/node 14.18.35
@pulumi/aws 5.24.0
We will most likely apply the waitForSteadyState: false workaround.