
Fargate service updating failing with timeout

Open Ilchuk-Mihail opened this issue 5 years ago • 31 comments

I am using the following code to deploy a container (the certificate and Route 53 parts are dropped to simplify):

const nlb = new awsx.elasticloadbalancingv2.NetworkLoadBalancer("some-nlb", { external: true });
const tg = nlb.createTargetGroup("some-tg", { port: 3000 });
const listener = tg.createListener("some-http-listener", { port: 80 });
const listener2 = tg.createListener("https-listener", {
    certificateArn: certificate.arn,
    loadBalancerArn: nlb.arn,
    port: 443,
    protocol: "TLS",
    sslPolicy: "ELBSecurityPolicy-2016-08",
});

let service = new awsx.ecs.FargateService("some-app", {
    desiredCount: 1,
    taskDefinitionArgs: {
        containers: {
            myapp: {
                image: awsx.ecs.Image.fromPath("app", "../"),
                memory: 512,
                portMappings: [listener],
            },
        },
    },
});

It deploys correctly, but if I try to run it again, it gets stuck on updating the task definition and throws a timeout error:

    ├─ awsx:x:ecs:FargateTaskDefinition  
+-  │  └─ aws:ecs:TaskDefinition         replaced                [diff: ~containerDefinitions]
    └─ awsx:x:ecs:FargateService         
~      └─ aws:ecs:Service                **updating failed**     [diff: ~taskDefinition]; 1 error


error: Plan apply failed: 1 error occurred:
       * updating urn:pulumi:dev::drjones::awsx:x:ecs:FargateService$aws:ecs/service:Service::service: timeout while waiting for state to become 'true' (last state: 'false', timeout: 10m0s)

It always fails while updating the task definition. Afterwards I have one extra running task rather than the existing task being replaced. The changes I made in the app code are actually applied, so I can see them in the container, but the script fails and I end up with two tasks instead of one.

After some time this second task is stopped on AWS, but on the next pulumi up run the script tries to delete the task from the previous deployment.

It's a weird situation that affects consistency and makes it impossible to update correctly. In any case, it looks like something processes for up to 10 minutes and then fails.

Ilchuk-Mihail avatar Jun 26 '19 22:06 Ilchuk-Mihail

Also ran into this with the following similar error message:

pulumi:pulumi:Stack (blah-api-qa):
    error: update failed

  aws:ecs:Service (qa-blah):
    error: Plan apply failed: 1 error occurred:
        * updating urn:pulumi:qa::blah-api::awsx:x:ecs:FargateService$aws:ecs/service:Service::qa-blah: timeout while waiting for state to become 'true' (last state: 'false', timeout: 10m0s)
    error: post-step event returned an error: failed to save snapshot: [403] The provided update token has expired.

The next pulumi up run tries to delete the task definition from the previous deployment and fails with the following:

error: the current deployment has 1 resource(s) with pending operations:
  * urn:pulumi:qa::blah-api::awsx:x:ecs:FargateService$aws:ecs/service:Service::qa-blah, interrupted while updating

These resources are in an unknown state because the Pulumi CLI was interrupted while
waiting for changes to these resources to complete. You should confirm whether or not the
operations listed completed successfully by checking the state of the appropriate provider.
For example, if you are using AWS, you can confirm using the AWS Console.

Once you have confirmed the status of the interrupted operations, you can repair your stack
using 'pulumi stack export' to export your stack to a file. For each operation that succeeded,
remove that operation from the "pending_operations" section of the file. Once this is complete,
use 'pulumi stack import' to import the repaired stack.

Blitz2145 avatar Aug 26 '19 18:08 Blitz2145

I think this particular timeout occurs when the Fargate service cannot reach a steady state because the new task fails to start up. At least that was what happened here, since the container was erroring on boot.

Blitz2145 avatar Aug 26 '19 19:08 Blitz2145

Looks to be similar to https://github.com/pulumi/pulumi-awsx/issues/300.

We also constantly run into such issues (in https://github.com/ever-co/gauzy-pulumi), and I wonder if there is any way to increase the timeout?

evereq avatar Sep 22 '19 10:09 evereq

Been having a similar issue lately:

Updating (ef):

     Type                                 Name      Status                  Info
     pulumi:pulumi:Stack                  infra-ef  **failed**              1 error
     ├─ awsx:x:ecs:FargateTaskDefinition  app-svc2                          1 warning
     └─ awsx:x:ecs:FargateService         app-svc2
 ~      └─ aws:ecs:Service                app-svc2  **updating failed**     1 error

Diagnostics:
  awsx:x:ecs:FargateTaskDefinition (app-svc2):

  aws:ecs:Service (app-svc2):
    error: Plan apply failed: 1 error occurred:
        * updating urn:pulumi:ef::infra::awsx:x:ecs:FargateService$aws:ecs/service:Service::app-svc2: timeout while waiting for state to become 'true' (last state: 'false', timeout: 10m0s)

  pulumi:pulumi:Stack (infra-ef):
    error: update failed

Resources:
    37 unchanged

Duration: 10m45s

Is there a solution to this issue? Running again doesn't always work.

kingwill101 avatar Nov 15 '19 14:11 kingwill101

Still having this issue!

nachomazzara avatar Jun 13 '20 14:06 nachomazzara

We likely should increase the timeout defaults here. The default settings for other properties make it reasonably common that updates to the Fargate Service take more than 10 minutes due both to Fargate launch times and defaults for draining the old service during the update. In fact, in my experience it is very common for this process to take just slightly over 10 minutes in many cases - which makes the current 10 minute default timeout particularly frustrating.

lukehoban avatar Jun 25 '20 03:06 lukehoban

This definitely needs to be addressed. I am getting this failure 4 out of 5 times. Every time, the cluster does actually reach the correct state; only Pulumi fails. I even tried adding this to the resource options:

customTimeouts: { update: '20m' }

But it has no effect.
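For reference, customTimeouts is a Pulumi resource option (the third constructor argument), not part of the service args, and it only applies to the resource it is set on. Since awsx.ecs.FargateService is a component resource, the option does not reach the child aws:ecs:Service, which may be one reason it appears to have no effect; the provider-side wait discussed below was also hardcoded at the time. A minimal sketch of where the option is normally passed:

// customTimeouts is a resource option (third argument), not a service arg.
const service = new awsx.ecs.FargateService(
    "some-app",
    { /* ...service args... */ },
    { customTimeouts: { update: "20m" } },
);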

duro avatar Jun 25 '20 19:06 duro

The simplest workaround right now is to pass waitForSteadyState: false to the FargateService. This tells Pulumi not to wait for ECS to mark the Service as "ready" in the newly deployed state. That does mean other operations will continue even before the new service is serving traffic, but it avoids hitting this timeout.
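
A minimal sketch of that workaround, reusing the names from the snippet in the original report (illustrative, not a full program):

let service = new awsx.ecs.FargateService("some-app", {
    desiredCount: 1,
    // Don't block the deployment on ECS reporting a steady state for the service.
    waitForSteadyState: false,
    taskDefinitionArgs: {
        containers: {
            myapp: {
                image: awsx.ecs.Image.fromPath("app", "../"),
                memory: 512,
                portMappings: [listener],
            },
        },
    },
});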

The 10m timeout here is hardcoded currently, and will need to be fixed upstream.

Unfortunately, it is very common for FargateServices backed by a load balancer to take just slightly over 10 minutes to update, due to the default health checks and draining for NLB + Fargate.

For example, I reproed the issue with this timing of events from ECS - showing the update did actually complete in 10m47s even though the Pulumi deployment timed out at 10m. Note that there are two almost exactly 5m delays - one after starting the new task where NLB is waiting to mark the task as healthy, and another after the new task is ready where the old task is allowed to drain.

2020-06-25 17:41:09 -0700 service some-app-fef01c4 has reached a steady state.
2020-06-25 17:40:51 -0700 service some-app-fef01c4 has stopped 1 running tasks: task 5d2e29df-80d7-4456-ac79-59d134cd9c90.
2020-06-25 17:35:46 -0700 service some-app-fef01c4 has begun draining connections on 1 tasks.
2020-06-25 17:35:46 -0700 service some-app-fef01c4 deregistered 1 targets in target-group some-tg-94819bd
2020-06-25 17:31:07 -0700 service some-app-fef01c4 registered 1 targets in target-group some-tg-94819bd
2020-06-25 17:30:22 -0700 service some-app-fef01c4 has started 1 tasks: task be8bcdec-e2d6-4fb4-a0d2-9b4d9ab80316.

Because of these, other solutions are to pass deregistrationDelay: 0 on the Target (or a number smaller than the default 300), and/or to set healthCheck on the Target and healthCheckGracePeriodSeconds on the FargateService to allow it to register as healthy more quickly (if possible for your container). These are more complex, but have the benefit of actually speeding up the update.
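
A rough sketch of those settings, again reusing the target group and service from the original report. The specific health check numbers and grace period are assumptions; the right values depend on how quickly your container can respond:

const nlb = new awsx.elasticloadbalancingv2.NetworkLoadBalancer("some-nlb", { external: true });
const tg = nlb.createTargetGroup("some-tg", {
    port: 3000,
    // Skip the default 300s connection-draining delay for old tasks.
    deregistrationDelay: 0,
    // Tighter health checks so new tasks are marked healthy sooner
    // (NLB health checks only accept a limited set of values).
    healthCheck: {
        interval: 10,
        healthyThreshold: 2,
        unhealthyThreshold: 2,
    },
});
const listener = tg.createListener("some-http-listener", { port: 80 });

let service = new awsx.ecs.FargateService("some-app", {
    desiredCount: 1,
    // Give the container time to start before failed health checks count against it.
    healthCheckGracePeriodSeconds: 60,
    taskDefinitionArgs: {
        containers: {
            myapp: {
                image: awsx.ecs.Image.fromPath("app", "../"),
                memory: 512,
                portMappings: [listener],
            },
        },
    },
});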

lukehoban avatar Jun 26 '20 05:06 lukehoban

This should be fixed with https://github.com/pulumi/terraform-provider-aws/pull/59 which was released as part of @pulumi/aws 2.11.0. Upgrading to that version should avoid hitting any timeouts here under normal conditions.

lukehoban avatar Jun 26 '20 23:06 lukehoban

@lukehoban just a heads up: I upgraded and I am still getting the error. The only difference is that the timeout duration is longer:

timeout while waiting for state to become 'true' (last state: 'false', timeout: 15m0s)

Here is my resource def:

new awsx.ecs.FargateService(
  `backend`,
  {
    cluster: cluster,
    securityGroups: [serviceSG.id],
    taskDefinitionArgs: {
      taskRole: apiRole,
      containers: {
        api: {
          image: image,
          memory: 1024,
          portMappings: [listener],
          command: ['yarn', 'start:prod'],
          environment: [
            {
              name: 'APP_ENV',
              value: mapAppEnv(ENV),
            },
          ],
        },
      },
    },
    desiredCount: 1,
  },
  { dependsOn: [serviceSG, runMigrations] },
)

duro avatar Jun 28 '20 21:06 duro

@duro I assume this is a case where the Fargate Service does ultimately succeed in its update (even though Pulumi times out)? Could you share a log from your ECS console of the events that happened during the update? More than 15 minutes for an ECS Service update is surprising under any conditions I'm aware of, so I would love to better understand what is happening so we can identify the right default here.

lukehoban avatar Jun 28 '20 21:06 lukehoban

@lukehoban I'm a little unclear on which logs would be most valuable to get the right picture.

duro avatar Jun 28 '20 21:06 duro

which logs would be most valuable to get the right picture

From the ECS console, on the Service page, the "Events" tab. Just the section of logs from when the Service started updating to when it reached a new "steady state" after the update. This should be logs that look like those copied into https://github.com/pulumi/pulumi-awsx/issues/354#issuecomment-649981532.

[Screenshot of the ECS Service "Events" tab]

lukehoban avatar Jun 28 '20 22:06 lukehoban

@lukehoban so now I'm having a hard time replicating this. Deploys seem to be completing inside the 15 minutes. I will post back here once (if) I see the issue again.

duro avatar Jun 29 '20 21:06 duro

@lukehoban Ok, it's returned as an issue. Here are the logs from the Fargate service events tab:

2020-06-30 12:54:52 -0700 service backend-4095c50 has reached a steady state.
2020-06-30 12:54:30 -0700 service backend-4095c50 has stopped 1 running tasks: task b482b992-b073-4575-a504-941a5fc95b73.
2020-06-30 12:52:27 -0700 service backend-4095c50 has begun draining connections on 1 tasks.
2020-06-30 12:52:27 -0700 service backend-4095c50 deregistered 1 targets in target-group slynx-api-50ed6e4
2020-06-30 12:51:44 -0700 service backend-4095c50 registered 1 targets in target-group slynx-api-50ed6e4
2020-06-30 12:49:55 -0700 service backend-4095c50 has started 1 tasks: task 7784ffbd-007a-40e7-92ea-36ef074c1ba8.

The odd thing I'm seeing is that the time from task start to steady state does seem to be less than 10 minutes.

duro avatar Jul 01 '20 01:07 duro

@lukehoban have you had a chance to look this over? Is there anything else I can send you to help investigate? I am still getting this error consistently.

duro avatar Jul 07 '20 19:07 duro

@duro Sorry for the delay. The snippet you shared seems to show a case where the service fully completed the update in under 5 minutes. Are you sure that was an update that timed out after 15 minutes in Pulumi?

We can definitely consider bumping this higher than 15 minutes by default, but notably there is a user experience tradeoff where this may mean that certain failures for the service to become healthy take a longer time to cause the Service update to fail. So I'd love to better understand what leads to these cases where the update takes more than 15 minutes. If you have any more details you can share that would be helpful.

lukehoban avatar Jul 24 '20 00:07 lukehoban

@lukehoban I am certain that the logs I sent you were from a run that failed due to the timeout.

That said, 4 out of the last 5 runs have succeeded, but we still occasionally get these timeout related errors.

Is there anything in Pulumi that could be factoring dependencies of the service into this time?

For example, our Fargate service has a dependency on a custom resource we created that runs DB migrations via a Fargate "task". Could it be that the time that task takes is also being counted by Pulumi as part of the 15 minutes?

duro avatar Jul 24 '20 19:07 duro

@lukehoban I believe I'm encountering this same issue; however, for me it was triggered by adding a taskRole to my taskDefinitionArgs.

this.fargateService = new awsx.ecs.FargateService(`${name}-app`, {
    cluster: this.cluster,
    assignPublicIp: false,
    subnets: args.appSubnetIds,
    securityGroups: appSecGroup,
    desiredCount: (args.appResources || {}).desiredCount,

    taskDefinitionArgs: {
        taskRole: args.taskRole, // <-- This for me is the culprit.
        containers: {
            [name]: {
                ...args.appResources, // cpu, memory, etc.
                image: args.appImage,
                portMappings: [target],
                environment: args.environment,
            },
        },
    },
}, { parent: this.cluster });

If this is access-related, is there a preferred method for extending the existing role that Pulumi assigns by default? In my particular instance I've attempted to assign a custom role to allow SES access.

jgrocott avatar Aug 26 '20 11:08 jgrocott

I managed to resolve my issue. I had the wrong service principal associated with the taskRole's policy: it was set to ecs.amazonaws.com instead of ecs-tasks.amazonaws.com.
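
For anyone who hits the same thing, a minimal sketch of a task role whose trust policy uses the ecs-tasks.amazonaws.com principal. The role name and the SES policy attachment are only illustrative:

import * as aws from "@pulumi/aws";

// Task roles must be assumable by ecs-tasks.amazonaws.com, not ecs.amazonaws.com.
const taskRole = new aws.iam.Role("app-task-role", {
    assumeRolePolicy: JSON.stringify({
        Version: "2012-10-17",
        Statement: [{
            Effect: "Allow",
            Principal: { Service: "ecs-tasks.amazonaws.com" },
            Action: "sts:AssumeRole",
        }],
    }),
});

// Example: grant the SES access mentioned above via a managed policy.
new aws.iam.RolePolicyAttachment("app-task-role-ses", {
    role: taskRole.name,
    policyArn: "arn:aws:iam::aws:policy/AmazonSESFullAccess",
});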

jgrocott avatar Aug 26 '20 12:08 jgrocott

Hey. I'm seeing very similar behaviour in my Pulumi task runner from Azure DevOps.

The ECS task takes more than 10 minutes to reach a steady state, but Pulumi is timing out after 10 minutes.

ECS Events (~11 mins)

2020-12-14 16:36:44 +0100 service app-service has reached a steady state.
2020-12-14 16:36:44 +0100 service app-service (deployment ecs-svc/) deployment completed.
2020-12-14 16:36:25 +0100 service app-service has stopped 1 running tasks: task 111
2020-12-14 16:31:22 +0100 service app-service has begun draining connections on 1 tasks.
2020-12-14 16:31:22 +0100 service app-service deregistered 1 targets in target-group engine80-286ce9e
2020-12-14 16:26:49 +0100 service app-service registered 1 targets in target-group engine80-286ce9e
2020-12-14 16:25:39 +0100 service app-service has started 1 tasks: task 222

Azure DevOps Pulumi version:

Starting: Run pulumi  Service
==============================================================================
Task         : Pulumi Azure Pipelines Task
Description  : Azure Pipelines task extension for running Pulumi apps.
Version      : 1.0.7
Author       : Pulumi
Help         : Join us on Slack at https://slack.pulumi.com.
==============================================================================
Downloading: https://get.pulumi.com/releases/sdk/pulumi-v2.15.6-linux-x64.tar.gz
Extracting archive
/bin/tar xzC /home/vsts/work/_temp/123 -f /home/vsts/work/_temp/123

Prepending PATH environment variable with directory: /home/vsts/work/_temp/123/pulumi
/home/vsts/work/_temp/123/pulumi/pulumi version
v2.15.6

Azure DevOps outcome:

  aws:ecs:Service (app-service):
    error: 1 error occurred:
    	* updating urn:pulumi:dev::stack-id::awsx:x:ecs:FargateService$aws:ecs/service:Service::app-service: 1 error occurred:
    	* error waiting for service (arn:arn/app-service) to reach a steady state: ResourceNotReady: exceeded wait attempts
 
  pulumi:pulumi:Stack (stack-id):
    error: update failed
 
Resources:
    + 2 created
    ~ 1 updated
    +-2 replaced
    5 changes. 59 unchanged

Duration: 10m15s

Pulumi Node.js package versions:

    "@pulumi/aws": "3.19.2",
    "@pulumi/awsx": "0.22.0",
    "@pulumi/pulumi": "2.15.4",

Azure DevOps pipelines using Pulumi v2.15.4 were working fine last week.

jacbar avatar Dec 14 '20 16:12 jacbar

Azure DevOps pipelines using Pulumi v2.15.4 were working fine last week.

Just to make sure: was there something that changed?

Duration: 10m15s

This definitely doesn't look to be expected - I will take another look into this.

lukehoban avatar Dec 14 '20 17:12 lukehoban

Hey. Yes, the Pulumi version changed: last week the pipelines were using v2.15.4, and this week they are using v2.15.6. But I also forced the Azure DevOps pipeline to use v2.15.4 and it still times out.

jacbar avatar Dec 14 '20 17:12 jacbar

Seeing the same issue with this error:

 error waiting for service to reach a steady state: ResourceNotReady: exceeded wait attempts

volkanunsal avatar Jan 16 '21 22:01 volkanunsal

@lukehoban I'm on Pulumi v2.23.2 and getting this error just after 10m with a normal pulumi/aws (not awsx) ECS deploy (non-fargate). Was the timeout change to 15m only for awsx?

* error waiting for service (...) to reach a steady state: ResourceNotReady: exceeded wait attempts

Duration: 10m4s

austinbutler avatar Mar 30 '21 01:03 austinbutler

@austinbutler This has actually changed upstream. Previously, there was a custom waiter and now it relies on the AWS SDK, which hardcodes a 10 minute timeout (see https://github.com/hashicorp/terraform-provider-aws/pull/3485).

leezen avatar Mar 30 '21 16:03 leezen

Thanks @leezen. Are there any plans on Pulumi's end to support a custom waiter?

austinbutler avatar Mar 30 '21 17:03 austinbutler

I think ideally, we'd prefer a way to have this work cleanly instead of reverting the upstream waiter logic. I've filed https://github.com/aws/aws-sdk-go/issues/3844 to help with that. Separately, I think we could consider updating some of the defaults in awsx itself to make this less likely to happen.

leezen avatar Mar 31 '21 02:03 leezen

I totally missed that there are WithContext methods, so I think we may want to consider using those in order to update the timeout.

leezen avatar Apr 01 '21 17:04 leezen

Hey, I came here to report that this behavior still exists.

We are just starting to implement Pulumi and noticed that our ECS service consistently "breaks" our pipeline by timing out after 45 minutes.

This happens on creation of a new ECS service:

 +  aws:ecs:Service test-service-pulumi creating (2714s) error: 1 error occurred:
 +  aws:ecs:Service test-service-pulumi **creating failed** error: 1 error occurred:
@ updating....
    pulumi:pulumi:Stack pulumi-test-service-test-service-qa running error: update failed
    pulumi:pulumi:Stack pulumi-test-service-test-service-qa **failed** 1 error; 1 warning

Diagnostics:
  aws:ecs:Service (test-service-pulumi):
    error: 1 error occurred:
    	* creating urn:pulumi:test-service-qa::pulumi-test-service::awsx:ecs:FargateService$aws:ecs/service:Service::test-service-pulumi: 1 error occurred:

    	* error waiting for ECS service (arn:aws:ecs:eu-west-1:111222333444:service/qa-ecs-cluster/test-service-pulumi-f5a0d92) to reach steady state after creation: ResourceNotReady: exceeded wait attempts

Pulumi about:

Dependencies:
NAME            VERSION
@pulumi/awsx    1.0.1
@pulumi/pulumi  3.50.2
@types/node     14.18.35
@pulumi/aws     5.24.0

We will most likely apply the waitForSteadyState: false workaround.

sivanovnr avatar May 25 '23 11:05 sivanovnr