
Timeout while waiting for state to become 'tfSTABLE' (last state: 'tfPENDING', timeout: 20m0s)

Open t0yv0 opened this issue 2 years ago • 7 comments

What happened?

I'm receiving this error a lot when trying to test examples locally:


Diagnostics:
  pulumi:pulumi:Stack (ecs-node-p-it-antons-mac-nodejs-69324370):
    error: update failed

  aws:ecs:Service (my-service):
    error: 1 error occurred:
        * creating urn:pulumi:p-it-antons-mac-nodejs-69324370::ecs-node::awsx:ecs:FargateService$aws:ecs/service:Service::my-service: 1 error occurred:
        * waiting for ECS service (arn:aws:ecs:us-west-2:616138583583:service/cluster-e6c5e93/my-service-3c9c1de) to reach steady state after creation: timeout while waiting for state to become 'tfSTABLE' (last state: 'tfPENDING', timeout: 20m0s)

Outputs:
    url: "nginx-lb-f66cb5d-2145136225.us-west-2.elb.amazonaws.com"

Resources:
    + 33 created

Duration: 22m32s

This timeout happens when trying to record example baseline behavior, say for ecs/nodejs/ on AWS 5.42.0 and AWSX 1.x.x, but also when running examples on the latest versions of the dependencies.

I have seen this affect the aws:ecs/service:Service through FargateService and other component resource wrappers.

For users affected by this issue, the current workaround per @danielrbradley is to apply a transformation that increases the custom timeout for the ECS service, see https://github.com/pulumi/pulumi-awsx/pull/1118 for a fully worked out example.
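As a rough illustration of what such a transformation does (the linked PR remains the reference), here is a minimal sketch. The interfaces below are local stand-ins for the relevant parts of Pulumi's `ResourceTransformationArgs`, and the function name and the `"40m"` value are arbitrary choices made for this example.

```typescript
// Local stand-ins for the relevant parts of Pulumi's transformation types.
// In a real program these come from @pulumi/pulumi.
interface CustomTimeouts {
  create?: string;
  update?: string;
  delete?: string;
}

interface TransformArgs {
  type: string;
  name: string;
  props: Record<string, unknown>;
  opts: { customTimeouts?: CustomTimeouts; [key: string]: unknown };
}

type TransformResult =
  | { props: Record<string, unknown>; opts: TransformArgs["opts"] }
  | undefined;

// Type token of the child resource, as it appears in the URN in the error above.
const ECS_SERVICE_TYPE = "aws:ecs/service:Service";

// Raise the create/update timeouts on ECS services only.
// "40m" is an illustrative value, not a recommendation.
function extendEcsServiceTimeout(args: TransformArgs): TransformResult {
  if (args.type !== ECS_SERVICE_TYPE) {
    return undefined; // undefined leaves the resource unchanged
  }
  return {
    props: args.props,
    opts: {
      ...args.opts,
      customTimeouts: { ...args.opts.customTimeouts, create: "40m", update: "40m" },
    },
  };
}
```

In a Pulumi program a function of this shape would be passed via the `transformations` resource option on the component (for example the FargateService), so that it also applies to the child aws:ecs/service:Service.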

Please upvote this issue if this affects your workflow, and we can consider increasing default timeouts in the AWS provider.

Example

N/A

Output of pulumi about

CLI          
Version      3.86.0
Go Version   go1.21.1
Go Compiler  gc

Host     
OS       darwin
Version  14.0
Arch     x86_64

Backend        
Name           pulumi.com
URL            https://app.pulumi.com/t0yv0
User           t0yv0
Organizations  t0yv0, pulumi
Token type     personal

Pulumi locates its logs in /var/folders/gk/cchgxh512m72f_dmkcc3d09h0000gp/T/ by default

Additional context

No response

Contributing

Vote on this issue by adding a 👍 reaction. To contribute a fix for this issue, leave a comment (and link to your pull request, if you've opened one already).

t0yv0 avatar Oct 23 '23 18:10 t0yv0

Possibly related:

https://github.com/pulumi/pulumi-awsx/issues/300 https://github.com/pulumi/pulumi-awsx/issues/391 https://github.com/pulumi/pulumi-awsx/issues/354

t0yv0 avatar Oct 24 '23 01:10 t0yv0

Following the links I've found this prior art:

https://github.com/pulumi/terraform-provider-aws/pull/59

One possibility here is to raise default timeouts again.

t0yv0 avatar Oct 24 '23 14:10 t0yv0

https://github.com/pulumi/terraform-provider-aws/pull/59 has some prior art on editing default timeouts. Perhaps we could increase the values found in https://github.com/hashicorp/terraform-provider-aws/blob/master/internal/service/ecs/service.go#L50

t0yv0 avatar Oct 24 '23 14:10 t0yv0

I'm leaving this in the tracker to accumulate upvotes; if it does, we can circle back to pulumi-aws and increase the default timeouts by patching upstream. For the moment, issues with flaky tests and examples in this repository can be resolved by applying the custom timeout transformation suggested by @danielrbradley.

t0yv0 avatar Oct 24 '23 15:10 t0yv0

Looking into this further by checking in on the AWS console, I realized that this isn't really a timeout issue. The container fails to come up due to a configuration issue, and the provider gives up waiting after 20m so it looks like a timeout.

In this case, it was a CloudWatch Logs issue, but presumably there could be other causes.

ResourceInitializationError: failed to validate logger args: create stream has been retried 7 times: failed to create Cloudwatch log stream:
RequestError: send request failed caused by: Post "https://logs.undefined.amazonaws.com/": 
dial tcp: lookup logs.undefined.amazonaws.com on 172.31.0.2:53: no such host : exit status 1

We should look into detecting such issues and notifying the user promptly and correctly.
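A minimal sketch of what such detection could look like, assuming the stopped tasks of the service have already been fetched (for example via the ECS DescribeTasks API, whose `stoppedReason` field is mirrored below). The function name and the list of prefixes are assumptions made for illustration; nothing like this exists in the provider today.

```typescript
// Minimal shape of a stopped ECS task; `stoppedReason` mirrors the field of
// the same name in the ECS DescribeTasks response.
interface StoppedTask {
  taskArn: string;
  stoppedReason?: string;
}

// Reason prefixes treated as non-retryable configuration errors.
// This list is an assumption and is not exhaustive.
const FATAL_PREFIXES = ["ResourceInitializationError", "CannotPullContainerError"];

// Return the first stopped task whose reason indicates a configuration error,
// so the caller can fail immediately instead of waiting for the timeout.
function findFatalTaskFailure(tasks: StoppedTask[]): StoppedTask | undefined {
  return tasks.find((task) =>
    FATAL_PREFIXES.some((prefix) => (task.stoppedReason ?? "").startsWith(prefix))
  );
}
```

A wait loop could call this on each poll and surface the matching `stoppedReason` to the user as the error, rather than reporting a generic timeout after 20m.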

thomas11 avatar Nov 01 '23 12:11 thomas11

I'm currently experiencing this issue. I have temporarily resolved it by adding the following log configuration to the container definition in my taskDefinition:

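import * as aws from "@pulumi/aws";

// Create the log group explicitly so the awslogs driver has a valid target.
const loggroup = new aws.cloudwatch.LogGroup(
  `testLoggroup`,
  {
    name: `testLoggroup`,
    retentionInDays: 7,
  }
);

// Fragment: this goes inside the container definition of the task definition.
logConfiguration: {
  logDriver: 'awslogs',
  options: {
    'awslogs-group': loggroup.name,
    'awslogs-region': 'us-east-1',
    'awslogs-stream-prefix': 'ecs',
  },
},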

If I understand correctly (which I may not, still learning a bunch), it seems like the awsx implementation of the Fargate service needs to change how it handles logConfiguration and log group creation when no logConfiguration is provided.

quinm0 avatar Jan 09 '24 21:01 quinm0

I ran into this as well. Deployments kept timing out, but when I retried immediately the deployment would complete almost instantly, yet the service was unavailable. I spent a good deal of time thinking it was a network config issue, but it turned out the task was failing to start because of the missing log group issue.

I think 3 things could be improved here:

  1. Fail fast by detecting the task/log error and reporting it to the user
  2. Prevent an identical deployment retry from succeeding, since the service is not in fact in a healthy state
  3. Update the awsx.ecs module with better defaults to prevent the problem (or at least document it)

Frankly, I would prioritize 1 and 2, since they really gave me a sense of "spooky action", making it difficult to reason about how Pulumi works with AWS and eventually making me consider that there was something wrong with Pulumi.

lambdakris avatar Feb 02 '24 03:02 lambdakris

The underlying issue here is that the provider incorrectly determines the region as undefined. We can fix this by doing the following: https://github.com/pulumi/pulumi-aws-apigateway/commit/7ecbec74b91912859eaf1015232c22bc5d94d57f
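To illustrate the failure mode only (this is not the code from the linked commit): a hostname like logs.undefined.amazonaws.com is what an unset region produces when it is interpolated into a string. The function names below are hypothetical.

```typescript
// Naive version: an unset region is silently stringified as "undefined".
function naiveLogsEndpoint(region: string | undefined): string {
  return `https://logs.${region}.amazonaws.com/`;
}

// Guarded version: refuse to build an endpoint when the region is missing,
// so the problem surfaces at deploy time instead of at task start.
function logsEndpoint(region: string | undefined): string {
  if (!region) {
    throw new Error("AWS region is not set; cannot build the CloudWatch Logs endpoint");
  }
  return `https://logs.${region}.amazonaws.com/`;
}

console.log(naiveLogsEndpoint(undefined)); // prints "https://logs.undefined.amazonaws.com/"
console.log(logsEndpoint("us-west-2")); // prints "https://logs.us-west-2.amazonaws.com/"
```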

flostadler avatar Sep 25 '24 10:09 flostadler

This issue has been addressed in PR #1384 and shipped in release v2.16.0.

pulumi-bot avatar Sep 26 '24 09:09 pulumi-bot