pulumi-awsx
Timeout while waiting for state to become 'tfSTABLE' (last state: 'tfPENDING', timeout: 20m0s)
What happened?
I'm receiving this error a lot when trying to test examples locally:
Diagnostics:
pulumi:pulumi:Stack (ecs-node-p-it-antons-mac-nodejs-69324370):
error: update failed
aws:ecs:Service (my-service):
error: 1 error occurred:
* creating urn:pulumi:p-it-antons-mac-nodejs-69324370::ecs-node::awsx:ecs:FargateService$aws:ecs/service:Service::my-service: 1 error occurred:
* waiting for ECS service (arn:aws:ecs:us-west-2:616138583583:service/cluster-e6c5e93/my-service-3c9c1de) to reach steady state after creation: timeout while waiting for state to become 'tfSTABLE' (last state: 'tfPENDING', timeout: 20m0s)
Outputs:
url: "nginx-lb-f66cb5d-2145136225.us-west-2.elb.amazonaws.com"
Resources:
+ 33 created
Duration: 22m32s
This timeout happens when trying to record example baseline behavior, say for ecs/nodejs/ on AWS 5.42.0 and AWSX 1.x.x, but also when running examples on the latest versions of the dependencies.
I have seen this affect the aws:ecs/service:Service through FargateService and other component resource wrappers.
For users affected by this issue, the current workaround per @danielrbradley is to apply a transformation that increases the custom timeout for the ECS service. See https://github.com/pulumi/pulumi-awsx/pull/1118 for a fully worked out example.
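As a rough illustration of that workaround (not the code from the linked PR), the sketch below raises the create and update timeouts only for resources of type `aws:ecs/service:Service`. The local `TransformArgs` and `TransformResult` types are simplified stand-ins for the transformation shapes in `@pulumi/pulumi`, and the function name `extendEcsServiceTimeout` and the `40m` value are assumptions chosen for the example.

```typescript
// Simplified stand-ins for the shapes Pulumi passes to a transformation.
// In a real program these would come from @pulumi/pulumi; they are
// redeclared here only to keep the sketch self-contained.
interface CustomTimeouts {
    create?: string;
    update?: string;
    delete?: string;
}

interface TransformArgs {
    type: string;
    name: string;
    props: Record<string, unknown>;
    opts: { customTimeouts?: CustomTimeouts; [key: string]: unknown };
}

type TransformResult =
    | { props: Record<string, unknown>; opts: TransformArgs["opts"] }
    | undefined;

// Type token of the resource that times out in the diagnostics above.
const ECS_SERVICE_TYPE = "aws:ecs/service:Service";

// Hypothetical helper: extend create/update timeouts for ECS services only.
// The default of "40m" is an arbitrary illustrative value.
function extendEcsServiceTimeout(args: TransformArgs, timeout: string = "40m"): TransformResult {
    if (args.type !== ECS_SERVICE_TYPE) {
        return undefined; // leave every other resource untouched
    }
    return {
        props: args.props,
        opts: {
            ...args.opts,
            // Preserve any delete timeout that was already configured.
            customTimeouts: { ...args.opts.customTimeouts, create: timeout, update: timeout },
        },
    };
}
```

A function like this would typically be passed in the `transformations` resource option of the component (or registered for the whole stack). Whether it reaches the child `aws:ecs/service:Service` of an awsx component may depend on the Pulumi and awsx versions in use, so treat that as something to verify.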
Please upvote this issue if this affects your workflow, and we can consider increasing default timeouts in the AWS provider.
Example
N/A
Output of pulumi about
CLI
Version 3.86.0
Go Version go1.21.1
Go Compiler gc
Host
OS darwin
Version 14.0
Arch x86_64
Backend
Name pulumi.com
URL https://app.pulumi.com/t0yv0
User t0yv0
Organizations t0yv0, pulumi
Token type personal
Pulumi locates its logs in /var/folders/gk/cchgxh512m72f_dmkcc3d09h0000gp/T/ by default
Additional context
No response
Contributing
Vote on this issue by adding a 👍 reaction. To contribute a fix for this issue, leave a comment (and link to your pull request, if you've opened one already).
Possibly related:
https://github.com/pulumi/pulumi-awsx/issues/300 https://github.com/pulumi/pulumi-awsx/issues/391 https://github.com/pulumi/pulumi-awsx/issues/354
Following the links I've found this prior art:
https://github.com/pulumi/terraform-provider-aws/pull/59
One possibility here is to raise default timeouts again.
I've found that https://github.com/pulumi/terraform-provider-aws/pull/59 has some prior art on editing default timeouts. Perhaps we could increase the values found in https://github.com/hashicorp/terraform-provider-aws/blob/master/internal/service/ecs/service.go#L50
I'm leaving this in the tracker to accumulate upvotes. If it does, we can circle back to pulumi-aws and increase default timeouts by patching upstream. For the moment, issues with flaky tests and examples in this repository can be resolved by applying the custom timeout transformation suggested by @danielrbradley.
Looking into this further by checking in on the AWS console, I realized that this isn't really a timeout issue. The container fails to come up due to a configuration issue, and the provider gives up waiting after 20m so it looks like a timeout.
In this case, it was a CloudWatch issue, but presumably there could be other causes.
ResourceInitializationError: failed to validate logger args: create stream has been retried 7 times: failed to create Cloudwatch log stream:
RequestError: send request failed caused by: Post "https://logs.undefined.amazonaws.com/":
dial tcp: lookup logs.undefined.amazonaws.com on 172.31.0.2:53: no such host : exit status 1
We should look into detecting such issues and notifying the user promptly and correctly.
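One way such detection might work is to inspect the `stoppedReason` of stopped tasks while waiting for the service to stabilize, and give up early when the reason indicates a configuration problem. The sketch below is only an illustration under assumptions: the `StoppedTask` shape, the `findFatalStops` name, and the prefix list are hypothetical, and fetching the tasks (for example through the ECS DescribeTasks API) is left out.

```typescript
// Hypothetical minimal view of a stopped ECS task.
interface StoppedTask {
    taskArn: string;
    stoppedReason?: string;
}

// Illustrative, non-exhaustive list of reasons that will not resolve on retry.
// "ResourceInitializationError" is the one reported in this thread;
// "CannotPullContainerError" is added as a second example.
const FATAL_REASON_PREFIXES = ["ResourceInitializationError", "CannotPullContainerError"];

// Return the tasks whose stopped reason suggests waiting longer is pointless.
function findFatalStops(tasks: StoppedTask[]): StoppedTask[] {
    return tasks.filter((task) => {
        const reason = task.stoppedReason;
        if (reason === undefined) {
            return false; // no reason recorded, nothing to classify
        }
        return FATAL_REASON_PREFIXES.some((prefix) => reason.startsWith(prefix));
    });
}
```

A waiter could call this on each polling round and surface the first fatal reason to the user instead of running into the 20m timeout.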
I'm currently experiencing this issue. I have temporarily resolved it by adding the following configuration values to the container values of my taskDefinition:
const loggroup = new aws.cloudwatch.LogGroup(
  `testLoggroup`,
  {
    name: `testLoggroup`,
    retentionInDays: 7,
  }
);
logConfiguration: {
  logDriver: 'awslogs',
  options: {
    'awslogs-group': loggroup.name,
    'awslogs-region': 'us-east-1',
    'awslogs-stream-prefix': 'ecs',
  },
},
If I understand correctly (which I may not, still learning a bunch) it seems like the awsx implementation of the fargate service needs to update how it handles logConfigurations and creating log groups when no logConfiguration is provided.
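The failing endpoint earlier in this thread (`logs.undefined.amazonaws.com`) suggests the region ended up as the literal text `undefined`. A small guard could catch that before the task definition is created. The sketch below is an assumption-laden illustration, not how awsx builds its defaults: the helper name `awsLogsConfiguration` is hypothetical, and it works on plain strings rather than Pulumi outputs.

```typescript
// Shape of an awslogs log configuration, mirroring the snippet above.
interface AwsLogsConfiguration {
    logDriver: "awslogs";
    options: {
        "awslogs-group": string;
        "awslogs-region": string;
        "awslogs-stream-prefix": string;
    };
}

// Hypothetical helper: build the log configuration, refusing a missing region
// instead of letting "undefined" leak into the CloudWatch Logs endpoint.
function awsLogsConfiguration(
    group: string,
    region: string | undefined,
    streamPrefix: string = "ecs",
): AwsLogsConfiguration {
    if (!region || region === "undefined") {
        throw new Error(`awslogs-region is not set for log group "${group}"`);
    }
    return {
        logDriver: "awslogs",
        options: {
            "awslogs-group": group,
            "awslogs-region": region,
            "awslogs-stream-prefix": streamPrefix,
        },
    };
}
```

Failing at this point would turn a 20-minute wait into an immediate, readable error.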
I ran into this as well. Deployments kept timing out, but when I retried immediately the update would complete successfully almost instantly, yet the service was unavailable. I spent a good deal of time thinking it was some network config issue, but it turned out that the whole thing was caused by the task failing to start because of the missing log group issue.
I think 3 things could be improved here:
- Somehow fail fast by detecting the task/log error and reporting to user
- Somehow prevent identical deployment retry from succeeding since the service is not in fact in a healthy state
- Somehow update the awsx.ecs module with better defaults to prevent the problem (or at least document it)
Frankly, I would prioritize 1 and 2, since they really gave me a sense of "spooky action", making it difficult to reason about how Pulumi works with AWS and eventually making me consider that there was something wrong with Pulumi.
The underlying issue here is that the provider incorrectly determines the region as undefined. We can fix this by doing the following: https://github.com/pulumi/pulumi-aws-apigateway/commit/7ecbec74b91912859eaf1015232c22bc5d94d57f
This issue has been addressed in PR #1384 and shipped in release v2.16.0.