TestAccWebserverComp flake
See https://github.com/pulumi/pulumi-aws/issues/3895
aws:ec2:Instance (web-server-app):
  error: 1 error occurred:
      * creating urn:pulumi:p-it-fv-az1019--webserver--0adc9e13::webserver_comp::aws:ec2/instance:Instance::web-server-app: 1 error occurred:
      * waiting for EC2 Instance (i-032319ee5436b513c) create: timeout while waiting for state to become 'running' (last state: 'pending', timeout: 10m0s)
This could indicate a need for a higher timeout, or it could be a symptom of test account health issues.
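If the timeout theory holds, the 10-minute create timeout could be raised via the customTimeouts resource option. A minimal sketch, assuming the test program is a TypeScript Pulumi program (the resource name and arguments here are placeholders for whatever the real test uses, not the actual test code):

import * as aws from "@pulumi/aws";

// Hypothetical sketch: keep the current AMI and instance type, but allow the
// instance up to 20 minutes to reach 'running' instead of the 10m default.
const webServer = new aws.ec2.Instance("web-server-app", {
    ami: "ami-7172b611",       // AMI currently used by the test
    instanceType: "t2.nano",   // instance type currently used by the test
}, {
    customTimeouts: { create: "20m" },
});

That said, a healthy instance usually reaches 'running' well within 10 minutes, so a longer timeout would mostly paper over whatever is actually slowing the boot.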
One more: https://github.com/pulumi/pulumi-aws/issues/3902
One more: https://github.com/pulumi/pulumi-aws/issues/3906
Another two occurrences: https://github.com/pulumi/pulumi-aws/issues/3925
This is way too flaky now. I'm going to dig into why it's starting to fail so often; it seems like the startup of the EC2 instance isn't working properly, because it never reaches the running state.
My hunch is that this is caused by the combination of a very small instance type (t2.nano) and a very old AMI (from 2016, deprecated since 2022, no security patches anymore...). All the test failures mention the t2.nano instance failing to become healthy, so I'm going to try switching it to a t2.micro with a more recent AMI; a sketch of that change follows the AMI details below.
~ aws ec2 describe-images \
    --image-id ami-7172b611 \
    --region us-west-2 --profile pulumi-dev-sandbox
{
    "Images": [
        {
            "Architecture": "x86_64",
            "CreationDate": "2016-06-22T09:19:44.000Z",
            "ImageId": "ami-7172b611",
            "ImageLocation": "amazon/amzn-ami-hvm-2016.03.3.x86_64-gp2",
            "ImageType": "machine",
            "Public": true,
            "OwnerId": "137112412989",
            "PlatformDetails": "Linux/UNIX",
            "UsageOperation": "RunInstances",
            "State": "available",
            "BlockDeviceMappings": [
                {
                    "DeviceName": "/dev/xvda",
                    "Ebs": {
                        "DeleteOnTermination": true,
                        "SnapshotId": "snap-d465048a",
                        "VolumeSize": 8,
                        "VolumeType": "gp2",
                        "Encrypted": false
                    }
                }
            ],
            "Description": "Amazon Linux AMI 2016.03.3 x86_64 HVM GP2",
            "EnaSupport": true,
            "Hypervisor": "xen",
            "ImageOwnerAlias": "amazon",
            "Name": "amzn-ami-hvm-2016.03.3.x86_64-gp2",
            "RootDeviceName": "/dev/xvda",
            "RootDeviceType": "ebs",
            "SriovNetSupport": "simple",
            "VirtualizationType": "hvm",
            "DeprecationTime": "2022-08-24T23:59:59.000Z"
        }
    ]
}
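A minimal sketch of the proposed change, again assuming a TypeScript test program (the name filter and resource name are illustrative, not the actual test code): look the AMI up at deploy time instead of hard-coding the 2016 image, and move from t2.nano to t2.micro.

import * as aws from "@pulumi/aws";

// Hypothetical sketch: resolve a current Amazon Linux 2023 AMI dynamically so the
// test never pins a deprecated image again, and use a slightly larger instance type.
const al2023 = aws.ec2.getAmi({
    mostRecent: true,
    owners: ["amazon"],
    filters: [
        { name: "name", values: ["al2023-ami-2023.*-x86_64"] },
        { name: "virtualization-type", values: ["hvm"] },
    ],
});

const webServer = new aws.ec2.Instance("web-server-app", {
    ami: al2023.then(ami => ami.id),
    instanceType: "t2.micro",
});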
Another reason could be that we're running out of CPU credits for the t2.nano instance. Upgrading to a slightly bigger instance should also help here. Sadly, that's all just speculation, because monitoring is disabled for those EC2 instances.
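One way to make this less speculative next time (a sketch under the same TypeScript assumption; both properties are standard aws.ec2.Instance inputs, and the resource name is a placeholder): enable detailed monitoring so instance metrics are captured at 1-minute granularity, and/or give the burstable instance unlimited CPU credits so it cannot be throttled when credits run out.

import * as aws from "@pulumi/aws";

// Hypothetical sketch: make CPU credit problems observable and/or impossible.
const webServer = new aws.ec2.Instance("web-server-app", {
    ami: "ami-7172b611",        // placeholder; see the AMI discussion above
    instanceType: "t2.micro",
    monitoring: true,                                  // detailed 1-minute CloudWatch metrics
    creditSpecification: { cpuCredits: "unlimited" },  // don't throttle when burst credits run out
});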
I upgraded the AMI of the failing test to AL2023. Let's see if this fixes the test.
It happened again: just upgrading the AMI was not enough. I'll change the test to start only a t2.micro instance.
Sadness. Thanks for looking into this. Feel free to also skip the test while we investigate since the disturbance from flakiness outweighs the coverage we get out of it.