
TestAccWebserverComp flake

t0yv0 opened this issue 9 months ago

See https://github.com/pulumi/pulumi-aws/issues/3895

aws:ec2:Instance (web-server-app):
  error: 1 error occurred:
  	* creating urn:pulumi:p-it-fv-az1019--webserver--0adc9e13::webserver_comp::aws:ec2/instance:Instance::web-server-app: 1 error occurred:
  	* waiting for EC2 Instance (i-032319ee5436b513c) create: timeout while waiting for state to become 'running' (last state: 'pending', timeout: 10m0s)

Could be a need for a higher timeout, or it could be a manifestation of test account health issues.
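
For reference, a minimal TypeScript sketch of how the create timeout could be raised via Pulumi's customTimeouts resource option; the AMI and instance type below are just the values mentioned in this thread, not necessarily the test's actual configuration.

import * as aws from "@pulumi/aws";

// Placeholder values taken from this thread; the real test builds its own instance.
const webServer = new aws.ec2.Instance("web-server-app", {
    ami: "ami-7172b611",
    instanceType: "t2.nano",
}, {
    // Allow up to 15 minutes for the instance to reach "running"
    // instead of the default 10 minutes seen in the error above.
    customTimeouts: { create: "15m" },
});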

t0yv0, May 07 '24 18:05

one more: https://github.com/pulumi/pulumi-aws/issues/3902

t0yv0, May 07 '24 18:05

one more: https://github.com/pulumi/pulumi-aws/issues/3906

flostadler, May 08 '24 14:05

Another two occurrences: https://github.com/pulumi/pulumi-aws/issues/3925

This is way too flaky now. I'm gonna dig into why it's started failing so often; it looks like the EC2 instance isn't starting up properly, since it never reaches the running state.

flostadler, May 11 '24 07:05

My hunch is that this is caused by the combination of a very small instance type (t2.nano) and a very old AMI (from 2016, deprecated since 2022, no security patches anymore). All the test failures mention the t2.nano instance failing to become healthy, so I'm gonna try switching it to a t2.micro with a more recent AMI.

~ aws ec2 describe-images \
    --image-id ami-7172b611 \
    --region us-west-2 --profile pulumi-dev-sandbox

{
    "Images": [
        {
            "Architecture": "x86_64",
            "CreationDate": "2016-06-22T09:19:44.000Z",
            "ImageId": "ami-7172b611",
            "ImageLocation": "amazon/amzn-ami-hvm-2016.03.3.x86_64-gp2",
            "ImageType": "machine",
            "Public": true,
            "OwnerId": "137112412989",
            "PlatformDetails": "Linux/UNIX",
            "UsageOperation": "RunInstances",
            "State": "available",
            "BlockDeviceMappings": [
                {
                    "DeviceName": "/dev/xvda",
                    "Ebs": {
                        "DeleteOnTermination": true,
                        "SnapshotId": "snap-d465048a",
                        "VolumeSize": 8,
                        "VolumeType": "gp2",
                        "Encrypted": false
                    }
                }
            ],
            "Description": "Amazon Linux AMI 2016.03.3 x86_64 HVM GP2",
            "EnaSupport": true,
            "Hypervisor": "xen",
            "ImageOwnerAlias": "amazon",
            "Name": "amzn-ami-hvm-2016.03.3.x86_64-gp2",
            "RootDeviceName": "/dev/xvda",
            "RootDeviceType": "ebs",
            "SriovNetSupport": "simple",
            "VirtualizationType": "hvm",
            "DeprecationTime": "2022-08-24T23:59:59.000Z"
        }
    ]
}

Another reason could be that we're running out of CPU credits for the t2.nano instance. Upgrading to a slightly bigger instance should also help there. Sadly, that's all just speculation, because monitoring is disabled for those EC2 instances.
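
If it would help, detailed monitoring can be enabled per instance. A minimal sketch, assuming the same placeholder AMI and instance type as above:

import * as aws from "@pulumi/aws";

const webServer = new aws.ec2.Instance("web-server-app", {
    ami: "ami-7172b611",     // placeholder from this thread
    instanceType: "t2.nano", // placeholder from this thread
    // Turn on detailed (1-minute interval) CloudWatch monitoring so
    // instance metrics can be inspected after a failed run.
    monitoring: true,
});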

flostadler, May 11 '24 07:05

I upgraded the AMI of the failing test to AL2023. Let's see if this fixes the test.
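
A sketch of how a current AL2023 image can be looked up at deploy time instead of pinning an old AMI ID; the exact filter here is an assumption, not necessarily what the test change uses.

import * as aws from "@pulumi/aws";

// Resolve the most recent Amazon-owned AL2023 x86_64 image
// rather than hard-coding a 2016-era AMI ID.
const al2023 = aws.ec2.getAmi({
    mostRecent: true,
    owners: ["amazon"],
    filters: [{ name: "name", values: ["al2023-ami-*-x86_64"] }],
});

export const amiId = al2023.then(img => img.id);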

flostadler, May 14 '24 20:05

It happened again: just upgrading the AMI was not enough. I'll change the test to start a t2.micro instance instead.
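
Roughly what that change might look like, reusing the AL2023 lookup from the earlier sketch; the names and wiring are illustrative, not the actual test code.

import * as aws from "@pulumi/aws";

const al2023 = aws.ec2.getAmi({
    mostRecent: true,
    owners: ["amazon"],
    filters: [{ name: "name", values: ["al2023-ami-*-x86_64"] }],
});

// t2.micro earns twice the CPU credits of t2.nano and has twice the memory,
// which should make it less likely to get stuck in the pending state.
const webServer = new aws.ec2.Instance("web-server-app", {
    ami: al2023.then(img => img.id),
    instanceType: "t2.micro",
});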

flostadler, May 16 '24 12:05

Sadness. Thanks for looking into this. Feel free to also skip the test while we investigate since the disturbance from flakiness outweighs the coverage we get out of it.

t0yv0, May 16 '24 13:05