
Interesting flake on TestAccDeleteBeforeCreate

Open t0yv0 opened this issue 11 months ago • 8 comments

Interesting flake on TestAccDeleteBeforeCreate:

    * Retrieving AWS account details: validating provider credentials: retrieving caller identity from STS:
      operation error STS: GetCallerIdentity, https response error
      StatusCode: 403, RequestID: 8b1a26fa-db29-43dd-b705-869eadafa74c,
      api error ExpiredToken: The security token included in the request is expired
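
For context, the failing call is the provider's up-front credential check against STS. A minimal sketch of that check using aws-sdk-go-v2 (not the provider's actual code) looks like this:

    package main

    import (
        "context"
        "fmt"
        "log"

        "github.com/aws/aws-sdk-go-v2/config"
        "github.com/aws/aws-sdk-go-v2/service/sts"
    )

    func main() {
        ctx := context.Background()
        // Resolve credentials from the environment/shared config,
        // the same chain the provider uses.
        cfg, err := config.LoadDefaultConfig(ctx)
        if err != nil {
            log.Fatal(err)
        }
        out, err := sts.NewFromConfig(cfg).GetCallerIdentity(ctx, &sts.GetCallerIdentityInput{})
        if err != nil {
            // An expired session token surfaces here as the 403 ExpiredToken above.
            log.Fatalf("credential check failed: %v", err)
        }
        fmt.Println("authenticated as", *out.Arn)
    }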

t0yv0 avatar Mar 14 '24 15:03 t0yv0

https://github.com/pulumi/pulumi-aws/issues/3655

t0yv0 avatar Mar 18 '24 20:03 t0yv0

Possibly a misconfiguration on our part? There is an SO troubleshooting topic with similar issues, but the region is set globally for our tests.

This cron job is flaking on the STS GetCallerIdentity credential validation:

    Retrieving AWS account details: validating provider credentials: retrieving caller identity from STS:
      operation error STS: GetCallerIdentity, https response error StatusCode: 403, RequestID: <redacted>,
      api error ExpiredToken: The security token included in the request is expired

The error is the same across all of the flakes.

What is interesting is that this test never flakes on master.

I deleted the EC2 instance that failed to be deleted in our last run.

I wonder if it's an expiration issue, since the Node tests take just over an hour to run. But then why does the STS validation fail only on this test, and only intermittently? Should we perhaps skip credential validation for this test?
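
For reference, pulumi-aws exposes a skipCredentialsValidation provider option that bypasses this up-front STS check. A minimal Go sketch, assuming the test could be pointed at an explicit provider (resource names here are illustrative):

    package main

    import (
        "github.com/pulumi/pulumi-aws/sdk/v6/go/aws"
        "github.com/pulumi/pulumi/sdk/v3/go/pulumi"
    )

    func main() {
        pulumi.Run(func(ctx *pulumi.Context) error {
            // Explicit provider that skips GetCallerIdentity at configure time.
            _, err := aws.NewProvider(ctx, "no-validate", &aws.ProviderArgs{
                SkipCredentialsValidation: pulumi.Bool(true),
            })
            return err
        })
    }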

guineveresaenger avatar Mar 19 '24 01:03 guineveresaenger

UPDATE: this is now flaking on pull requests as well.

guineveresaenger avatar Mar 19 '24 01:03 guineveresaenger

FWIW, this also flakes on master in https://github.com/pulumi/pulumi-aws/issues/3636. Not sure what's going on here.

t0yv0 avatar Mar 19 '24 12:03 t0yv0

We saw some flakes like this in the service, where credentials would expire during the life of the job. We were using the key rotator at the time, and futzing with that only got us so far. We eventually rolled out some OIDC magic that let the job seamlessly assume a role with long-enough-lasting credentials. @kmosher would know more.
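
Not our exact setup, but as a sketch of the assume-role half with aws-sdk-go-v2: a job can trade its base credentials for a role session with a longer duration, capped by the role's MaxSessionDuration (the role ARN below is hypothetical):

    package main

    import (
        "context"
        "log"
        "time"

        "github.com/aws/aws-sdk-go-v2/aws"
        "github.com/aws/aws-sdk-go-v2/config"
        "github.com/aws/aws-sdk-go-v2/credentials/stscreds"
        "github.com/aws/aws-sdk-go-v2/service/sts"
    )

    func main() {
        ctx := context.Background()
        base, err := config.LoadDefaultConfig(ctx)
        if err != nil {
            log.Fatal(err)
        }
        // Hypothetical role; Duration must not exceed the role's MaxSessionDuration.
        const roleARN = "arn:aws:iam::123456789012:role/ci-long-sessions"
        provider := stscreds.NewAssumeRoleProvider(sts.NewFromConfig(base), roleARN,
            func(o *stscreds.AssumeRoleOptions) {
                o.Duration = 2 * time.Hour
            })
        base.Credentials = aws.NewCredentialsCache(provider)
        _ = base // clients built from base now hold a two-hour role session
    }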

blampe avatar Mar 20 '24 19:03 blampe

Waiting on a runner is eating up most of the authenticated time:

2024-03-20T18:07:34.4133831Z Requested labels: ubuntu-latest
2024-03-20T18:07:34.4134259Z Job defined at: pulumi/pulumi-aws/.github/workflows/run-acceptance-tests.yml@refs/pull/3664/merge
2024-03-20T18:07:34.4134450Z Waiting for a runner to pick up this job...
2024-03-20T18:07:34.7221118Z Job is waiting for a hosted runner to come online.
2024-03-20T18:07:39.3277658Z Job is about to start running on the hosted runner: GitHub Actions 12 (hosted)

(the above is output from a running job after ~15 minutes)
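
One way to confirm the expiry theory would be to log how much lifetime the ambient credentials have left when the test suite starts; a sketch with aws-sdk-go-v2:

    package main

    import (
        "context"
        "log"
        "time"

        "github.com/aws/aws-sdk-go-v2/config"
    )

    func main() {
        ctx := context.Background()
        cfg, err := config.LoadDefaultConfig(ctx)
        if err != nil {
            log.Fatal(err)
        }
        creds, err := cfg.Credentials.Retrieve(ctx)
        if err != nil {
            log.Fatal(err)
        }
        if creds.CanExpire {
            // If this is shorter than the suite's ~1h runtime, mid-run
            // ExpiredToken failures are expected.
            log.Printf("credentials expire in %s", time.Until(creds.Expires))
        } else {
            log.Print("credentials do not expire")
        }
    }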

guineveresaenger avatar Mar 20 '24 19:03 guineveresaenger

Fixed via #3666

guineveresaenger avatar Mar 22 '24 16:03 guineveresaenger

https://github.com/pulumi/pulumi-aws/actions/runs/8438448044 another instance

t0yv0 avatar Mar 26 '24 17:03 t0yv0

I think https://github.com/pulumi/ci-mgmt/pull/863 ultimately fixed it by doubling the timeout window.

t0yv0 avatar Mar 29 '24 20:03 t0yv0

Cannot close issue:

  • does not have an assignee

Please fix these problems and try again.

pulumi-bot avatar Mar 29 '24 20:03 pulumi-bot