pulumi-aws
Hang on getCallerIdentity when running pulumi up on stack with s3 bucket
What happened?
When deploying a known working stack without any changes via `pulumi up`, the process hangs indefinitely during preview at:
```
I0216 13:22:00.181430 88382 log.go:71] eventSink::Debug(Registering resource: t=aws:s3/bucketObject:BucketObject, name=static-unit/dist/assets/libs/bootstrap-icons/icons-bug-fill.svg, custom=true, remote=false)
I0216 13:22:00.343043 88382 log.go:71] eventSink::Infoerr(I0216 13:22:00.342997 88452 schema.go:864] Terraform output arn = {arn:aws:iam::REDACTED:REDACTED})
I0216 13:22:00.343084 88382 log.go:71] eventSink::Infoerr(I0216 13:22:00.343009 88452 schema.go:864] Terraform output userId = {REDACTED})
I0216 13:22:00.343101 88382 log.go:71] eventSink::Infoerr(I0216 13:22:00.343011 88452 schema.go:864] Terraform output id = {REDACTED})
I0216 13:22:00.343113 88382 log.go:71] eventSink::Infoerr(I0216 13:22:00.343013 88452 schema.go:864] Terraform output accountId = {REDACTED})
I0216 13:22:00.343121 88382 log.go:71] eventSink::Infoerr(I0216 13:22:00.343018 88452 rpc.go:74] Marshaling property for RPC[tf.Provider[aws].Invoke(aws:index/getCallerIdentity:getCallerIdentity).returns]: accountId={REDACTED})
I0216 13:22:00.343132 88382 log.go:71] eventSink::Infoerr(I0216 13:22:00.343021 88452 rpc.go:74] Marshaling property for RPC[tf.Provider[aws].Invoke(aws:index/getCallerIdentity:getCallerIdentity).returns]: arn={arn:aws:iam::REDACTED:REDACTED})
I0216 13:22:00.343139 88382 log.go:71] eventSink::Infoerr(I0216 13:22:00.343023 88452 rpc.go:74] Marshaling property for RPC[tf.Provider[aws].Invoke(aws:index/getCallerIdentity:getCallerIdentity).returns]: id={REDACTED})
I0216 13:22:00.343145 88382 log.go:71] eventSink::Infoerr(I0216 13:22:00.343024 88452 rpc.go:74] Marshaling property for RPC[tf.Provider[aws].Invoke(aws:index/getCallerIdentity:getCallerIdentity).returns]: userId={REDACTED})
I0216 13:22:00.343162 88382 log.go:71] Unmarshaling property for RPC[Provider[aws, 0x140004f2eb0].Invoke(aws:index/getCallerIdentity:getCallerIdentity).returns]: accountId={REDACTED}
I0216 13:22:00.343173 88382 log.go:71] Unmarshaling property for RPC[Provider[aws, 0x140004f2eb0].Invoke(aws:index/getCallerIdentity:getCallerIdentity).returns]: arn={arn:aws:iam::REDACTED:REDACTED}
I0216 13:22:00.343179 88382 log.go:71] Unmarshaling property for RPC[Provider[aws, 0x140004f2eb0].Invoke(aws:index/getCallerIdentity:getCallerIdentity).returns]: id={REDACTED}
I0216 13:22:00.343182 88382 log.go:71] Unmarshaling property for RPC[Provider[aws, 0x140004f2eb0].Invoke(aws:index/getCallerIdentity:getCallerIdentity).returns]: userId={REDACTED}
I0216 13:22:00.343187 88382 log.go:71] Provider[aws, 0x140004f2eb0].Invoke(aws:index/getCallerIdentity:getCallerIdentity) success (#ret=4,#failures=0) success
I0216 13:22:00.343194 88382 log.go:71] Marshaling property for RPC[ResourceMonitor.Invoke(aws:index/getCallerIdentity:getCallerIdentity)]: accountId={REDACTED}
I0216 13:22:00.343198 88382 log.go:71] Marshaling property for RPC[ResourceMonitor.Invoke(aws:index/getCallerIdentity:getCallerIdentity)]: arn={arn:aws:iam::REDACTED:REDACTED}
I0216 13:22:00.343202 88382 log.go:71] Marshaling property for RPC[ResourceMonitor.Invoke(aws:index/getCallerIdentity:getCallerIdentity)]: id={REDACTED}
I0216 13:22:00.343205 88382 log.go:71] Marshaling property for RPC[ResourceMonitor.Invoke(aws:index/getCallerIdentity:getCallerIdentity)]: userId={REDACTED}
---- HANGS HERE ----
```
Expected Behavior
Expected the preview either to complete or an error message to be shown.
Steps to reproduce
- Run `pulumi up` on a known working stack
- Observe the indefinite hang
- The region is eu-north-1
Output of pulumi about
CLI
Version 3.55.0
Go Version go1.19.5
Go Compiler gc
Plugins
- aws 5.29.1
- aws 5.10.0
- command 0.5.2
- docker 3.6.1
- eks 0.42.7
- kubernetes 3.22.2
- kubernetes 3.20.2
- kubernetes-cert-manager 0.0.3
- nodejs unknown
Host
OS darwin
Version 13.1
Arch arm64
This project is written in nodejs: executable='/opt/homebrew/opt/node@16/bin/node' version='v16.19.0'
Backend
Name pulumi.com
URL https://app.pulumi.com/kimdanielarthur-alpinex
User kimdanielarthur-alpinex
Organizations kimdanielarthur-alpinex
Dependencies:
- @pulumi/command 0.5.2
- @pulumi/kubernetesx 0.1.6
- patch-package 6.5.0
- simple-sha256 1.1.0
- cdk8s-cli 2.1.63
- multimap 1.1.0
- @pulumi/awsx 0.40.1
- @pulumi/kubernetes-cert-manager 0.0.3
- @pulumi/pulumi 3.48.0
- @types/uuid 8.3.4
- requestretry 7.1.0
- @types/node 16.18.4
- axios 0.27.2
- @pulumi/aws 5.29.1
- @pulumi/eks 0.42.7
- @pulumi/kubernetes 3.22.2
- @types/multimap 1.1.2
Pulumi locates its logs in /var/folders/l0/66wv34vs4hq4lpd4b6yk60k40000gn/T/ by default
Additional context
It seems to be related to an S3 bucket. When I remove it from the configuration, the preview is able to proceed to completion.
The steps I have taken to try to get past this:
- Update pulumi to latest
- Update aws cli to latest
- Create new aws access token and aws configure
- pulumi config set aws:skipRequestingAccountId true
- pulumi config set aws:skipMetadataApiCheck true
- pulumi config set aws:skipCredentialsValidation true
- pulumi refresh <- runs to completion
- aws sts get-caller-identity <- returns as expected OK
- rm ~/.aws/credentials
- Checked that I have no conflicting env vars for AWS tokens
- pulumi logout and login
- Able to create s3 bucket directly using aws s3api through cli
- deleted and reinstalled pulumi aws plugin
- created a completely new AWS user
- export and re-import stack
- Remove s3 bucket from configuration <- preview completes fully
- If I make a new stack, Pulumi is able to create a new S3 bucket
- If I make a new S3 bucket in the same failing stack, it is not able to create it <- it still hangs indefinitely
- Validated that sts.eu-north-1.amazonaws.com resolves correctly in DNS
- Changed DNS servers to rule out a timing or timeout issue
Are there any tips for further debug or actions to get past this stuck deployment?
Contributing
Vote on this issue by adding a 👍 reaction. To contribute a fix for this issue, leave a comment (and link to your pull request, if you've opened one already).
Thanks for the detailed log and reproduction notes, very helpful!
This is reminiscent of other problems we've seen with the NodeJS SDK (and pulumi-aws is mentioned there, too): https://github.com/pulumi/pulumi/issues/12168 is the current one.
Are you able to provide a (minimal) program which shows the problem? Is it as simple as "create an S3 bucket, then call getCallerIdentity"? If so, I can try to reproduce it here.
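For reference, the shape I'd try to reproduce is roughly the following. This is an untested sketch, not the reporter's actual program: the resource names are placeholders, and the 2500 count is taken from the description later in this thread; `aws.getCallerIdentity` is the invoke that appears last in the log above.

```typescript
// Hypothetical minimal repro sketch (requires a Pulumi project and AWS
// credentials to actually run; all names here are placeholders).
import * as aws from "@pulumi/aws";

const bucket = new aws.s3.Bucket("repro-bucket");

// The failing stack reportedly had over 2500 BucketObjects; a large number
// of objects may be needed to trigger the hang.
for (let i = 0; i < 2500; i++) {
    new aws.s3.BucketObject(`repro-object-${i}`, {
        bucket: bucket.id,
        content: `object ${i}`,
    });
}

// The invoke after which the preview hangs in the log above.
export const accountId = aws.getCallerIdentity({}).then((c) => c.accountId);
```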
Thanks for the reply!
There is a similarity to my issue here. I had over 2500 s3 BucketObjects.
In terms of standalone reproducibility:
- Not able to reproduce the behaviour in a new stack; it works fine in any new/other stack I have
- This has worked for a few months, across maybe 50-100 deploys, without any issues
- It started hanging seemingly at random
In the end I decided not to manage my S3 bucket with Pulumi at all, as it seems inefficient with all that overhead just to sync some static files to an S3 bucket.
So the only way I could make my stack deployable again was to remove the s3 bucket deployment from the stack.
Sorry that I cannot help with any further debugging, but there seems to be something lurking around that behaviour. Maybe some promises that never resolve and fail to trigger callback when there are many s3 BucketObjects?
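To illustrate what I mean, here is a generic Node sketch (my own speculation, not Pulumi internals): a single promise that never settles makes an `await` hang silently with no error, while racing it against a timeout at least surfaces the hang.

```typescript
// Generic illustration: an awaited promise that never settles hangs silently;
// racing it against a timeout turns the hang into a visible error.
function withTimeout<T>(p: Promise<T>, ms: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`timed out after ${ms} ms`)), ms);
  });
  return Promise.race([p, timeout]).finally(() => clearTimeout(timer));
}

// A promise whose resolver is never called -- awaiting it directly would hang forever.
const neverSettles = new Promise<string>(() => {});

withTimeout(neverSettles, 100)
  .then((v) => console.log("resolved:", v))
  .catch((e) => console.log("caught:", e.message)); // prints "caught: timed out after 100 ms"
```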
Kim
> Sorry that I cannot help with any further debugging, but there seems to be something lurking around that behaviour. Maybe some promises that never resolve and fail to trigger callback when there are many s3 BucketObjects?
It's all more clues :-)
Since you're not using Pulumi for the S3 bucket and its objects, does that mean you're not blocked on this issue? (Knowing that will help us prioritise)
Yes, you are right: I unblocked myself by removing the S3 bucket from this stack :)
I hit the same hang, caused by some null values being set. After removing the null value settings, it works now.