aws-sdk-rust
aws-sdk-rust copied to clipboard
Env vars to configure IMDS retry and timeouts
Describe the feature
Other official AWS SDK libraries support specifying AWS_METADATA_SERVICE_NUM_ATTEMPTS and AWS_METADATA_SERVICE_TIMEOUT environment variables to automatically retry IMDS requests.
This feature is currently missing in the aws-sdk-rust making it more difficult for applications to handle rare credential failures originating from IMDS requests.
Use Case
When applications are deployed in Kubernetes cluster with KIAM project intercepting and redirecting IMDS requests, it can be flaky.
Applications such as Vector will out right abort and throw-away the current operation when it encounters an IMDS credential error. It would be super beneficial to have the retry abilities built-in to the SDK itself.
Proposed Solution
Implement retry logic into the library with AWS_METADATA_SERVICE_NUM_ATTEMPTS and AWS_METADATA_SERVICE_TIMEOUT environment variables support.
Other Information
No response
Acknowledgements
- [x] I may be able to implement this feature request
- [ ] This feature might incur a breaking change
A note for the community
Community Note
- Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
- Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
- If you are interested in working on this issue, please leave a comment
Like @kevinpark1217 mentioned our calls to KIAM are proxied by KIAM, and while KIAM is in maintenance mode we're stuck on it for the time being.
It appears like the responsiveness of IMDS can depend on load, so supporting these env vars could help users increase reliability when something running the SDK is sharing an instance with a caller that is hitting IMDS hard.
I think it's possible to adjust the timeout in code, but supporting the env vars would make it easy for everyone to tweak these settings for all the downstream applications of the SDK.
It looks like the Go SDK also has this feature request: https://github.com/aws/aws-sdk-go/issues/3495
Seems like these env vars need to be standardized across the SDKs.
looks like the support across the different SDKs is tracked here https://docs.aws.amazon.com/sdkref/latest/guide/feature-ec2-instance-metadata.html
I would be interested in working on this, couple of clarifying questions to the team:
- ec2 metadata page linked above defines default max_retries should be 1 but current value is 4, should this be changed to 1 to be consistent cross SDKs?
- Does the AWS_METADATA_SERVICE_TIMEOUT represent both connection and read timeout? i.e. if set to 5, connection timeout would be set to 5 and read timeout would be set to 5?
- builder already exposes connection_timeout read_timeout functions but does not use those values anywhere, how do you see the single timeout environment configuration interacting with those two configurations?
- I'm thinking I would take on fixing this in the PR as well.
@Sigurthorb Oops, I didn't see your comment on taking on this issue. I took a stab at it with #626 PR.
should this be changed to 1 to be consistent cross SDKs?
I think I should probably change this to 1 for consistency. Let me know if anyone else has opinions on this.
My question for the reviewers, what's the standard way of handling parsing error in the AWS sdk?
- Fail outright and terminate
- Display warning, but use default values
- Use default values silently
You technically called first dibs with your checkmark to I may be able to implement this feature request and you reported it. Good luck!
@Sigurthorb Oops, I didn't see your comment on taking on this issue. I took a stab at it with #626 PR.
should this be changed to 1 to be consistent cross SDKs?
I think I should probably change this to 1 for consistency. Let me know if anyone else has opinions on this.
My question for the reviewers, what's the standard way of handling parsing error in the AWS sdk?
1. Fail outright and terminate 2. Display warning, but use default values 3. Use default values silently
When encountering parsing errors, we typically disregard the config source that produced them and move on to other config sources in the provider chain. If all configs are invalid, then we will either set a default or panic depending on the specific config variable.
@Velfi Sorry for taking a while to follow up on this. I was preparing the slides for the AWS reInvent presentation.
I have opened a new PR in the smithy-rs repo. Based on your error handling suggestion, I simplified the code a little bit. For this initial PR, I left the EndpointSource struct untouched for code readability and initial review.
I think it would be good to implement this feature, since Python SDK already supports this. But also, the IMDS retry isn't well exposed for programmatically configuring by other parts of the Rust SDK.
https://github.com/awslabs/smithy-rs/pull/1821
Also, there appears to be a regression where
self.read_timeout and self.connect_timeout fields is now unused in the imds/client.rs
Checkout here. (takes a while to auto scroll to the correct position)
Edit: Opened an issue and a PR (https://github.com/awslabs/smithy-rs/issues/1822)
I have found another way to implement timeout and retry attempts through code rather than environment variables. Please check out https://github.com/awslabs/smithy-rs/pull/1867
@kevinpark1217 sorry for the slowness, we're still working on getting the next version released.
@kevinpark1217 - Your changes to override the IMDS client have been released with release-2022-10-26.
I'm going to leave this feature request open for now to track interest in this configuration as environment variables.